Errors in blogs and documentation

Great tool and good documentation, but …

The blog post "Foundation - Introducing Content Defined Chunking (CDC)" (restic · Foundation - Introducing Content Defined Chunking (CDC)) appears to contain wrong (or at least misleading) information: the tables just above the section “Examples” give three chunk boundaries, but provide fingerprints for only three of the four chunks. Adding a (first) row for the first chunk and showing two different fingerprints would make things much clearer. The offset for the end of the file in the second table is too low by 20.

In Working with repositories / Copying snapshots between repositories (Working with repositories — restic 0.14.0 documentation) the very first usage example has been adapted to reflect the new parameters introduced with release 0.14. All three examples further down still show “--repo2” instead of “--from-repo”.

This will be fixed in the next release: Working with repositories — restic 0.14.0 documentation.

I’m not sure I understand the problem. The fingerprints in the table are those used to determine the chunk boundaries and not the SHA-256 hashes of the chunks. Obviously, the SHA-256 hash of the first chunk will change, but stay the same for the other chunks. But for the Rabin fingerprint, the whole point is that these specific fingerprints now show up at a different offset but didn’t change otherwise.

Sorry for the very late reply.

Let me try to provide the contents of the very two tables like I’d imagine them (to improve readers’ understanding):

One new first line for offset 0, because that’s what changes the most by adding 20 bytes to the file.

Offset    Fingerprint
0         0x63d84be29a200000
577536    0x77db45c60d400000
1990656   0xc0da6ed30fe00000
2945019   0x309235f507600000
4194304   End of File

After adding 20 bytes at the beginning: Fingerprint for first chunk changed (duh); all other chunk offsets and the end of file increased by 20.

Offset    Fingerprint
0         0x369a76b03c500000
577556    0x77db45c60d400000
1990676   0xc0da6ed30fe00000
2945039   0x309235f507600000
4194324   End of File

Why should there be a chunk boundary after the first byte? The chance is nearly zero that this happens for two first bytes. The fingerprints for the new 20 bytes and the following 64 bytes will be different, but afterwards all intermediate fingerprints are identical.

I mean we could extend the table to also include a few fingerprints which are not used to determine chunk boundaries. That would allow us to show how the fingerprints realign themselves after the first few bytes.
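To illustrate that realignment, here is a minimal sketch. It does not use restic's Rabin fingerprint: `blake2b` over each 64-byte window is only a stand-in, and `window_fingerprints`, the 4096-byte random input, and the seed are assumptions made up for this example. The property it shows only depends on the fingerprint being a function of the window contents.

```python
import hashlib
import random

WINDOW = 64   # window size in bytes, as mentioned above
PREFIX = 20   # number of bytes inserted at the start of the file

def window_fingerprints(data):
    """Fingerprint of every full window, indexed by the window's start offset."""
    return [
        int.from_bytes(
            hashlib.blake2b(data[i:i + WINDOW], digest_size=8).digest(), "big"
        )
        for i in range(len(data) - WINDOW + 1)
    ]

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
modified = bytes(rng.randrange(256) for _ in range(PREFIX)) + original

fp_old = window_fingerprints(original)
fp_new = window_fingerprints(modified)

# Windows starting at offsets 0..19 of the modified file contain new bytes,
# so their fingerprints are new. From offset 20 onwards every window is a
# window of the original file, just 20 positions later.
realigned = fp_new[PREFIX:] == fp_old
print(realigned)
```

Only the first 20 entries of `fp_new` have no counterpart in `fp_old`; everything after that is the old sequence shifted by 20.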


The idea that “a chunk has a fingerprint” could be throwing you off here. It’s a chunk boundary - to be more precise, the 64 bytes leading up to a boundary - that has the relevant fingerprint.

In the notation of the docs, a boundary will be present at offset N if F(the 64-byte sequence ending with byte N) has all its lowest 21 bits equal to zero. The table in the docs is showing you the values of N for which that’s the case in this example - these are the chunk boundaries.
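The "lowest 21 bits equal to zero" condition can be checked directly against the fingerprints quoted earlier in this thread. This is just a small sketch; `is_boundary` and `MASK` are names made up for the example, not restic's API.

```python
MASK = (1 << 21) - 1  # selects the lowest 21 bits

def is_boundary(fingerprint):
    """True if the lowest 21 bits of the fingerprint are all zero."""
    return fingerprint & MASK == 0

# Offsets and fingerprints as quoted in the first table above.
table = {
    577536: 0x77db45c60d400000,
    1990656: 0xc0da6ed30fe00000,
    2945019: 0x309235f507600000,
}

for offset, fp in table.items():
    print(offset, hex(fp), is_boundary(fp))

all_boundaries = all(is_boundary(fp) for fp in table.values())
```

All three values pass the check, while a fingerprint with any of its low 21 bits set (for example the first one with its last bit flipped) does not.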

So, if 20 bytes are added to the start of the file, all the existing chunk boundaries will still be valid (only they’ll occur 20 bytes later relative to the start of the file), and it’s very unlikely, though possible, that a new boundary appears within those first 20 bytes.
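The effect on the chunks themselves can be sketched as follows. This is a toy chunker under stated assumptions, not restic's implementation: `blake2b` stands in for the Rabin fingerprint, the mask uses 8 zero bits instead of 21 so that a small input gets several boundaries, there is no minimum or maximum chunk size, and `fingerprint`, `split`, and `digests` are hypothetical helper names.

```python
import hashlib
import random

WINDOW = 64           # sliding-window size in bytes
MASK = (1 << 8) - 1   # toy value: 8 zero bits instead of 21
PREFIX = 20           # bytes added at the start of the file

def fingerprint(window):
    # Stand-in for the Rabin fingerprint: any hash of the window contents.
    return int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")

def split(data):
    """Cut after each position whose preceding WINDOW bytes hit the mask."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if fingerprint(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # remainder up to end of file
    return chunks

def digests(chunks):
    return [hashlib.sha256(c).hexdigest() for c in chunks]

rng = random.Random(1)
original = bytes(rng.randrange(256) for _ in range(8192))
modified = bytes(rng.randrange(256) for _ in range(PREFIX)) + original

old = digests(split(original))
new = digests(split(modified))

# Every chunk after the first is found again unchanged in the modified file.
reused = set(old[1:]) <= set(new)
first_changed = old[0] not in new
print(len(old), len(new), reused, first_changed)
```

Only the data before the first old boundary ends up in new chunks; every later chunk keeps its SHA-256 hash and would not need to be stored again.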

Thank you for your patience with me – I can now see where I mistook the contents of the tables as fingerprints of the chunks instead of fingerprints (of a sliding window) used to identify potential chunk boundaries. That being said, would it perhaps be better to swap the two columns? After all, it’s the occurrence of a certain fingerprint pattern that defines a chunk boundary and not, as I falsely believed, the data between two offsets having any kind of fingerprint.

I would have ardently argued for 4194324 being the correct value for the “end of file” offset in the second table, but you already corrected that. Thanks again.

Swapping the two columns indeed sounds like a good idea. Do you want to create a PR for that?