The restic blog post "Foundation - Introducing Content Defined Chunking (CDC)" appears to contain wrong (or at least misleading) information: the tables just above the section "Examples" give three chunk boundaries, but provide fingerprints for only three of the four chunks. Adding a (first) row for the first chunk and showing two different fingerprints would make things much clearer. The offset for the end of the file in the second table is too low by 20.
In "Working with repositories / Copying snapshots between repositories" (restic 0.14.0 documentation), the very first usage example has been adapted to reflect the new parameters introduced with release 0.14. All three examples further down still show "--repo2" instead of "--from-repo".
I’m not sure I understand the problem. The fingerprints in the table are those used to determine the chunk boundaries, not the SHA-256 hashes of the chunks. Obviously, the SHA-256 hash of the first chunk will change, but the hashes of the other chunks stay the same. For the Rabin fingerprint, the whole point is that these specific fingerprints now show up at a different offset but didn’t change otherwise.
Why should there be a chunk boundary after the first byte? The chance that this happens in the first two bytes is nearly zero. The fingerprints for the new 20 bytes and the following 64 bytes will be different, but afterwards all intermediate fingerprints are identical.
I mean, we could extend the table to also include a few fingerprints which are not used to determine chunk boundaries. That would allow us to show how the fingerprints realign themselves after the first few bytes.
The idea that “a chunk has a fingerprint” could be throwing you off here. It’s a chunk boundary (to be more precise, the 64 bytes leading up to a boundary) that has the relevant fingerprint.
In the notation of the docs, a boundary will be present at offset N if F(the 64-byte sequence ending with byte N) has all its lowest 21 bits equal to zero. The table in the docs is showing you the values of N for which that’s the case in this example: these are the chunk boundaries.
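To make the rule concrete, here is a minimal Python sketch. It is not restic's implementation: it uses CRC-32 (`zlib.crc32`) as a stand-in for the Rabin fingerprint F, checks only the lowest 8 bits instead of 21 so that boundaries show up in a small buffer, and ignores any minimum or maximum chunk size a real chunker would enforce. The names `WINDOW`, `MASK_BITS`, `fingerprint` and `boundaries`, as well as the window length used, are assumptions made up for illustration.

```python
import random
import zlib

WINDOW = 64      # bytes in the sliding window (assumed for this sketch)
MASK_BITS = 8    # the docs check 21 bits; 8 keeps the demo small


def fingerprint(window):
    # Stand-in for F: CRC-32 instead of a Rabin fingerprint.
    return zlib.crc32(window)


def boundaries(data):
    # Offset N is a boundary if the fingerprint of the window
    # ending with byte N has its lowest MASK_BITS bits all zero.
    mask = (1 << MASK_BITS) - 1
    return [
        n
        for n in range(WINDOW - 1, len(data))
        if fingerprint(data[n - WINDOW + 1 : n + 1]) & mask == 0
    ]


# Deterministic pseudo-random test data.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20_000))

cuts = boundaries(data)
print(len(cuts), cuts[:5])
```

Note that nothing in `boundaries` looks at the data between two cuts as a unit; only the window ending at each offset matters.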
So, if 20 bytes are added to the start of the file, all the existing chunk boundaries will still be valid (only they’ll occur 20 bytes later relative to the start of the file), and it’s very unlikely, though possible, that a new boundary appears within those first 20 bytes.
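The shift can be checked with a small Python sketch. Again, this is a toy under stated assumptions rather than restic's code: CRC-32 stands in for the Rabin fingerprint, the mask is 8 bits instead of 21, the window length is an assumed parameter, and minimum/maximum chunk sizes are ignored. All names (`boundaries`, `chunk_hashes`, `first_clean`, and so on) are hypothetical.

```python
import hashlib
import random
import zlib

WINDOW = 64            # window length in bytes (assumed)
MASK = (1 << 8) - 1    # lowest 8 bits; the docs use 21


def boundaries(data):
    # Offsets N where the window ending with byte N has a
    # fingerprint (here: CRC-32) whose masked bits are all zero.
    return [
        n
        for n in range(WINDOW - 1, len(data))
        if zlib.crc32(data[n - WINDOW + 1 : n + 1]) & MASK == 0
    ]


def chunk_hashes(data, cuts):
    # Split after each boundary offset and hash every chunk.
    edges = [0] + [n + 1 for n in cuts] + [len(data)]
    return [
        hashlib.sha256(data[s:e]).hexdigest()
        for s, e in zip(edges, edges[1:])
    ]


rng = random.Random(1)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
prefix = bytes(rng.getrandbits(8) for _ in range(20))
modified = prefix + original

old_cuts = boundaries(original)
new_cuts = boundaries(modified)

# Windows ending at or after this offset contain no prefix bytes.
first_clean = len(prefix) + WINDOW - 1
shifted = [n - len(prefix) for n in new_cuts if n >= first_clean]
extra = [n for n in new_cuts if n < first_clean]

print(shifted == old_cuts)  # True: old boundaries reappear 20 bytes later
print(extra)                # boundaries introduced near the prefix, if any

# Only the chunk(s) touching the prefix get new SHA-256 hashes.
old_hashes = chunk_hashes(original, old_cuts)
new_hashes = chunk_hashes(modified, new_cuts)
print(old_hashes[1:] == new_hashes[len(extra) + 1 :])  # True
```

The first comparison holds for any input, because a window that contains none of the added bytes sees exactly the same data as before; only its offset changes.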
Thank you for your patience with me – I can now see where I mistook the contents of the tables for fingerprints of the chunks instead of fingerprints (of a sliding window) used to identify potential chunk boundaries. That being said, would it perhaps be better to swap the two columns? After all, it’s the occurrence of a certain fingerprint pattern that defines a chunk boundary and not, as I falsely believed, the data between two offsets having any kind of fingerprint.
I would have ardently argued for 4194324 being the correct value for the “end of file” offset in the second table, but you already corrected that. Thanks again.