Is it possible to re-chunk after a restic copy?

qyz · March 18, 2023, 7:59am

Hi everyone,

I have two restic repositories (each holding about 2 TB of data) whose contents overlap heavily. Both were created independently, so they do not share the same chunker parameters.

I understand that I can copy snapshots from one repository to the other and that I will not gain deduplication between data from both repos.

Is there a way to re-chunk and deduplicate the data retrospectively? Maybe a mechanism similar to the migration to prune --repack-uncompressed?

Having been bitten by this once, I wonder whether it would be good practice to init all my repositories with --copy-chunker-params…

Cheers,
qyz

alexweiss · March 19, 2023, 5:33am

Technically it’s possible, but it is not implemented so far. Currently, the only way to chunk something is to run the backup command. So all you could do is access your saved data via restore or mount and run a backup on it.

Actually, IMO the best solution would be to implement two cases in the copy command:

chunking parameters of source and target repo are identical => only copy missing blobs
chunking parameters of source and target do not match => rechunk all files to copy
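The two cases above can be sketched with a toy in-memory model. This is only an illustration of the proposal, not restic code: `ToyRepo`, `toy_chunk` and `copy_file` are hypothetical names, and the one-byte cut rule is a stand-in for restic's real chunker.

```python
import hashlib


def toy_chunk(data, param):
    """Toy content-defined chunker: cut after every byte b with (b ^ param) % 8 == 0.

    Only a stand-in for a real chunker; `param` plays the role of the
    per-repository chunker parameter.
    """
    chunks, start = [], 0
    for i, b in enumerate(data):
        if (b ^ param) % 8 == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ToyRepo:
    """Hypothetical in-memory repository: a chunker parameter plus a blob store."""

    def __init__(self, param):
        self.param = param
        self.blobs = {}  # blob ID (SHA-256 hex digest) -> blob content

    def save_blob(self, blob):
        blob_id = hashlib.sha256(blob).hexdigest()
        self.blobs.setdefault(blob_id, blob)
        return blob_id

    def save_file(self, data):
        """Chunk data with this repo's parameter and return the file's blob IDs."""
        return [self.save_blob(c) for c in toy_chunk(data, self.param)]


def copy_file(src, dst, blob_ids):
    """Copy one file (a list of blob IDs in src) into dst; return its blob IDs in dst."""
    if src.param == dst.param:
        # Case 1: identical parameters, so only blobs missing in dst are copied.
        for blob_id in blob_ids:
            if blob_id not in dst.blobs:
                dst.blobs[blob_id] = src.blobs[blob_id]
        return list(blob_ids)
    # Case 2: parameters differ, so reassemble the file and re-chunk it for dst.
    data = b"".join(src.blobs[blob_id] for blob_id in blob_ids)
    return dst.save_file(data)


data = bytes(range(256)) * 4
src, dst = ToyRepo(3), ToyRepo(5)
dst.save_file(data)               # dst already holds the same file, chunked its own way
blobs_before = len(dst.blobs)
src_ids = src.save_file(data)
dst_ids = copy_file(src, dst, src_ids)
print("new blobs added to dst:", len(dst.blobs) - blobs_before)
```

Because the file is re-chunked with the target's parameter, the copy produces exactly the blobs the target already has, so nothing new needs to be stored in this toy setup.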

Moreover, the check --read-data command could verify that the saved blobs are “valid chunks” with respect to the repo’s chunking parameters.
And, yes, there could also be an in-repo repair which re-chunks all files whose blobs are not “valid chunks”. This would, however, be a completely different algorithm compared to prune. prune works solely on the blob level, whereas this would need to work on the tree level: look for files to (possibly) re-chunk, do the re-chunk and then save the modified tree.

qyz · March 21, 2023, 12:27pm

Hi @alexweiss ,

thank you very much for your reply. Re-chunking data during copy sounds like a good idea. I do not fully understand the implications, but it sounds like it would add a lot of complexity to the copy process.

I’m now following your suggestion to mount and backup the data again. This works for me with two caveats:

I now have an additional path prefix (mountpoint + snapshot prefix). Not beautiful but also not a problem for me at all.
When backing up snapshot after snapshot from the restic-mounted filesystem, I have to download the same data many times from the source repository. After three snapshots I downloaded the source repository as a whole and mounted the local copy instead. This is faster by a factor of about 5 in my case.

Cheers,
qyz

AlBundy · March 21, 2023, 8:26pm

According to the documentation
https://restic.readthedocs.io/en/latest/045_working_with_repos.html#ensuring-deduplication-for-copied-snapshots
it’s not possible to change the chunker parameters of an existing repo.

But I wonder how restic init decides what parameters to use.
As I understand it, the chunker parameters define the chunk sizes, which is needed for deduplication.

But I wonder why they should be different each time I do a restic init.

RYTD29 · February 26, 2024, 10:59am

Good question

kapitainsky · February 26, 2024, 12:33pm

restic uses Content Defined Chunking (CDC). The reason an irreducible polynomial is selected at random for each repository is security. If all restic repositories in existence used the same value, it could make the life of a potential attacker a wee bit easier.

If you plan to copy snapshots between your own repos, then just make a habit of reusing this parameter by initialising your new repos based on older ones with the --copy-chunker-params flag.
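To see why a different parameter defeats deduplication between repositories, here is a toy sketch in Python. It is not restic's chunker: the rolling hash, the `WINDOW`, `MASK` and `MOD` constants and the `blob_ids` helper are all made up for illustration, and a random integer merely stands in for the random polynomial.

```python
import hashlib
import random

WINDOW = 16          # bytes covered by the rolling hash
MASK = (1 << 8) - 1  # cut where the low 8 bits of the hash are zero
MOD = (1 << 61) - 1  # modulus of the toy rolling hash


def blob_ids(data, param):
    """Chunk data with a toy rolling hash keyed by param; return the set of SHA-256 blob IDs."""
    base = param | 1
    top = pow(base, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    ids, start, h = set(), 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * base + b) % MOD                    # add the newest byte
        # Cut only on a hash match, and never below WINDOW bytes per chunk.
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            ids.add(hashlib.sha256(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        ids.add(hashlib.sha256(data[start:]).hexdigest())
    return ids


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(1 << 16))  # 64 KiB of sample file content
param_a = rng.getrandbits(53)  # stand-in for repo A's random parameter
param_b = rng.getrandbits(53)  # repo B, initialised independently

ids_a = blob_ids(data, param_a)
ids_b = blob_ids(data, param_b)
ids_c = blob_ids(data, param_a)  # repo C reuses A's parameter

print("blobs shared by A and B:", len(ids_a & ids_b), "of", len(ids_a))
print("blobs shared by A and C:", len(ids_a & ids_c), "of", len(ids_a))
```

The same file is cut at different positions in A and B, so the two repositories should share few or no blob IDs, whereas C, which reuses A's parameter, produces exactly the same blobs as A.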

You can read more here:

https://github.com/restic/chunker/blob/ac4c622f4b0836283d3b06c06b2ea87a976c7ca6/doc.go#L62-L79
      
          Background Literature
          
          An introduction to Rabin Fingerprints/Checksums can be found in the following articles:
          
          Michael O. Rabin (1981): "Fingerprinting by Random Polynomials"
          http://www.xmailserver.org/rabin.pdf
          
          Ross N. Williams (1993): "A Painless Guide to CRC Error Detection Algorithms"
          http://www.zlib.net/crc_v3.txt
          
          Andrei Z. Broder (1993): "Some Applications of Rabin's Fingerprinting Method"
          http://www.xmailserver.org/rabin_apps.pdf
          
          Shuhong Gao and Daniel Panario (1997): "Tests and Constructions of Irreducible Polynomials over Finite Fields"
          http://www.math.clemson.edu/~sgao/papers/GP97a.pdf
          
          Andrew Kadatch, Bob Jenkins (2007): "Everything we know about CRC but afraid to forget"
          http://crcutil.googlecode.com/files/crc-doc.1.0.pdf