I have two restic repositories (each holding about 2 TB of data) whose contents overlap heavily. Both were created independently, so they do not share the same chunker parameters.
I understand that I can copy snapshots from one repository to the other and that I will not gain deduplication between data from both repos.
Is there a way to re-chunk and deduplicate the data retrospectively? Maybe a mechanism similar to the migration to prune --repack-uncompressed?
Having been bitten by this once, I wonder whether it would be good practice to init all my repositories with --copy-chunker-params…
Technically it’s possible, but it is not implemented so far. Currently the only way to chunk something is to run the backup command. So all you could do is access your saved data via restore or mount and run a backup on it.
Actually, IMO the best solution would be to implement two cases in the copy command:
chunking parameters of source and target repo are identical => only copy missing blobs
chunking parameters of source and target do not match => rechunk all files to copy
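The two cases above could be sketched roughly like this. It is only a toy model in Python, not restic code: repositories are plain dicts, `toy_chunk` is a made-up stand-in for the real chunker, and `param` stands in for the repo's chunker polynomial. All names are assumptions for illustration.

```python
import hashlib

def toy_chunk(data, param):
    # Stand-in chunker (NOT restic's): cut wherever a param-dependent hash
    # of the bytes since the last cut is a multiple of 32.
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b + param) & 0xFFFF
        if h % 32 == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def store(repo, blob):
    # Content-addressed storage: a blob that is already present is not stored again.
    blob_id = hashlib.sha256(blob).hexdigest()
    repo["blobs"].setdefault(blob_id, blob)
    return blob_id

def backup(repo, data):
    # Chunk with the repo's own parameter and return the file's ordered blob IDs.
    return [store(repo, c) for c in toy_chunk(data, repo["param"])]

def copy_file(src, dst, blob_ids):
    if src["param"] == dst["param"]:
        # Case 1: identical parameters => only copy the blobs dst is missing.
        for bid in blob_ids:
            dst["blobs"].setdefault(bid, src["blobs"][bid])
        return list(blob_ids)
    # Case 2: parameters differ => reassemble the file and re-chunk it for dst.
    content = b"".join(src["blobs"][bid] for bid in blob_ids)
    return backup(dst, content)

data = bytes(range(256)) * 40
src = {"param": 7, "blobs": {}}
dst_same = {"param": 7, "blobs": {}}
dst_other = {"param": 11, "blobs": {}}

file_in_src = backup(src, data)
backup(dst_other, data)  # dst_other already holds the same file, chunked its own way
blobs_before = len(dst_other["blobs"])

ids_same = copy_file(src, dst_same, file_in_src)
ids_other = copy_file(src, dst_other, file_in_src)
print("new blobs in dst_other after rechunking copy:",
      len(dst_other["blobs"]) - blobs_before)
```

In this model the rechunking copy adds no new blobs to `dst_other`, because the re-chunked file lands on the blobs that repo already holds. A plain blob copy would have stored the whole file a second time.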
Moreover, there could be a check in the check --read-data command which verifies that the saved blobs are “valid chunks” with respect to the repo’s chunking parameters.
And - yes - there could also be an in-repo repair which re-chunks all files if the blobs are not “valid chunks”. This would, however, be a completely different algorithm from prune. prune works solely on the blob level, whereas this would need to work on the tree level: look for files to (possibly) re-chunk, do the re-chunk, and then save the modified tree.
Thank you very much for your reply. Re-chunking data during copy sounds like a good idea. I do not fully understand the implications, but it sounds like it would add a lot of complexity to the copy process.
I’m now following your suggestion to mount and backup the data again. This works for me with two caveats:
I now have an additional path prefix (mountpoint + snapshot prefix). Not beautiful but also not a problem for me at all.
When backing up snapshot after snapshot from the restic-mounted filesystem, I have to download the same data from the source repository many times. After three snapshots I tried downloading the source repository as a whole and mounting the local copy instead. In my case this is faster by a factor of about 5.
But I wonder how restic init decides what parameters to use.
As I understand it, the chunker parameters define the chunk sizes, which is needed for deduplication.
But I wonder why they should be different when I do a restic init.
restic uses Content Defined Chunking (CDC). The reason an irreducible polynomial is selected at random is security. If all restic repositories in existence used the same value, it could make a potential attacker's life a wee bit easier.
If you plan to copy snapshots between your own repos, just make a habit of reusing this parameter by initialising your new repos based on older ones with the --copy-chunker-params flag.
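To illustrate why differing chunker parameters prevent deduplication, here is a minimal Python sketch. It is not restic's algorithm: restic's chunker is polynomial-based, whereas this toy swaps in a gear-style rolling hash whose lookup table is derived from a hypothetical `seed` that plays the role of the per-repo parameter. Sizes and names are assumptions chosen for the demo.

```python
import hashlib
import random

MASK64 = 0xFFFFFFFFFFFFFFFF

def gear_table(seed):
    # One pseudo-random 64-bit value per byte value, determined by the seed.
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(256)]

def chunk(data, seed, mask=(1 << 10) - 1, min_size=256, max_size=8192):
    # Toy content-defined chunker: cut where the low bits of the rolling
    # hash are all zero, subject to minimum and maximum chunk sizes.
    table = gear_table(seed)
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + table[b]) & MASK64
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def blob_ids(data, seed):
    # A blob is identified by the hash of its content, so two chunkers
    # only deduplicate against each other if they cut at the same places.
    return {hashlib.sha256(c).hexdigest() for c in chunk(data, seed)}

# The same pseudo-random "file" backed up into three imaginary repositories.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(200_000))

ids_repo_a = blob_ids(data, seed=1)
ids_repo_b_same = blob_ids(data, seed=1)   # parameters copied from repo A
ids_repo_b_diff = blob_ids(data, seed=2)   # parameters chosen independently

print(len(ids_repo_a & ids_repo_b_same), "of", len(ids_repo_a),
      "blobs shared with identical parameters")
print(len(ids_repo_a & ids_repo_b_diff), "of", len(ids_repo_a),
      "blobs shared with different parameters")
```

With identical parameters every blob ID matches, so a copy would transfer nothing new. With independent parameters the cut points rarely line up, so nearly every blob is stored twice even though the underlying file is byte-for-byte the same.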