I have a set of files that I receive updated versions for from time to time. They are sent to me in the form of a rar file.
I recently started backing up these rar files using restic.
Now I (thought I) knew that deduplication in restic works best with files that are not already compressed, so I thought it would be smart to lay the groundwork for future deduplication by decompressing the rar files and backing up the contained files directly.
I just ran the first backup of a directory containing these extracted files, and everything was deduplicated as if I had always run the backups this way.
There are only two older snapshots, both containing nothing but a single rar file each.
What I expected was for the repository size to double, as I thought that restic couldn’t access the files in the rar archives and thus would be unable to deduplicate them with the ones it had direct access to.
Can restic actually look into rar archives? If so, can it also access other compressed archives?
No, restic does not “look” into any compressed archives.
Whether files are deduplicated depends only on whether restic’s deduplication algorithm can detect repeated content. Deduplication is completely agnostic of file format.
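To illustrate what "format agnostic" means here, below is a toy sketch of content-defined chunking in Python. It is not restic's actual chunker (as far as I know, restic uses a rolling hash based on Rabin fingerprints and far larger chunks); the names `chunks`, `chunk_ids`, `WINDOW` and `MASK` are made up for this example, and the "archive" is just a hypothetical container that stores the payload uncompressed between some header and trailer bytes, not a real rar file.

```python
import hashlib
import random

WINDOW = 16          # number of trailing bytes that decide a chunk boundary (toy value)
MASK = (1 << 8) - 1  # boundary roughly every 256 bytes on average (toy value)


def chunks(data: bytes):
    """Yield chunks of data, cutting wherever the hash of the last WINDOW bytes matches a pattern."""
    start = 0
    for i in range(WINDOW, len(data) + 1):
        window = data[i - WINDOW:i]
        h = int.from_bytes(hashlib.blake2b(window, digest_size=4).digest(), "big")
        if h & MASK == 0:
            # The cut depends only on nearby content, not on the absolute offset.
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]


def chunk_ids(data: bytes) -> set:
    """Return the set of SHA-256 IDs of all chunks in data."""
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}


# Pseudo-random payload standing in for a file's contents.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(20000))

# Hypothetical container: the same bytes, stored as-is, wrapped in extra data.
archive = b"HEADER-0001" + payload + b"TRAILER"

raw_ids = chunk_ids(payload)
archive_ids = chunk_ids(archive)
shared = raw_ids & archive_ids

print(f"chunks in raw file: {len(raw_ids)}")
print(f"chunks in archive:  {len(archive_ids)}")
print(f"shared chunks:      {len(shared)}")
```

Because the cut points are derived from the content itself, the chunker finds the same boundaries inside the wrapped copy, and only the chunks touching the header and trailer differ. The chunker never needs to know that one input is an "archive".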
That’s how I understood things to be. Maybe rar files are more transparent than I thought, or these ones were not actually compressed. I don’t know enough about the format right now to explain how restic managed to deduplicate files within the archive.
A really basic idea of compression: if you had a file with the string “foobar” repeated 10 times, then it could be compressed to “foobar*10” instead of writing out “foobar” 10 times.
Obviously this is a trivialised example, but thinking of compression like that, it should become more readily apparent why compressed files in general can be deduplicated.
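The “foobar” example could be sketched like this in Python. The helper names `compress_repeats` and `expand` are made up for illustration, and this is nothing like what a real compressor such as rar does internally; it only shows the idea of replacing repetition with a shorter description.

```python
def compress_repeats(text: str, unit: str) -> str:
    """Collapse text that is `unit` repeated n times into the form 'unit*n'."""
    count, remainder = divmod(len(text), len(unit))
    if remainder == 0 and count > 1 and unit * count == text:
        return f"{unit}*{count}"
    return text  # nothing to collapse, keep the text as it is


def expand(token: str) -> str:
    """Reverse compress_repeats: turn 'unit*n' back into the repeated text."""
    unit, sep, count = token.rpartition("*")
    if sep and count.isdigit():
        return unit * int(count)
    return token


original = "foobar" * 10
packed = compress_repeats(original, "foobar")
print(packed)                      # foobar*10
print(expand(packed) == original)  # True
```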
Perhaps you were thinking of encryption, which doesn’t deduplicate well?
As an aside, you might find this topic interesting (although it is for compressed tar, not rar):