Rar content got successfully deduplicated - but why?

I have a set of files that I receive updated versions for from time to time. They are sent to me in the form of a rar file.

I recently started backing up these rar files using restic.

Now I (thought I) knew that deduplication in restic works best with files that are not already compressed, so I thought it would be smart to lay the groundwork for future deduplication by decompressing the rar files and backing up the contained files directly.

I just ran the first backup of a directory containing these extracted files, and everything was deduplicated as if I had always run the backups this way.

There are only two older snapshots, both containing nothing but a single rar file each.

What I expected was for the repository size to double, as I thought that restic couldn’t access the files in the rar archives and thus would be unable to deduplicate them with the ones it had direct access to.

Can restic actually look into rar archives? If so, can it also access other compressed archives?

No, restic does not “look” into any compressed archives.

Whether files are deduplicated depends only on whether restic’s deduplication algorithm can detect repeated content. Deduplication is completely agnostic to file format.
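To make the "format agnostic" point concrete, here is a minimal sketch of content-defined chunking. It is a toy with a simple gear-style rolling hash, not restic's actual chunker, and the names (`GEAR`, `CUT_MASK`, `chunk`, `shared_fraction`) and the header/footer bytes are made up for illustration. The idea it shows: chunk boundaries depend only on nearby bytes, so the same content produces the same chunks whether it sits in a plain file or in the middle of some container.

```python
import hashlib
import random

# Hypothetical gear table: one fixed pseudo-random 32-bit value per byte value.
GEAR = [random.Random(0).getrandbits(32) for _ in range(1)]  # placeholder, replaced below
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]

# Cut when the top 10 bits of the rolling hash are zero (~1 KiB average chunk).
CUT_MASK = 0xFFC00000


def chunk(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of the last ~32 bytes."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Older bytes are shifted out of the 32-bit state, so the hash is local.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if h & CUT_MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def shared_fraction(new: bytes, stored: bytes) -> float:
    """Fraction of `new` (by bytes) whose chunks already exist among `stored`'s chunks."""
    known = {hashlib.sha256(c).digest() for c in chunk(stored)}
    reused = sum(len(c) for c in chunk(new) if hashlib.sha256(c).digest() in known)
    return reused / len(new)


# Incompressible payload, and the same bytes wrapped in a made-up container.
_data_rng = random.Random(1)
payload = bytes(_data_rng.getrandbits(8) for _ in range(200_000))
container = b"HYPOTHETICAL-HEADER" * 5 + payload + b"HYPOTHETICAL-FOOTER"

print(f"{shared_fraction(payload, container):.0%} of the payload chunks are already stored")
```

Only the chunks touching the header and footer differ. Everything in between lines up again, which is why the wrapper around the data does not matter as long as the data itself is stored byte for byte.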

That’s how I understood things to be. Maybe rar files are more transparent than I thought, or they were uncompressed. I don’t know enough about the format right now to explain how restic managed to deduplicate files within the archive.

I’ll look further into it.

A really basic idea of compression would be: if you had a file with the string “foobar” repeated 10 times, then it could be compressed to “foobar*10” instead of writing out “foobar” 10 times.
Obviously this is a trivialised example, but thinking of compression like that, it should become more readily apparent why compressed files in general can be deduplicated.
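The “foobar*10” idea can be sketched in a few lines of Python. This is only the trivialised scheme described above, not how RAR or any real compressor works, and `compress_repeats` is a made-up name:

```python
def compress_repeats(text: str) -> str:
    """Collapse a string made of one repeated unit into 'unit*count'."""
    n = len(text)
    # Try every unit length that divides the total length, shortest first.
    for unit_len in range(1, n // 2 + 1):
        if n % unit_len == 0:
            unit = text[:unit_len]
            count = n // unit_len
            if unit * count == text:
                return f"{unit}*{count}"
    return text  # no repetition found, leave the input unchanged


print(compress_repeats("foobar" * 10))  # foobar*10
```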

Perhaps you were thinking of encryption, which doesn’t deduplicate well?

As an aside, you might find this topic interesting (although it is for compressed tar, not rar):

My little adventure there was before Restic added Zstandard compression natively, for the record. No reason to pipe through Zstd now.

I do still pipe tarballs into Restic, along with raw .img disk files, quite often. But I let Restic handle both the deduplication and compression.

To get back on topic, the only thing I can think of is that those RAR files were uncompressed?

That’s also the conclusion I came to.

It makes sense that they were left intentionally uncompressed, as the contained files are incompressible anyway.

It’s obvious in hindsight, I just never had this happen before.
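For anyone who wants to see the effect of an uncompressed ("stored") archive for themselves, here is a rough sketch. Python's standard library can't create rar files, so this uses ZIP as a stand-in, on the assumption that a stored rar behaves the same way in this respect. The helper name `make_archive` and the sample text are made up.

```python
import io
import zipfile


def make_archive(content: bytes, method: int) -> bytes:
    """Build an in-memory ZIP holding one member, using the given compression method."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("data.txt", content)
    return buf.getvalue()


# Deliberately compressible sample data so the two archives clearly differ.
content = b"restic deduplicates chunks, not files\n" * 1000

stored = make_archive(content, zipfile.ZIP_STORED)      # no compression
deflated = make_archive(content, zipfile.ZIP_DEFLATED)  # compressed

print("verbatim in stored archive:  ", content in stored)
print("verbatim in deflated archive:", content in deflated)
```

In the stored archive the member's bytes sit in the file unchanged, between a header and a trailing directory, so a content-based deduplicator sees the same data as in the extracted file. In the deflated archive they do not appear at all.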