Compression implementation details

C-Otto · September 1, 2022, 3:50pm

I love the compression feature and I’m curious about its inner workings.

What happens if the same file is saved both with and without compression? How does deduplication work in these two cases (first compressed, then not compressed and vice versa)? I imagine that the plaintext hash is used in all cases where the data needs to be found (or it needs to be determined if it is already in the repository).

Related question: is it possible to use different compression levels for the same plaintext and still get deduplication?

fd0 · September 2, 2022, 5:16am

Hey Carsten!

You can read about how restic stores data in the design document, but that doesn’t answer all questions, as some things are hidden in the implementation.

Exactly: restic splits a file into blobs (if it’s large enough), computes the SHA-256 hash of each blob’s plaintext and uses that as a lookup key to check whether the data is already in the repo. If it isn’t, restic uploads the blob with the current compression setting (always off for a v1 repo; off, auto or max otherwise). This lookup does not care whether or not the blob is compressed, so it is valid for a file to have some blobs stored in compressed form and others without compression.
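To make the lookup concrete, here is a minimal sketch in Python. It is not restic's code: `ToyRepo`, `save_blob` and `load_blob` are hypothetical names, the repository is just an in-memory dict, and `zlib` stands in for restic's real compressor. The only point it illustrates is that the key is the hash of the plaintext, so the compression setting never influences deduplication.

```python
import hashlib
import zlib


class ToyRepo:
    """Hypothetical in-memory stand-in for a repository (not restic's API)."""

    def __init__(self):
        # blob ID (SHA-256 of the plaintext) -> (compressed flag, stored bytes)
        self.blobs = {}

    def save_blob(self, plaintext: bytes, compression: str = "auto") -> bool:
        """Store the blob unless its ID is already known. Returns True if stored."""
        blob_id = hashlib.sha256(plaintext).hexdigest()
        if blob_id in self.blobs:
            # Already present: the stored form (compressed or not) is irrelevant.
            return False
        if compression == "off":
            self.blobs[blob_id] = (False, plaintext)
        else:
            # zlib is only a stand-in; the levels chosen here are assumptions.
            level = 9 if compression == "max" else 6
            self.blobs[blob_id] = (True, zlib.compress(plaintext, level))
        return True

    def load_blob(self, blob_id: str) -> bytes:
        compressed, stored = self.blobs[blob_id]
        return zlib.decompress(stored) if compressed else stored


repo = ToyRepo()
data = b"hello restic " * 1000
first = repo.save_blob(data, "off")   # stored uncompressed
second = repo.save_blob(data, "max")  # same plaintext, so nothing is stored
```

In this sketch the second call is a no-op even though the compression setting changed, which matches the behaviour described above: whichever form was written first stays in the repo.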

We had two ideas when writing the implementation:

The data already stored in old (v1) repositories should remain valid, so users can convert the repo to version 2 and enjoy compression without having to create a new repo (for which some users may not have the space) and download, compress and upload all data again. So the low-level storage format was extended to support compression, instead of being replaced by a new format. We had intense discussions about the approach in #3666.
The compression level can be chosen anew for each run of restic backup. When I’m on vacation in a hotel with low upstream bandwidth, I can choose to use max compression and spend lots of CPU time locally, but save on upload time. When I’m about to leave the office (with massive upstream bandwidth), I don’t care about compression, but upload time is relevant, so I can set compression to auto or even off.

Ideally, restic would only store a blob once in the repo. If it is uncompressed, you can use restic prune --repack-uncompressed to force its compression, but that’s it. If two separate instances of restic upload the same blob in parallel (restic backup can run in parallel to other instances, e.g. on two different hosts), restic prune would clean it up and remove the duplication.
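The clean-up of parallel uploads can be pictured with a toy model, again in Python. This is an assumption-laden sketch and not how restic prune is implemented: `prune_duplicates` is a hypothetical function, pack files are modelled as plain lists of `(blob_id, compressed)` tuples, and which copy survives is an arbitrary choice made here (first pack in sorted name order).

```python
import hashlib


def prune_duplicates(packs):
    """Keep each blob ID once across all packs; drop packs that end up empty.

    packs: dict mapping pack name -> list of (blob_id, compressed) tuples.
    Returns a new dict and leaves the input untouched.
    """
    seen = set()
    result = {}
    for name in sorted(packs):
        kept = []
        for blob_id, compressed in packs[name]:
            if blob_id not in seen:
                seen.add(blob_id)
                kept.append((blob_id, compressed))
        if kept:
            result[name] = kept
    return result


shared = hashlib.sha256(b"same blob on both hosts").hexdigest()
only_b = hashlib.sha256(b"only on host b").hexdigest()

# Two hosts uploaded the same blob at the same time, with different settings.
packs = {
    "pack-host-a": [(shared, True)],
    "pack-host-b": [(shared, False), (only_b, False)],
}
pruned = prune_duplicates(packs)
```

After the call, the shared blob appears in only one pack, and the blob that was unique to the second host is still there.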