Question about compression

I noticed that compression keeps coming up as a topic for restic. As far as I understand, fd0 confirmed in 2019 that restic will support compression at some point, but for now it's still on the to-do list.

So, if compression is not available, how come the KVM qcow2 file I backed up, with an original size of 70G, only takes up some 20G in the repository? I understand that I allocated a lot more space to the VM than it currently uses, and that the file therefore consists mostly of zeroes. Still, without compression, how is restic able to reduce the snapshot size?

Restic only adds those parts of the files it backs up that have changed since the previous backup. This is called deduplication.

Imagine you have a large file and then change only one tenth of it. On the next backup, restic will upload and save that changed tenth, and for the rest of the file simply reference the nine tenths that were already uploaded.
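The idea can be sketched in a few lines of Python. This is a deliberately simplified model using fixed-size chunks and an in-memory set of chunk IDs; restic actually uses variable-size content-defined chunks, but the dedup logic is the same: a chunk is only uploaded if its content hash is not already in the repository.

```python
import hashlib

def chunk_ids(data, chunk_size=1024):
    # Split data into fixed-size chunks and identify each by its SHA-256 hash.
    # (Illustrative only: restic uses variable-size content-defined chunks.)
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

original = bytes(10 * 1024)        # a 10-chunk file, all zeroes
modified = bytearray(original)
modified[0] = 1                    # change one byte in the first chunk

stored = set(chunk_ids(original))  # IDs already in the "repository"
new_ids = chunk_ids(bytes(modified))

# Only chunks whose ID is not already stored need to be uploaded.
new_chunks = [cid for cid in new_ids if cid not in stored]
print(len(new_chunks))             # 1 — only the changed chunk is new
```

The other nine chunks are referenced by ID instead of being stored again.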

Hi @rawtaz,

Thanks for your answer. I do know about restic’s deduplication feature. But this was the first backup made. The size of the qcow2 file is 70GB, the repository size (not some random snapshot size) is 20GB.

Deduplication is done at the chunk level, even on the first run. If your qcow2 file contains a lot of similar structures (or 'nothing', in the case of free space inside), these will be deduplicated.
Since you said there is a lot of free space inside the file, it all sounds valid to me.
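To illustrate why a mostly-empty file shrinks so much even on the first backup: identical chunks within a single file are stored only once. The sketch below uses a hypothetical in-memory dict as the "repository" and fixed-size chunks for simplicity, not restic's actual storage format.

```python
import hashlib

CHUNK = 1024

def store(data, repo):
    # Store each chunk under its content hash; identical chunks
    # collapse into a single entry, even within one file.
    for i in range(0, len(data), CHUNK):
        piece = data[i:i + CHUNK]
        repo[hashlib.sha256(piece).hexdigest()] = piece

repo = {}
sparse_file = bytes(70 * CHUNK)   # 70 chunks, all zeroes (like free qcow2 space)
store(sparse_file, repo)
print(len(repo))                  # 1 — all 70 identical chunks share one blob
```

A qcow2 image whose free space reads back as zeroes behaves just like this: all the zero-filled chunks reduce to one stored blob plus references.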

@betatester77 That was my assumption, I just couldn’t find any clear answer on that in the documentation. Do you have an authoritative source you can reference?

@vic-t Check the link in my first reply. That's the most authoritative source you can get regarding restic applying deduplication. As @betatester77 wrote, if your file contains a lot of zeroes, they will deduplicate extremely well.


@rawtaz Honestly, I don’t see it. Here is the text you keep referring to:

For creating a backup, restic scans the source directory for all files, sub-directories and other entries. The data from each file is split into variable-length Blobs cut at offsets defined by a sliding window of 64 bytes. The implementation uses Rabin Fingerprints for implementing this Content Defined Chunking (CDC). An irreducible polynomial is selected at random and saved in the file config when a repository is initialized, so that watermark attacks are much harder.

Files smaller than 512 KiB are not split; Blobs are between 512 KiB and 8 MiB in size. The implementation aims for 1 MiB Blob size on average.

For modified files, only modified Blobs have to be saved in a subsequent backup. This even works if bytes are inserted or removed at arbitrary positions within the file.

I would never have guessed that this is supposed to mean deduplication already happens within a single file during the first backup run. My first instinct would be to shout “English, please”. :wink: I'm glad it works, at any rate. Thanks to you both for the discussion.
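For anyone else puzzling over that paragraph, here is a toy content-defined chunker showing the property it describes: chunk boundaries depend only on local content, so inserting bytes in the middle of a file only changes the chunks near the insertion point. This uses a hash of a small trailing window as a stand-in for restic's Rabin fingerprint; the window size, mask, and chunk sizes are made up for the demo and are much smaller than restic's.

```python
import hashlib
import random

WINDOW = 16   # toy window (restic's is 64 bytes)
MASK = 0x1F   # cut where the window hash's low 5 bits are zero

def chunks(data):
    # Cut at positions where the hash of the trailing WINDOW bytes
    # matches the boundary mask; boundaries depend only on local content.
    out, start = [], 0
    for i in range(WINDOW, len(data)):
        h = int.from_bytes(hashlib.sha256(data[i - WINDOW:i]).digest()[:4], "big")
        if i - start >= WINDOW and (h & MASK) == 0:
            out.append(data[start:i])
            start = i
    out.append(data[start:])
    return out

random.seed(0)
data = bytes(random.randrange(256) for _ in range(4000))
edited = data[:2000] + b"INSERTED" + data[2000:]   # insert bytes mid-file

stored = {hashlib.sha256(c).hexdigest() for c in chunks(data)}
new = chunks(edited)
changed = [c for c in new if hashlib.sha256(c).hexdigest() not in stored]
print(len(changed), "of", len(new), "chunks changed")
```

After the insertion, the chunk boundaries downstream re-align to the same content (just shifted by eight bytes), so nearly all chunks deduplicate against the previous backup. With fixed-size chunks, every chunk after the insertion would have shifted and been re-uploaded.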

That's fine. It just gives you the technical details of what we already told you above :slight_smile:

Basically, dedup is just compression on the macro scale :wink:
