Compression too good to be true

Hi guys!

We are testing restic for backing up raw disks of virtual machines. Things are going well!

We split the backups of our many projects across many different repos, and now I am trying to analyze the efficiency of compression (at the default level).
Usually I see space savings from compression of between 15% and 45%, which is good and, I think, plausible.

But we have one repo with these stats:

restic stats -r /backups/8806f10d-ada0-44d1-a87a-3e9502032b67
enter password for repository:
repository a5d6c1a1 opened (version 2, compression level auto)
scanning...
Stats in restore-size mode:
     Snapshots processed:  36
        Total File Count:  36
              Total Size:  7.891 TiB
restic stats -r /backups/8806f10d-ada0-44d1-a87a-3e9502032b67 --mode raw-data
enter password for repository:
repository a5d6c1a1 opened (version 2, compression level auto)
created new cache in /root/.cache/restic
scanning...
Stats in raw-data mode:
     Snapshots processed:  36
        Total Blob Count:  5003453
 Total Uncompressed Size:  3.119 TiB
              Total Size:  531.810 GiB
    Compression Progress:  100.00%
       Compression Ratio:  6.01x
Compression Space Saving:  83.35%

6x ratio!
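As a sanity check, the reported ratio and space saving can be recomputed from the two size figures in the output above. The sketch below only redoes that arithmetic. The deduplication factor at the end rests on the assumption that restore-size mode counts every snapshot in full while raw-data mode counts each unique blob once.

```python
# Figures copied from the two `restic stats` outputs above.
GIB_PER_TIB = 1024

restore_size_gib = 7.891 * GIB_PER_TIB   # Total Size in restore-size mode
uncompressed_gib = 3.119 * GIB_PER_TIB   # Total Uncompressed Size in raw-data mode
compressed_gib = 531.810                 # Total Size in raw-data mode

# Compression ratio and space saving, as restic reports them.
ratio = uncompressed_gib / compressed_gib
space_saving = 1 - compressed_gib / uncompressed_gib

# Assumption: restore size / unique uncompressed size approximates the
# deduplication factor across the 36 snapshots.
dedup_factor = restore_size_gib / uncompressed_gib

print(f"compression ratio: {ratio:.2f}x")
print(f"space saving:      {space_saving:.2%}")
print(f"dedup factor:      {dedup_factor:.2f}x")
```

The first two values reproduce the 6.01x and 83.35% that restic printed, so the numbers are at least internally consistent.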

My question is: is it possible that restic's CDC splits the data in such a way that each unique blob contains a run of zeros, which compression then squeezes out, pushing the compression ratio this high?

A very silly example:

byte stream:
abcde0000abcde00000

first iteration:
blob 1: abcde00
blob 2: 00abcde

second iteration:
blob 1: abcde0000
blob 2: abcde00

restic uses a minimum chunk size of 512 KiB for its CDC. If only a small part of a chunk contains data and the rest is filled with zeros, compression can easily reach such a ratio. The chunks restic creates for a virtual hard disk likely look similar to your second iteration example. That is, they usually start with some data followed by a lot of zero bytes.
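The effect described above can be illustrated with a minimal sketch. It builds one chunk of the minimum size in which only the first part holds data and the rest is zeros, then compresses it. Two assumptions are made here for illustration: the 64 KiB data portion is a made-up figure, and `zlib` from the Python standard library stands in for the zstd compression that restic actually uses.

```python
import os
import zlib

CHUNK_SIZE = 512 * 1024   # one chunk at the minimum size mentioned above
DATA_BYTES = 64 * 1024    # hypothetical: only the first 64 KiB hold real data

# Random bytes are essentially incompressible, so they model dense data.
# bytes(n) yields n zero bytes, modelling the unused tail of the chunk.
chunk = os.urandom(DATA_BYTES) + bytes(CHUNK_SIZE - DATA_BYTES)

compressed = zlib.compress(chunk)   # stand-in compressor, not restic's zstd
ratio = len(chunk) / len(compressed)

print(f"{len(chunk)} -> {len(compressed)} bytes, ratio {ratio:.1f}x")
```

Because the zero tail compresses to almost nothing, the ratio ends up close to the chunk size divided by the size of the non-zero part, so a 6x ratio would fit chunks that are mostly zeros.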

Are you able to check how much data the virtual machine thinks it has stored on its filesystem? That would give a good indication of how much data should be stored.

@MichaelEischer thanks for the answer! You confirmed my thoughts.

Unfortunately, we can only see the total number of bytes in the virtual volume. We can’t see a map of how the data is distributed, like the one the qemu qcow2 format provides, for example. I don’t have access inside the VM.
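Without access inside the VM, one rough way to estimate how much of a raw volume is empty is to scan it from the outside and count the blocks that contain only zero bytes. The sketch below assumes the raw volume can be read as a regular file or block device. The function name `zero_block_fraction` and the 4 KiB block size are choices made for this illustration, not anything restic or qemu provides, and the demo runs on a small synthetic file rather than a real disk.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # scan granularity; an arbitrary choice for this sketch


def zero_block_fraction(path: str, block_size: int = BLOCK_SIZE) -> float:
    """Return the fraction of fixed-size blocks in `path` that are all zeros."""
    total = zero = 0
    with open(path, "rb") as f:
        while block := f.read(block_size):
            total += 1
            # A block is "empty" if every byte in it is zero.
            if block.count(0) == len(block):
                zero += 1
    return zero / total if total else 0.0


# Demo on a synthetic "image": 1 block of random data followed by 3 zero blocks.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(BLOCK_SIZE) + bytes(3 * BLOCK_SIZE))
demo_fraction = zero_block_fraction(tmp.name)
os.unlink(tmp.name)

print(f"zero blocks: {demo_fraction:.0%}")
```

This is only an estimate: deleted files usually leave non-zero bytes behind, so the filesystem inside the VM may consider more space free than this scan reports.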

I think this case is a very good example of compression being an efficient addition to dedup!
I know there was some skepticism about compression before it was implemented.
