Restic backup: storage savings ratio

ducpx · May 19, 2021, 3:03am

Hi everyone,

Recently I read that the content-defined chunking (CDC) algorithm helps save storage space by deduplicating data. So I want to know: what percentage of storage does restic save with CDC compared to without CDC?

Thanks.

nicnab · May 19, 2021, 6:49am

I guess that depends on the data, but I can give you a real-world example: one of my customers runs a Nextcloud instance with a current data folder size of 299 gigs. The size of the restic drive, including nearly a year's worth of backups, is 107 gigs.

They use a lot of presentations and media data like pictures and movies, and I’m guessing those are reused quite often in different presentations, so it seems like deduplication works quite well.
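
To put a rough number on that example, here is a minimal sketch of the arithmetic using only the two sizes quoted above. The variable names are made up for illustration, and the result compares the repository to a single plain copy of the current data, so it understates the saving relative to keeping a year of separate full copies.

```python
# Figures quoted in the post above, in GB.
source_size_gb = 299   # current Nextcloud data folder
repo_size_gb = 107     # restic repository, including almost a year of snapshots

ratio = repo_size_gb / source_size_gb   # repository size relative to the source
saved = 1 - ratio                       # share of space saved vs. one plain copy

print(f"repository / source: {ratio:.1%}")       # 35.8%
print(f"saved vs. one plain copy: {saved:.1%}")  # 64.2%
```

This is one data set only; the ratio for other data can look very different.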

ducpx · May 19, 2021, 8:25am

Yep, I understand that it depends on the data. For the kind of data that enterprises commonly need to back up, do you have any statistics on storage savings?

nicnab · May 19, 2021, 9:12am

Not me, no. Maybe other people here? I have mainly small companies as my customers.

ducpx · May 19, 2021, 9:35am

Thank you. My boss wants a specific number to show the effectiveness of the solution.

nicnab · May 19, 2021, 9:51am

I’d be careful with that. I’m sure every data set is different. But in my experience, there’s always a LOT of duplicate data.

How much data are we talking about? Maybe it’s an option to invest 100 bucks into a large USB hard disk and just run restic backup to it to know what’s cooking in your specific case?

alexweiss · May 19, 2021, 10:42am

There are two things here:

1. restic splits large files into chunks using CDC
2. each chunk (a.k.a. blob) is then saved only once; this is the actual deduplication

The CDC chunker ensures that if you insert a small piece into a large file, typically only one or two chunks will be different.

However, much more important is that files with completely identical content are always split into the same chunks. This is not only the case when using CDC, but would also be the case when splitting large files into blocks of a fixed size.
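
The difference between content-defined chunks and fixed-size blocks can be sketched with a toy example. This is not restic's chunker: restic uses a Rabin-fingerprint-based chunker with far larger chunks, while the sketch below assumes a simple gear-style rolling hash, a tiny average chunk size of about 256 bytes, and made-up names (`GEAR`, `fixed_chunks`, `cdc_chunks`, `new_blobs`). It inserts a single byte into random data and counts how many chunks would have to be stored again under each scheme.

```python
import hashlib
import random

# Hypothetical table of 256 random 32-bit values for a gear-style rolling hash.
_rng = random.Random(1)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

def fixed_chunks(data, size=256):
    """Split at fixed offsets: 0, size, 2*size, ..."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data, mask=0xFF000000):
    """Split wherever the rolling hash of the most recent bytes hits a pattern.

    The hash is kept to 32 bits and shifted left once per byte, so a cut
    decision depends only on the last 32 bytes, not on the file offset.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & mask) == 0:          # top 8 bits zero: about 1 in 256 positions
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks

def new_blobs(old_chunks, new_chunks):
    """Count chunks of the new version that were not present in the old one."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    return len({hashlib.sha256(c).digest() for c in new_chunks} - known)

data_rng = random.Random(0)
original = bytes(data_rng.getrandbits(8) for _ in range(20000))
modified = original[:5000] + b"!" + original[5000:]   # insert a single byte

new_fixed = new_blobs(fixed_chunks(original), fixed_chunks(modified))
new_cdc = new_blobs(cdc_chunks(original), cdc_chunks(modified))
print("new blobs with fixed-size blocks:", new_fixed)
print("new blobs with content-defined chunks:", new_cdc)
```

With fixed-size blocks, every block from the insertion point onward shifts by one byte and has to be stored again. With content-defined cuts, the boundaries after the insertion line up again, so only the chunks around the insertion should be new. For two identical files, both schemes produce identical chunks.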

In my experience, the fact that identical files are only stored once has the biggest impact in most real-life scenarios:

- restic always “saves” a complete snapshot referencing all files. But thanks to this deduplication, files that did not change since the last backup are not stored again (so restic makes full backups but only needs the space of an incremental backup)
- this also holds for files that are moved or renamed: those are not saved again (which makes restic superior to incremental backups)
- in real life, duplicate files always exist, be it two users having identical copies of files or programs doing their own “backup” by copying files. Those are all deduplicated.
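
The three points above can be illustrated with a minimal content-addressed store. This is a sketch under simplifying assumptions, not restic's repository format: each file is treated as a single blob identified by its SHA-256 hash, and `BlobStore`, `backup`, and the sample paths are hypothetical names.

```python
import hashlib

class BlobStore:
    """Toy repository: content is stored once per hash, snapshots only reference it."""

    def __init__(self):
        self.blobs = {}       # blob id (SHA-256 hex) -> content
        self.snapshots = []   # one {path: blob id} mapping per backup run

    def backup(self, files):
        """Record a full snapshot of `files`; return the bytes newly written."""
        written = 0
        snapshot = {}
        for path, content in files.items():
            blob_id = hashlib.sha256(content).hexdigest()
            if blob_id not in self.blobs:      # unseen content: store it
                self.blobs[blob_id] = content
                written += len(content)
            snapshot[path] = blob_id           # every file is referenced
        self.snapshots.append(snapshot)
        return written

report = b"quarterly numbers" * 100
logo = b"\x89PNG-logo" * 50

store = BlobStore()
first = store.backup({"docs/report.txt": report, "alice/logo.png": logo})

# Second run: the report was moved, Bob copied the logo, one file is new.
second = store.backup({
    "archive/report.txt": report,   # renamed, same content
    "alice/logo.png": logo,         # unchanged
    "bob/logo.png": logo,           # duplicate held by another user
    "docs/notes.txt": b"new notes",
})

print("bytes written by first backup:", first)    # 2150
print("bytes written by second backup:", second)  # 9
```

Both snapshots list every file, yet the second run only writes the one file whose content was not already in the store.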

So, CDC only comes “on top” of the file deduplication, for cases where large files get small inserts or deletes.