Dedup only 0.3% efficient on 100% duplicate data

I’m testing out the deduplication of restic.

I generated 128MiB of data, as 16 x 8MiB files:

for i in {00..15}; do dd if=/dev/urandom of=$i bs=1M count=8; done

Then I combined those files into a single 128MiB file:

cat ?? > combined

I did a backup on the individual files, then on the combined file. The output of the combined run was:

% RESTIC_PASSWORD=X restic --verbose=4 backup --tag=test combined                                                                               1m16s | 19-07-30 16:47:59
open repository
repository 86b7fabe opened successfully, password is correct
lock repository
load index files
start scan on [combined]
start backup on [combined]
scan finished in 2.707s: 1 files, 128.000 MiB
new       /combined, saved in 26.064s (45.673 MiB added)

Files:           1 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     0 unmodified
Data Blobs:     22 new
Tree Blobs:      1 new
Added to the repo: 45.679 MiB

processed 1 files, 128.000 MiB in 0:44
snapshot 46fd9426 saved

By my calculations, if 45.679 MiB was added, then 128.000 - 45.679 = 82.321 MiB was reused.

Reusing 82.321 of 128.000 MiB is 64.31% reuse.
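The same arithmetic as a quick sanity check on the command line:

% echo 'scale=2; (128.000 - 45.679) * 100 / 128.000' | bc
64.31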


I redid the above test with 100 x 1MiB files. This time I backed up the combined file first, then the 100 individual files. This was the output:

Files:         100 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     0 unmodified
Data Blobs:    137 new
Tree Blobs:      1 new
Added to the repo: 99.734 MiB

This is only about 0.3% dedup efficiency (0.266 MiB of 100 MiB reused).

Interestingly, only one file got any deduplication:

new       /test/99, saved in 0.025s (751.531 KiB added)

This was the last of the 100 files (the first was named 00), so it corresponds to the final 1024 KiB of the combined file. All the other 99 files showed (1024.000 KiB added), i.e. no reuse at all.
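For reference, the setup for this second test was roughly the following (directory name and tag reconstructed from the output above, so treat this as a sketch rather than the exact commands):

mkdir test
for i in {00..99}; do dd if=/dev/urandom of=test/$i bs=1M count=1; done
cat test/?? > combined
RESTIC_PASSWORD=X restic --verbose=4 backup --tag=test combined
RESTIC_PASSWORD=X restic --verbose=4 backup --tag=test test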


Questions:

  • Are these ballpark expected figures?
  • Is there any way of getting these numbers closer to 100%?
  • In the 100 x 1MB test, why was the last file the only one to receive deduplication?

Restic breaks files apart into chunks using a heuristic algorithm. I believe the algorithm targets roughly 4-6MB chunks, which would explain why the 1MB test did not perform as well as the 8MB test.

I believe the targeted chunk size is a bit smaller. But regardless, as long as the files aren’t much larger than the chunk size, deduplication will perform much worse.

A (somewhat) visual explanation:

Many small files:
wzuxgwwf | qjxgozzm | igecavbj | ulsunwgu | jbtchmnp

A large file built by concatenating the above small files, then chunked by restic:
wzuxgwwfqj | xgozzmigeca | vbjulsunwg | ujbtchmnp

You can see that no blobs can be reused: the small files’ sizes (8 characters) are below the target chunk size (~10 characters), so the file boundaries terminate chunking prematurely, so to speak.
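Conversely, you should see much better reuse if you repeat the test with files that are several times larger than the chunk size, because then only the chunks straddling each file boundary fail to match. A rough sketch (the 32MiB size and the big* names are just an example; the exact reuse depends on the chunk sizes restic actually picks):

for i in {0..3}; do dd if=/dev/urandom of=big$i bs=1M count=32; done
cat big? > big-combined
restic backup big0 big1 big2 big3
restic backup big-combined

The "Added to the repo" figure for the second backup should then be a small fraction of the 128MiB total, rather than the roughly one third seen with 8MiB files.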

Thanks @cdhowie and @cfbao for the explanations, you’re right: the files in these examples are too small for restic to deduplicate efficiently. Here’s an article on how restic splits data into chunks:

https://restic.net/blog/2015-09-12/restic-foundation1-cdc

Here’s a better example. Let’s say we have 100MiB of random data:

% dd if=/dev/urandom bs=1M count=100 of=/tmp/data
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0,541764 s, 194 MB/s

Saving the file with restic makes the repo grow by roughly 101MiB (data + metadata):

% restic backup /tmp/data
repository b9852fa4 opened successfully, password is correct
created new cache in /home/fd0/.cache/restic
found 2 old cache directories in /home/fd0/.cache/restic, pass --cleanup-cache to remove them

Files:           1 new,     0 changed,     0 unmodified
Dirs:            1 new,     0 changed,     0 unmodified
Added to the repo: 100.005 MiB

processed 1 files, 100.000 MiB in 0:00
snapshot 348bb6c2 saved

% du -sh $RESTIC_REPOSITORY
101M	/tmp/restic-repo

Saving the same file again does not cause the repo to grow at all:

% restic backup /tmp/data
repository b9852fa4 opened successfully, password is correct
found 2 old cache directories in /home/fd0/.cache/restic, pass --cleanup-cache to remove them

Files:           0 new,     0 changed,     1 unmodified
Dirs:            0 new,     0 changed,     1 unmodified
Added to the repo: 0 B

processed 1 files, 100.000 MiB in 0:00
snapshot b707adf0 saved

% du -sh $RESTIC_REPOSITORY
101M	/tmp/restic-repo

Saving a file which contains the same data twice also barely increases the repo size (essentially only the chunk spanning the join between the two copies is new):

% cat /tmp/data /tmp/data > /tmp/data2

% restic backup /tmp/data2
repository b9852fa4 opened successfully, password is correct
found 2 old cache directories in /home/fd0/.cache/restic, pass --cleanup-cache to remove them

Files:           1 new,     0 changed,     0 unmodified
Dirs:            1 new,     0 changed,     0 unmodified
Added to the repo: 891.970 KiB

processed 1 files, 200.000 MiB in 0:00
snapshot 516f7980 saved

% du -sh $RESTIC_REPOSITORY
101M	/tmp/restic-repo

If we recombine the data (write the string foobar first, then append the first 50MiB of data from the first file), we can see the repo grows just a bit:

% (echo foobar; dd if=/tmp/data bs=1M count=50) > /tmp/data3
50+0 records in
50+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 0,0227359 s, 2,3 GB/s

% restic backup /tmp/data3
repository b9852fa4 opened successfully, password is correct
found 2 old cache directories in /home/fd0/.cache/restic, pass --cleanup-cache to remove them

Files:           1 new,     0 changed,     0 unmodified
Dirs:            1 new,     0 changed,     0 unmodified
Added to the repo: 1.910 MiB

processed 1 files, 50.000 MiB in 0:00
snapshot be31f1cf saved

% du -sh $RESTIC_REPOSITORY
103M	/tmp/foo

I hope this helps in understanding restic’s deduplication algorithm a bit better!
