Dedup only 0.3% efficient on 100% duplicate data

I’m testing out the deduplication of restic.

I generated 128MiB of data, as 16 x 8MiB files:

for i in {00..15}; do dd if=/dev/urandom of=$i bs=1M count=8; done

Then I combined those files into a single 128MiB file:

cat ?? > combined

I did a backup on the individual files, then on the combined file. The output of the combined run was:

% RESTIC_PASSWORD=X restic --verbose=4 backup --tag=test combined                                                                               1m16s | 19-07-30 16:47:59
open repository
repository 86b7fabe opened successfully, password is correct
lock repository
load index files
start scan on [combined]
start backup on [combined]
scan finished in 2.707s: 1 files, 128.000 MiB
new       /combined, saved in 26.064s (45.673 MiB added)

Files:           1 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     0 unmodified
Data Blobs:     22 new
Tree Blobs:      1 new
Added to the repo: 45.679 MiB

processed 1 files, 128.000 MiB in 0:44
snapshot 46fd9426 saved

By my calculations, if 45.679 MiB was added, then 128.000 - 45.679 = 82.321 MiB was reused.

Reusing 82.321 of 128.000 MiB is 64.31% reuse.
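The same arithmetic as a quick sanity check on the command line:

% echo 'scale=2; (128.000 - 45.679) * 100 / 128.000' | bc
64.31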


I redid the above test with 100 x 1MiB files. This time I backed up the combined file first, then the 100 individual files. This was the output:

Files:         100 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     0 unmodified
Data Blobs:    137 new
Tree Blobs:      1 new
Added to the repo: 99.734 MiB

This is only about 0.3% dedup efficiency (0.266 MiB of 100 MiB reused).

Interestingly, only one file got any deduplication:

new       /test/99, saved in 0.025s (751.531 KiB added)

This was the last of the 100 files (the first was named 00), so it corresponds to the final 1024 KiB of the combined file. All the other 99 files showed (1024.000 KiB added), i.e. no reuse at all.
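For reference, the setup for this second test was roughly the following (directory name and tag reconstructed from the output above, so treat this as a sketch rather than the exact commands):

mkdir test
for i in {00..99}; do dd if=/dev/urandom of=test/$i bs=1M count=1; done
cat test/?? > combined
RESTIC_PASSWORD=X restic --verbose=4 backup --tag=test combined
RESTIC_PASSWORD=X restic --verbose=4 backup --tag=test test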


Questions:

  • Are these ballpark expected figures?
  • Is there any way of getting these numbers closer to 100%?
  • In the 100 x 1MB test, why was the last file the only one to receive deduplication?

Restic breaks files apart into chunks using a heuristic algorithm. I believe the algorithm targets roughly 4-6MB chunks, which would explain why the 1MB test did not perform as well as the 8MB test.

I believe the targeted chunk size is a bit smaller. But regardless, as long as the files aren’t much larger than the chunk size, deduplication will perform much worse.

A (somewhat) visual explanation:

Many small files:
wzuxgwwf | qjxgozzm | igecavbj | ulsunwgu | jbtchmnp

A large file built by concatenating the above small files, then chunked by restic:
wzuxgwwfqj | xgozzmigeca | vbjulsunwg | ujbtchmnp

You can see that no blobs can be reused: the small files’ sizes (8 characters) are below the target chunk size (~10 characters), so the file boundaries terminate chunking prematurely, so to speak.
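Conversely, you should see much better reuse if you repeat the test with files that are several times larger than the chunk size, because then only the chunks straddling each file boundary fail to match. A rough sketch (the 32MiB size and the big* names are just an example; the exact reuse depends on the chunk sizes restic actually picks):

for i in {0..3}; do dd if=/dev/urandom of=big$i bs=1M count=32; done
cat big? > big-combined
restic backup big0 big1 big2 big3
restic backup big-combined

The "Added to the repo" figure for the second backup should then be a small fraction of the 128MiB total, rather than the roughly one third seen with 8MiB files.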

Thanks @cdhowie and @cfbao for the explanations, you’re right: the files in these examples are too small for restic to deduplicate efficiently. Here’s an article on how restic splits data into chunks:

https://restic.net/blog/2015-09-12/restic-foundation1-cdc

Here’s a better example. Let’s say we have 100MiB of random data:

% dd if=/dev/urandom bs=1M count=100 of=/tmp/data
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0,541764 s, 194 MB/s

Saving the file with restic makes the repo grow by roughly 101MiB (data + metadata):

% restic backup /tmp/data
repository b9852fa4 opened successfully, password is correct
created new cache in /home/fd0/.cache/restic
found 2 old cache directories in /home/fd0/.cache/restic, pass --cleanup-cache to remove them

Files:           1 new,     0 changed,     0 unmodified
Dirs:            1 new,     0 changed,     0 unmodified
Added to the repo: 100.005 MiB

processed 1 files, 100.000 MiB in 0:00
snapshot 348bb6c2 saved

% du -sh $RESTIC_REPOSITORY
101M	/tmp/restic-repo

Saving the same file again does not cause the repo to grow at all:

% restic backup /tmp/data
repository b9852fa4 opened successfully, password is correct
found 2 old cache directories in /home/fd0/.cache/restic, pass --cleanup-cache to remove them

Files:           0 new,     0 changed,     1 unmodified
Dirs:            0 new,     0 changed,     1 unmodified
Added to the repo: 0 B

processed 1 files, 100.000 MiB in 0:00
snapshot b707adf0 saved

% du -sh $RESTIC_REPOSITORY
101M	/tmp/restic-repo

Saving a file which contains the same data twice also barely increases the repo size (essentially only the chunk spanning the join between the two copies is new):

% cat /tmp/data /tmp/data > /tmp/data2

% restic backup /tmp/data2
repository b9852fa4 opened successfully, password is correct
found 2 old cache directories in /home/fd0/.cache/restic, pass --cleanup-cache to remove them

Files:           1 new,     0 changed,     0 unmodified
Dirs:            1 new,     0 changed,     0 unmodified
Added to the repo: 891.970 KiB

processed 1 files, 200.000 MiB in 0:00
snapshot 516f7980 saved

% du -sh $RESTIC_REPOSITORY
101M	/tmp/restic-repo

If we recombine the data (write the string foobar first, then append the first 50MiB of data from the first file), we can see the repo grows just a bit:

% (echo foobar; dd if=/tmp/data bs=1M count=50) > /tmp/data3
50+0 records in
50+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 0,0227359 s, 2,3 GB/s

% restic backup /tmp/data3
repository b9852fa4 opened successfully, password is correct
found 2 old cache directories in /home/fd0/.cache/restic, pass --cleanup-cache to remove them

Files:           1 new,     0 changed,     0 unmodified
Dirs:            1 new,     0 changed,     0 unmodified
Added to the repo: 1.910 MiB

processed 1 files, 50.000 MiB in 0:00
snapshot be31f1cf saved

% du -sh $RESTIC_REPOSITORY
103M	/tmp/foo

I hope this helps in understanding restic’s deduplication algorithm a bit better!
