Why am I seeing such poor dedup'ing?

I’ve been in the process of finding a backup program, and the three that caught my eye are Bup, Borg, and Restic. Of the three, I’m generally most impressed with Restic for its clean, easy-to-read code base, feature set, and UI. However, I’ve found the deduplication (to say nothing of the lack of compression, although that should change with 0.14) to be lacking.

I’ve mostly been testing on three data sets: my work (15GB of highly redundant, easily compressed data in files ranging from a few kB to several MB), my Steam compatdata directory (a mix of all kinds of files, since I also have non-Steam games installed there), and /usr/bin.

The most relevant for me is my work data. Compared to Bup (with no compression), Restic needs three times the space. In comparison to Borg (also uncompressed), Restic needs twice the space. These tests also made use of Btrfs snapshots to simulate regular use.

The difference is less marked for the compatdata directory, and minimal (but still there) for /usr/bin. The latter really isn’t a good benchmark, but it’s a directory most of us have and which is presumably similar among systems.

At first I thought the cause stemmed from the chunk size; however, changing the chunk size to match Bup’s (Could changing the hard-coded average chunking size break things? - #2 by alexweiss) didn’t make much of a difference. What’s more, the average chunk size used by Borg is twice that of Restic’s.

So I’m not sure what’s going on, or if I’m doing something wrong.

Is there something I can do to improve the efficiency of Restic’s deduplication? I otherwise prefer it. Of course, if the space difference means I can’t fit everything in free storage space (B2’s free 10 GB, for instance), then it doesn’t matter how much I like the program.

The script I used to test /usr/bin is:

#!/usr/bin/julia -O0

repos = homedir()*"/.usrtest"
target = "/usr/bin"

isdir(repos) && rm(repos, recursive=true)
mkdir(repos)

### Functions with reused stuff for convenience.
bup(stuff) = setenv(`bup $(split(stuff))`,["BUP_DIR="*repos*"/1bup"])
borg(commands) = setenv(`borg $(split(commands))`,["BORG_REPO="*repos*"/2borg"])
restic(commands) = `restic --password-command="echo 1" -r $repos/3restic $(split(commands))`


### Initiate repos
run(bup("init"))
run(borg("init -e none"))
run(restic("init"))

### Generate a file index
err = IOBuffer()
run(pipeline(ignorestatus(bup("index $target")),stderr = err))
regx = r"Permission denied: '(.*)'"
bad = first.(getfield.(eachmatch(regx, read(seekstart(err),String)), :captures))
index = join(filter(x -> ! any(contains.(bad,x)),
	split(chomp(read(bup("index -p $target"),String)),'\n')
),' ')

# ignorestatus() is needed so the script doesn't stop when `bup index` tries
# reading something it's not supposed to. This can then be used as the target
# for the others so all programs are working on the same data.

### back stuff up using no compression
run(bup("save -0 -c $index"))
run(borg("create -sp -C none ::test2 $index"))
run(restic("backup -v $index"))

### Let's find out how much space they use.
run(`sync $repos`)
run(`du -s $(readdir(repos,join=true)) $target`);

Thanks for the structured question. Can you please add some actual numbers though? Also, the output from your du and the backup run(s) would be nice. I’d also be curious to see the du for the folders inside the restic repository, and please include the restic version as well. Also, what filesystem is your repository located on?

Like @rawtaz said, some numbers would be good.

I’ll still throw out a theory though :grimacing: If you have a lot of small files (around 512 KiB or less), they don’t get split, so unless there is a blob matching exactly, there’s no dedup (if I understand the docs correctly).
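
Roughly, the way I read it (a toy illustration only, not restic’s actual code):

using SHA   # Julia stdlib

const MIN_SIZE = 512 * 1024            # restic's default chunker MinSize (512 KiB)

# A file at or below MinSize is never split: it becomes exactly one blob,
# keyed by a hash of its whole content, so it can only dedup against a
# byte-for-byte identical file somewhere else in the backup set.
is_single_blob(path) = filesize(path) <= MIN_SIZE
whole_file_id(path)  = bytes2hex(sha256(read(path)))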


Sorry for the delay. I got sidetracked and wanted to be thorough.

$ restic version
restic 0.13.1 compiled with go1.18.1 on linux/amd64
$ restic-13 version
restic 0.13.1 (v0.13.0-359-gf0bb4f87-dirty) compiled with go1.19 on linux/amd64

restic is from the Arch Linux repo and restic-13 was compiled by me with go run build.go. The (dirty) changes just point it at the copy of the chunker repo on my system, which I modified to produce smaller chunks. The patch for that is below.

$ git diff -U1 932cc2^
diff --git a/chunker.go b/chunker.go
index 6676eba..46f8827 100644
--- a/chunker.go
+++ b/chunker.go
@@ -16,5 +16,5 @@ const (
     // MinSize is the default minimal size of a chunk.
-    MinSize = 512 * kiB
+    MinSize = 512 // * kiB
     // MaxSize is the default maximal size of a chunk.
-    MaxSize = 8 * miB
+    MaxSize = 16 * kiB // 8 * miB
 
@@ -107,3 +107,3 @@ func NewWithBoundaries(rd io.Reader, pol Pol, min, max uint) *Chunker {
             MaxSize:   max,
-            splitmask: (1 << 20) - 1, // aim to create chunks of 20 bits or about 1MiB on average.
+            splitmask: (1 << 13) - 1, // aim to create chunks of 13 bits or about 8 KiB on average.
         },
@@ -133,3 +133,3 @@ func (c *Chunker) ResetWithBoundaries(rd io.Reader, pol Pol, min, max uint) {
             MaxSize:   max,
-            splitmask: (1 << 20) - 1,
+            splitmask: (1 << 13) - 1,
         },

The filesystem used is Btrfs. I do have LZO compression on, but that should be transparent to Restic, yeah?

$ du -sh ~/.test-dedup/*
263M	/home/nstgc5/.test-dedup/1bup
581M	/home/nstgc5/.test-dedup/2borg
937M	/home/nstgc5/.test-dedup/3restic
$ du -sh ~/.test-dedup/*/*
0	/home/nstgc5/.test-dedup/1bup/branches
1.9M	/home/nstgc5/.test-dedup/1bup/bupindex
16M	/home/nstgc5/.test-dedup/1bup/bupindex.meta
4.0K	/home/nstgc5/.test-dedup/1bup/config
4.0K	/home/nstgc5/.test-dedup/1bup/description
4.0K	/home/nstgc5/.test-dedup/1bup/HEAD
60K	/home/nstgc5/.test-dedup/1bup/hooks
4.0K	/home/nstgc5/.test-dedup/1bup/info
1.6M	/home/nstgc5/.test-dedup/1bup/logs
243M	/home/nstgc5/.test-dedup/1bup/objects
1.6M	/home/nstgc5/.test-dedup/1bup/refs
4.0K	/home/nstgc5/.test-dedup/2borg/config
580M	/home/nstgc5/.test-dedup/2borg/data
12K	/home/nstgc5/.test-dedup/2borg/hints.1613
1.3M	/home/nstgc5/.test-dedup/2borg/index.1613
4.0K	/home/nstgc5/.test-dedup/2borg/integrity.1613
4.0K	/home/nstgc5/.test-dedup/2borg/README
4.0K	/home/nstgc5/.test-dedup/3restic/config
885M	/home/nstgc5/.test-dedup/3restic/data
51M	/home/nstgc5/.test-dedup/3restic/index
4.0K	/home/nstgc5/.test-dedup/3restic/keys
0	/home/nstgc5/.test-dedup/3restic/locks
1.6M	/home/nstgc5/.test-dedup/3restic/snapshots

That’s with my current work directories as well as the Btrfs snapshots I’ve been keeping of them. Below is from backing up just my working directories.

$ du -sh ~/.usrtest/*/*
0	/home/nstgc5/.usrtest/1bup/branches
624K	/home/nstgc5/.usrtest/1bup/bupindex
40K	/home/nstgc5/.usrtest/1bup/bupindex.meta
4.0K	/home/nstgc5/.usrtest/1bup/config
4.0K	/home/nstgc5/.usrtest/1bup/description
4.0K	/home/nstgc5/.usrtest/1bup/HEAD
60K	/home/nstgc5/.usrtest/1bup/hooks
4.0K	/home/nstgc5/.usrtest/1bup/info
19M	/home/nstgc5/.usrtest/1bup/objects
0	/home/nstgc5/.usrtest/1bup/refs
4.0K	/home/nstgc5/.usrtest/2borg/config
27M	/home/nstgc5/.usrtest/2borg/data
4.0K	/home/nstgc5/.usrtest/2borg/hints.5
164K	/home/nstgc5/.usrtest/2borg/index.5
4.0K	/home/nstgc5/.usrtest/2borg/integrity.5
4.0K	/home/nstgc5/.usrtest/2borg/README
4.0K	/home/nstgc5/.usrtest/3restic/config
27M	/home/nstgc5/.usrtest/3restic/data
360K	/home/nstgc5/.usrtest/3restic/index
8.0K	/home/nstgc5/.usrtest/3restic/keys
4.0K	/home/nstgc5/.usrtest/3restic/locks
300K	/home/nstgc5/.usrtest/3restic/snapshots
$ du -sh ~/.usrtest/*
20M	/home/nstgc5/.usrtest/1bup
27M	/home/nstgc5/.usrtest/2borg
28M	/home/nstgc5/.usrtest/3restic

Please note that .usrtest doesn’t imply that I’m trying to back up/dedup /usr/; the script I had used for that purpose just points to that directory for the repos.

They all seem to do about the same when the redundancy is low. This is something I hadn’t really checked before.

Below is the result of Restic compiled to use a smaller chunk size and initialized with --repository-version 1

$ restic-13 backup -vr .test-restic-13 ~/Work/
open repository
enter password for repository: 
repository 7e8afb98 opened (repository version 1) successfully, password is correct
created new cache in /home/nstgc5/.cache/restic
lock repository
no parent snapshot found, will read all files
load index files
start scan on [/home/nstgc5/Work/]
start backup on [/home/nstgc5/Work/]
scan finished in 0.366s: 2942 files, 26.939 MiB

Files:        2942 new,     0 changed,     0 unmodified
Dirs:         1355 new,     0 changed,     0 unmodified
Data Blobs:   4062 new
Tree Blobs:   1231 new
Added to the repository: 19.854 MiB (20.202 MiB stored)

processed 2942 files, 26.939 MiB in 0:01
snapshot f7cd372a saved

And repeating this with the version from the Arch repo:

$ restic backup -vr .test-restic ~/Work/
open repository
enter password for repository: 
repository c4febf9f opened successfully, password is correct
created new cache in /home/nstgc5/.cache/restic
lock repository
load index files
no parent snapshot found, will read all files
start scan on [/home/nstgc5/Work/]
start backup on [/home/nstgc5/Work/]
scan finished in 0.338s: 2942 files, 26.939 MiB

Files:        2942 new,     0 changed,     0 unmodified
Dirs:         1355 new,     0 changed,     0 unmodified
Data Blobs:   1855 new
Tree Blobs:   1231 new
Added to the repo: 26.741 MiB

processed 2942 files, 26.939 MiB in 0:01
snapshot 6891486b saved

As can be seen, Restic is chopping the files up into finer pieces. We can confirm that there are some space savings.

$ du -sh ~/.test-restic*
28M	/home/nstgc5/.test-restic
21M	/home/nstgc5/.test-restic-13

Doing the same with those snapshots:

$ restic init .test-restic
$ restic-13 init --repository-version 1 -r .test-restic-13
$ restic backup -vr .test-restic ~/Work/Snapshots/
open repository
enter password for repository: 
repository 80ec1906 opened successfully, password is correct
created new cache in /home/nstgc5/.cache/restic
lock repository
load index files
no parent snapshot found, will read all files
start scan on [/home/nstgc5/Work/Snapshots/]
start backup on [/home/nstgc5/Work/Snapshots/]
scan finished in 21.366s: 1113955 files, 15.234 GiB

Files:       1113955 new,     0 changed,     0 unmodified
Dirs:        501528 new,     0 changed,     0 unmodified
Data Blobs:  10702 new
Tree Blobs:  434331 new
Added to the repo: 852.270 MiB

processed 1113955 files, 15.234 GiB in 26:54
snapshot 297af141 saved

$ restic-13 backup -vr .test-restic-13 ~/Work/Snapshots/
open repository
enter password for repository: 
repository ca1319f6 opened (repository version 1) successfully, password is correct
created new cache in /home/nstgc5/.cache/restic
lock repository
no parent snapshot found, will read all files
load index files
start scan on [/home/nstgc5/Work/Snapshots/]
start backup on [/home/nstgc5/Work/Snapshots/]
scan finished in 33.982s: 1113955 files, 15.234 GiB

Files:       1113955 new,     0 changed,     0 unmodified
Dirs:        501528 new,     0 changed,     0 unmodified
Data Blobs:  27029 new
Tree Blobs:  434331 new
Added to the repository: 863.537 MiB (893.898 MiB stored)

processed 1113955 files, 15.234 GiB in 24:52
snapshot 4792dc4b saved

As can be seen here, for larger data sets with lots of redundancy, the smaller chunk size doesn’t help. I’d have confirmed this with du, but I Ctrl+R’d to rm -rf ~/.test-restic* instead. Oops. I’m getting a bit hasty.

Note that before, I was using a script that mounted each snapshot into a fixed directory before running restic backup. This might make a difference. It certainly is harder on my ~/.cache to do it this way.

And to give a sense of how redundant the data is:

$ sudo btrfs fi du -s ~/Work/Snapshots/
     Total   Exclusive  Set shared  Filename
  15.44GiB    69.97MiB   230.97MiB  /home/nstgc5/Work/Snapshots/

Note that btrfs fi du does not report compressed size, so it’s reasonable to compare the other results against this for the sake of deduplication.

edit: And I’ll reply to @gurkan later today. I do have something to say in reply (some thoughts on that hypothesis), but I need to take time to put my thoughts in order.

I don’t think the issue has to do with Restic not chopping up files. By default, Borg uses a chunk size twice that of Restic, and when I diced the files up finer with Restic, I didn’t get any significant improvement. In fact, it’s worse when testing against the larger data set.

Note that 13 bits was chosen because it matches Bup’s chunker.

I feel that even if it’s merely identifying files that are identical, without doing any chunking, I should be seeing better deduplication. I’d need to construct a new test to confirm this, however. I figure a script could start by putting all the files in a snapshot into a tar file, then look for those same files in the next snapshot and add only the files that are new or changed to the tar file. That’s something for later, however. There’s probably also a “stupid” deduplicator that does that for me, and more efficiently, but I think it would take more time to search for one than it would to just write a script myself.
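
Something like this sketch is what I have in mind (untested; it skips the tar part and just totals the bytes of unique whole-file contents, which is the best a file-level deduplicator could do; snapshot_dirs stands in for the list of snapshot directories):

using SHA   # Julia stdlib

function whole_file_dedup_estimate(snapshot_dirs)
    seen = Set{String}()
    total = 0       # bytes across all snapshots, as laid out on disk
    unique = 0      # bytes a whole-file-only deduplicator would have to keep
    for dir in snapshot_dirs, (root, _, files) in walkdir(dir)
        for f in files
            path = joinpath(root, f)
            islink(path) && continue        # skip symlinks
            data = read(path)
            total += length(data)
            h = bytes2hex(sha256(data))
            h in seen && continue
            push!(seen, h)
            unique += length(data)
        end
    end
    return (total = total, unique = unique)
end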

I think this is the explanation for the large repository size. Each directory entry is roughly 400 bytes or larger. Multiply that by the number of files plus directories (1.6 million) and you arrive at roughly 600 MB just for the directory metadata. That is probably enough to explain the large size difference between the backup repositories. Most of it should be fixed by the compression support that will be added in restic 0.14.
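
As a quick sanity check of that estimate, using the file and directory counts from the backup output above and the rough 400 bytes per entry:

entries = 1_113_955 + 501_528   # files + dirs reported by restic above
entries * 400 / 2^20            # ≈ 616 MiB of tree metadata, before any compression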

Would this still hold true if, instead of backing up the directory containing all the snapshots, I mounted a snapshot, backed it up, unmounted it, then mounted the next, and so on? I typically use the following, but was in too much of a hurry this morning.

# list_snaps() and getfirst() are helpers defined elsewhere in the script;
# restic() here is a two-argument variant of the helper shown earlier.
for snap in list_snaps()
	date = getfirst(r"(20\d{2}-[01]\d-[0-3]\d)\.snap$", snap)
	run(`sudo mount --bind $snap $target`)
	restic("backup -v $target", date)
	run(`sudo umount $target`)
end

The key part being the mount and umount. That should result in roughly 1/404th the file/directory entries, or thereabouts. Right?

edit: Even then, aren’t the other programs having to store the same metadata?

edit2: From Backing up — restic 0.16.3 documentation

Metadata changes (permissions, ownership, etc.) are always included in the backup, even if file contents are considered unchanged.

So I guess the answer to my previous question is “no”. And that there is no way to turn that off.

edit3: I misread that. It saves metadata changes. The metadata shouldn’t be changing. I have noatime and most files are just sitting there doing nothing on any given day.

edit4: In any case, is there a way to test that?

I’m not sure that only mounting one of the snapshots at a time will reduce the directory metadata. After all, if it could be deduplicated, then it already would be deduplicated. But restic 0.14.0 is out now, so you might want to give the compression feature a try.
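
If you do, something along these lines should work with the restic() helper from the test script earlier in the thread (the --repository-version and --compression flags are as I understand them from the 0.14 docs, so double-check them against your version):

run(restic("init --repository-version 2"))
run(restic("backup --compression max -v $index"))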