Restic dedup strategy

daiquiri · September 7, 2017, 10:43am

Hi!

I was researching backup solutions when I found restic, thanks to its support for B2 cloud storage.

One thing I’ve noticed is that restic performs the worst on deduplication according to this benchmark made by Duplicacy.

According to this, restic wastes A LOT more space, making its dedup strategy almost useless when compared to attic and others? It’s ~2.5x bigger!

I would like to know the following before using restic:

Is this benchmark up to date, and does restic still underperform on deduplication?
Are there any plans to support the famous lock-free deduplication algorithm or to improve the current one?

Thank you!

rawtaz · September 7, 2017, 11:20am

Hi!

I cannot answer the question (someone else will), but my thought when reading your post was “why not try it anyway?”. Give restic a shot with some data, you decide how big it is, and see how it works out for you. Benchmarks aren’t everything.

Either way, looking forward to a more on-topic reply!

fd0 · September 7, 2017, 2:13pm

Hey, welcome to the forum!

I had a quick look at the benchmarks and it’s not impossible that they are accurate. restic does not support compression (yet) and uses JSON as the storage format for metadata, which is famously verbose.

I’m sure that something else is going on here, maybe I can find some time to reproduce the benchmarks and have a look at what’s happening.

The benchmark was made with an older version of restic, but since we haven’t changed anything on the core data structures the same results can likely be observed with the most recent version.

I’ve had a look at the algorithm that duplicacy uses and it’s an interesting idea, but it’s not possible to apply this directly to restic.

fd0 · September 7, 2017, 6:20pm

I’ve spent some time looking at the benchmarks and observed the same results. The main reason for the large restic repository size in the test case with the Linux kernel code is that restic does not compress the data and metadata yet. C source code can be compressed quite a lot.

I’ve quickly hacked a version of restic together that compresses most data with the “snappy” algorithm (moderate compression, very fast), and ended up with 853MiB repository size, which is quite similar to the other programs (duplicacy: 834MiB).

So, in order to improve the situation we need to add compression, which requires changing the repo format. This is tracked here: https://github.com/restic/restic/issues/21
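
To make the point above concrete, here is a minimal Python sketch (not restic's code). It assumes fixed-size chunks for simplicity and uses `zlib` from the standard library as a stand-in for snappy; the generated "source tree" and the names `store_chunks` and `stored_size` are hypothetical. It shows that deduplication alone saves nothing when every chunk is unique, while compressing each stored chunk shrinks source-like text considerably.

```python
import hashlib
import zlib


def store_chunks(data: bytes, chunk_size: int = 4096, compress: bool = False) -> dict:
    """Split data into fixed-size chunks and keep one copy per unique chunk.

    Returns a mapping from the chunk's SHA-256 hash to the stored bytes.
    """
    store = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:  # deduplication: identical chunks are stored once
            store[key] = zlib.compress(chunk) if compress else chunk
    return store


def stored_size(store: dict) -> int:
    """Total number of bytes held in the chunk store."""
    return sum(len(blob) for blob in store.values())


# Hypothetical stand-in for a source tree: distinct but very compressible lines.
source = "".join(
    f"static int handler_{i}(struct device *dev) {{ return dev->ops->run(dev, {i}); }}\n"
    for i in range(20000)
).encode()

raw = store_chunks(source)                    # dedup only
packed = store_chunks(source, compress=True)  # dedup plus per-chunk compression

print("input bytes:        ", len(source))
print("dedup only:         ", stored_size(raw))
print("dedup + compression:", stored_size(packed))
```

Because no two chunks in this input are identical, the dedup-only store is as large as the input; the gap between the last two numbers comes entirely from compression.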

daiquiri · September 7, 2017, 9:10pm

Thank you so much for explaining it. I was really curious because restic was the only one to diverge from the other solutions.

Is there any projected release date for the new repo format? I took a look at the issues and they have been open for quite some time. Just curious, I’ve been reading a few issue threads and love the way this project is managed.

Thank you once again.

fd0 · September 8, 2017, 6:13am

No, there isn’t. We (and particularly I) are currently busy with two different issues:

Getting paths in the repo right (that’s issue #549, “Unable to backup/restore files/dirs with same name”)
Finishing a local metadata cache (see pull request #1040, “Add local cache”, for progress)

I’m very glad to hear that! Would you mind elaborating a bit on what you particularly like, or what we do differently from other projects? It’d really help me and the other contributors.

flamingm0e · September 9, 2017, 2:38am

I have been testing Restic, Borg, and Duplicacy as my potential backup solutions over the last couple of weeks.

Right now, I have the same data (a user home directory) being backed up to both restic and duplicacy repos. Each repo is on a different dataset on my NAS, so I can track their size.

Both of them are backing up the same data and I tried to create the same EXCLUDES for each, although there may be some inconsistencies in the excludes (duplicacy excludes are dead freakin simple).

My Restic dataset/repo is 50.1 GiB while my Duplicacy dataset/repo is 50.0 GiB.

While this is not a ‘scientific’ test, or accurate by any means, I can say that they seem to be pretty much on par with each other. The speed difference between them is almost negligible. Restic takes slightly longer, but not enough to concern myself with on regular backups. Overall I am heavily leaning toward Duplicacy right now, but I do have a few months to decide.

fd0 · September 9, 2017, 8:27am

Very interesting, thanks for the data point! That’s real-life data (not artificially generated or just one subtype of file, like source code)?

I suspected that the test with the Linux kernel code is a bit of a special case, because restic’s lack of compression (so far) makes the repo grow so much there.

Glad you found two solutions that work for you. I’d be interested in which one you ultimately choose (and why).

flamingm0e · September 10, 2017, 1:18am

Yes, that is the home directory of my Fedora 26 desktop (configs, documents, pictures, etc). I have a bunch of excludes so there really is a lot more data than that.

fawick · September 14, 2017, 7:23am

@flamingm0e, I wonder whether the 50.1GB are the whole restic repository or only the data/ subdirectory? If it’s the whole repository, how much of it is data/ and how much is index/?
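
One way to answer that question is to sum the file sizes under each top-level entry of the repository directory. The Python sketch below does this for a repository on a local filesystem; the function names are hypothetical, and the only layout it assumes is the `data/` and `index/` subdirectories mentioned above. It reports apparent file sizes, which can differ from what a NAS dataset reports as used space.

```python
import os
import sys


def subdir_sizes(repo_path: str) -> dict:
    """Return the total size in bytes of each top-level entry in repo_path."""
    sizes = {}
    for entry in os.scandir(repo_path):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            # Walk the subdirectory and add up every regular file in it.
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    total += os.path.getsize(os.path.join(root, name))
            sizes[entry.name] = total
        else:
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes


def format_report(sizes: dict) -> list:
    """Format one line per entry, largest first, with its share of the total."""
    grand_total = sum(sizes.values())
    lines = []
    for name, size in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
        share = 100.0 * size / grand_total if grand_total else 0.0
        lines.append(f"{name:<12} {size / 2**30:10.3f} GiB  {share:5.1f}%")
    return lines


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python repo_sizes.py /path/to/repo
    print("\n".join(format_report(subdir_sizes(sys.argv[1]))))
```

Comparing the `data` and `index` lines of the output shows how much of the total is backed-up content and how much is index overhead.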

fd0 · July 20, 2018, 7:17am

I just stumbled on this thread. @flamingm0e, what solution did you choose in the end and why? I’m curious.

flamingm0e · July 21, 2018, 4:17am

In the end, I went with restic due to the capabilities and the “cloud” backend ability (and cross platform for me to back up my wife and son’s Windows machines!).

I set up my NAS with Minio and have that set as my backup target. I have had no issues with it.

Personally, I didn’t like duplicacy’s restore process. I like the simplicity of mounting my restic repo as storage and browsing the files to restore what I need. I don’t have to use any command line for that.

Thank you for your hard work and dedication to the project. It’s been working great for me. After coming up with some scripts for it, it “just works”.