Restic dedup strategy


#1

Hi!

I was researching backup solutions when I found restic, thanks to its support for B2 cloud storage.

One thing I’ve noticed is that restic performs the worst on deduplication according to this benchmark made by Duplicacy.

According to this, restic wastes A LOT more space, which makes its dedup strategy look almost useless compared to attic and others. Its repository is ~2.5x bigger!

I would like to know the following before using restic:

  1. Is this benchmark up to date, and does restic still underperform on deduplication?
  2. Are there any plans to support the famous lock-free deduplication algorithm or to improve the current one?

Thank you!


#2

Hi!

I cannot answer the question (someone else will), but my thought when reading your post was “why not try it anyway?”. Give restic a shot with some data (you decide how much) and see how it works out for you. Benchmarks aren’t everything :slight_smile:

In either case, I’m looking forward to a more on-topic reply!


#3

Hey, welcome to the forum!

I had a quick look at the benchmarks and it’s not impossible that they are accurate. restic does not support compression (yet) and uses JSON as the storage format for metadata, which is famously verbose.

I’m sure that something else is going on here. Maybe I can find some time to reproduce the benchmarks and have a look at what’s happening.

The benchmark was made with an older version of restic, but since we haven’t changed anything in the core data structures, the same results can likely be observed with the most recent version.

I’ve had a look at the algorithm that Duplicacy uses and it’s an interesting idea, but it’s not possible to apply it directly to restic.


#4

I’ve spent some time looking at the benchmarks and observed the same results. The main reason for the large restic repository size in the test case with the Linux kernel code is that restic does not compress data or metadata yet, and C source code can be compressed quite a lot.

I’ve quickly hacked a version of restic together that compresses most data with the “snappy” algorithm (moderate compression, very fast), and ended up with 853MiB repository size, which is quite similar to the other programs (duplicacy: 834MiB).
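
To get a rough feel for how much plain source text shrinks under a fast compressor, here is a minimal Python sketch. It is only an illustration: it uses `zlib` from the standard library at its fastest level as a stand-in (it is not snappy and not the patched restic build mentioned above), and the repeated C-like snippet is a made-up sample that compresses far better than a real source tree would.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 1) -> float:
    """Return compressed size divided by original size (lower is better)."""
    if not data:
        raise ValueError("data must not be empty")
    return len(zlib.compress(data, level)) / len(data)


# Hypothetical sample input: a tiny C-like function repeated many times.
c_like = b"static int foo(struct bar *b)\n{\n\treturn b->count + 1;\n}\n" * 200

# Random bytes of the same length, as an example of incompressible data.
random_bytes = os.urandom(len(c_like))

print(f"C-like text:  {compression_ratio(c_like):.2f}")
print(f"random bytes: {compression_ratio(random_bytes):.2f}")
```

To try it on real data, read a file with `open(path, "rb").read()` and pass the bytes to `compression_ratio`.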

So, in order to improve the situation we need to add compression, which requires changing the repo format. This is tracked here: https://github.com/restic/restic/issues/21


#5

Thank you so much for explaining it. I was really curious because restic was the only one to diverge from the other solutions.

Is there any estimate for the release of the new repo format? I took a look at the issues and they have been open for quite some time. Just curious: I’ve been reading a few issue threads and love the way this project is managed.

Thank you once again.


#6

No, there isn’t. We (and particularly I) are currently busy with two different issues:

I’m very glad to hear that! Would you mind elaborating a bit on what you particularly like, or what we do differently from other projects? It’d really help me and the other contributors :slight_smile:


#7

I have been testing Restic, Borg, and Duplicacy as my potential backup solutions over the last couple of weeks.

Right now, I have the same data (a user home directory) being backed up to both restic and duplicacy repos. Each repo is on a different dataset on my NAS, so I can track their size.

Both of them are backing up the same data and I tried to create the same EXCLUDES for each, although there may be some inconsistencies in the excludes (duplicacy excludes are dead freakin simple).

My Restic dataset/repo is 50.1 GiB while my Duplicacy dataset/repo is 50.0 GiB.

While this is not a ‘scientific’ test, or accurate by any means, I can say that they seem to be pretty much on par with each other. The speed difference between them is almost negligible. Restic takes slightly longer, but not enough to concern me for regular backups. Overall I am heavily leaning toward Duplicacy right now, but I do have a few months to decide.


#8

Very interesting, thanks for the data point! That’s real-life data (not artificially generated or just one subtype of file, like source code)?

I suspected that the test with the Linux kernel code is a bit of a special case, because source code compresses so well that restic’s lack of compression (for now) makes the repo much larger there.

Glad you found two solutions that work for you. I’d be interested in which you chose ultimately (and why) :slight_smile:


#9

Yes, that is the home directory of my Fedora 26 desktop (configs, documents, pictures, etc). I have a bunch of excludes so there really is a lot more data than that.


#10

@flamingm0e, I wonder whether the 50.1 GiB is the size of the whole restic repository or only of the data/ subdirectory? If it’s the whole repository, how much of it is data/ and how much is index/?
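
One way to get that breakdown is to sum the file sizes under each top-level entry of the repository. The following is a minimal Python sketch; the function name `subdir_sizes` is made up for this example, and it reports apparent file sizes (`st_size`), which can differ from what the NAS dataset reports if the filesystem compresses or deduplicates on its own.

```python
from pathlib import Path


def subdir_sizes(repo: str) -> dict[str, int]:
    """Sum file sizes in bytes for each top-level entry of `repo`."""
    sizes: dict[str, int] = {}
    for entry in Path(repo).iterdir():
        if entry.is_dir():
            # Walk the subdirectory recursively and add up regular files.
            sizes[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
        else:
            # Top-level files (such as the repository config) count as-is.
            sizes[entry.name] = entry.stat().st_size
    return sizes
```

Calling `subdir_sizes("/path/to/repo")` (the path is a placeholder) returns a dictionary keyed by entry name, so `data` and `index` can be compared directly.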