I’m researching various backup software and I want to see how far Restic can scale in practice. I am looking at a 40-60TB project (total size, multiple clients, the largest around 10-20TB) and want to find out whether people have prior experience with the kind of hardware needed to handle such a load.
- How big is your original data set?
- How is the data churn (i.e. how much data changes on every snapshot)?
- How big is the resulting Restic archive?
- How long does the backup take to complete?
- What is your prune schedule, and how long do those take?
- Did you ever try to restore? How long did that take? How long would a full restore take?
- What kind of hardware are you using, both client and server?
- What do you feel is the bottleneck (ram, cpu, network, …)?
I have heard of a 2PB dataset, for example, that is rotated every year (as opposed to pruned, probably because of performance issues). How about your dataset? Can anyone confirm the 2PB figure or beat it? Sub-PB figures are also fine, of course.
Thanks in advance!
My largest data set is about 350GiB; it’s a backup of a server with several VMs. Roughly 2 to 10GiB change every day. The restic repo is about 600GiB in size. Since it’s a server with normal hard disks (no SSDs), a backup takes about 30-60 minutes (excluding LVM snapshots and fsck). Forget and prune are run manually, with the settings --keep-daily 7 --keep-weekly 8 --keep-monthly 12. The last prune took a long time, about 22 hours, over the sftp backend. I have not attempted a restore yet, just used the fuse mount a couple of times.
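To make the retention settings above concrete, here is a minimal Python sketch of how a daily/weekly/monthly keep policy can select snapshots. It is a simplified illustration, not restic's actual code: the function name select_kept and the one-snapshot-per-day test data are made up for the example, and it assumes each rule keeps the newest snapshot in each of the most recent N periods.

```python
from datetime import date, timedelta

def select_kept(snapshot_days, keep_daily=7, keep_weekly=8, keep_monthly=12):
    """Return the snapshot dates a simple daily/weekly/monthly policy keeps.

    Simplified model (an assumption, not restic's implementation): each rule
    walks the snapshots from newest to oldest and keeps the first snapshot it
    sees in each new period, until it has kept `limit` periods.
    """
    rules = [
        (keep_daily, lambda d: d.toordinal()),               # one bucket per day
        (keep_weekly, lambda d: tuple(d.isocalendar())[:2]),  # (ISO year, ISO week)
        (keep_monthly, lambda d: (d.year, d.month)),          # (year, month)
    ]
    kept = set()
    newest_first = sorted(snapshot_days, reverse=True)
    for limit, bucket_of in rules:
        seen = set()
        for day in newest_first:
            bucket = bucket_of(day)
            if bucket in seen:
                continue            # this period already has its snapshot
            if len(seen) >= limit:
                break               # rule has kept enough periods
            seen.add(bucket)
            kept.add(day)           # a snapshot kept by any rule survives
    return kept

# Hypothetical history: one snapshot per day for 400 days.
today = date(2018, 7, 1)
days = [today - timedelta(days=i) for i in range(400)]
kept = select_kept(days)
print(len(kept), "of", len(days), "snapshots kept")
```

Because a snapshot kept by several rules is only counted once, this policy never keeps more than 7 + 8 + 12 = 27 snapshots, however long the history is.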
The hardware is an older server machine with 2x3TB disks and 32GiB RAM. The storage server, reached via sftp, is a low-power Intel Microserver from 2012 or so.
At the moment the bottleneck for prune is the implementation, which is very conservative… that needs to be improved.
I remember that there was someone who wanted to back up their server farm with restic.
I can’t seem to find the post right now, but I’ll keep looking. Maybe he has more info to share.
I run CourtListener.com where we have millions of PDFs amounting to several TBs of data. I’ve found a couple things in 0.8.3. (0.9.0 is supposed to be better, but I haven’t tested it yet.)
By far the slowest thing is iterating over the files. I think this is better in 0.9.0, but it was taking a very long time, around a day, just to do this.
Memory usage was a major problem. I caught Restic using dozens of GB of memory if I recall. That’s not normal for most backup software I’ve used.
Haven’t tried a restore or prune yet.
The fuse mount was unusably slow. Too bad — what a cool feature!
The backup endpoint I’m using is backblaze (because it’s cheap).
I need to do more research, but I’m kind of on the fence right now until I try 0.9.0. The memory usage and speed were real problems, but perhaps they are better in the latest version, and we’re upgrading our RAID array, which should help too.
Not sure how much this helps, but I think that, big picture, Restic is a more complex system than something like rsnapshot. With that complexity come features (yay!) and CPU/memory consumption (not yay!).
Indeed, from what I understand the archiver code was revamped in 0.9.0, which should improve backup times. But as far as I know, memory usage remains the core issue, especially for prune.
Thanks for the data points and good luck!
Small update. We’re now using Restic 0.9.1 for our backups to backblaze, and we’re still having huge memory issues. Last night our backup failed due to using up all of our free memory (which was about 25GB). Seems like the issue for this is:
Do you have a lot of small files? (Just trying to understand the worst case scenario here)
One collection is 1.6TB with about 10M files and then we’ve probably got another couple million similar files in a separate collection. 10M doesn’t seem like a ton to me, but it sure doesn’t make restic happy.
I’m not entirely sure where the memory is used. I personally don’t have such a huge repository (and not the means to save it somewhere fast).
The new archiver code (which is used during backup by default since 0.9.0) does not use so much memory.
I have a hunch where it may happen: restic loads all index files into data structures in memory. At the moment, a separate data structure is used for each index file (we need to change that). So looking up whether a specific blob is in the repo, and where it is saved, takes longer, and I suspect it also uses much more memory. I’ll have a look at how to hold the data we need in a more efficient way, maybe even using a file on disk for seldom-used blobs.
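The difference between one data structure per index file and a single combined one can be sketched in Python. This is only an illustration of the idea described above, not restic's implementation: the helper names, the pack names, and the sizes (200 index files with 50 blobs each) are invented for the example, and real index entries carry more fields.

```python
import hashlib

def make_index_files(n_files, blobs_per_file):
    """Build hypothetical per-file indexes mapping blob ID -> (pack, position)."""
    indexes = []
    for f in range(n_files):
        entries = {}
        for b in range(blobs_per_file):
            # SHA-256 hex digest used as a stand-in blob ID
            blob_id = hashlib.sha256(f"{f}-{b}".encode()).hexdigest()
            entries[blob_id] = (f"pack-{f}", b)
        indexes.append(entries)
    return indexes

def lookup_per_file(indexes, blob_id):
    """Probe each index in turn; the cost grows with the number of index files."""
    for probes, index in enumerate(indexes, start=1):
        if blob_id in index:
            return index[blob_id], probes
    return None, len(indexes)

def merge(indexes):
    """Combine all per-file indexes into one map so a lookup is a single probe."""
    merged = {}
    for index in indexes:
        merged.update(index)
    return merged

indexes = make_index_files(200, 50)
merged = merge(indexes)

# Look up a blob that happens to live in the last index file.
target = hashlib.sha256(b"199-49").hexdigest()
location, probes = lookup_per_file(indexes, target)
print("per-file lookup:", location, "after", probes, "probes")
print("merged lookup:  ", merged[target], "after 1 probe")
```

In the worst case the per-file layout has to probe every index before it finds (or fails to find) a blob, while the merged map answers in one probe. A merged map also pays the per-structure overhead once instead of once per index file, which is one plausible reason for the memory difference suspected above.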
processed 1547475 files, 8.319 TiB in 35:30
Daily backup on a QNAP TS-453A
This is kind of where my intuition went.
Do I understand your comment correctly to mean that as you have more snapshots, the amount of memory needed for a backup grows too? In that case, I guess part of my problem is that I disabled pruning due to its performance problems. If that’s right, would running just forget (without prune) help solve that issue?
Hm, I’m not sure that’s it. In general, the amount of memory seems to correlate with the number of index files, and therefore with the number of blobs in the repo. A large number of blobs can be caused either by a huge number of small files (= many small blobs) or just by a huge amount of data (= many blobs).
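A back-of-the-envelope calculation shows how both cases end up with millions of blobs. The Python sketch below is a rough estimate under stated assumptions, not a measurement: it assumes an average chunk size of about 1 MiB for large files and at least one data blob per file, and it ignores deduplication, tree blobs, and empty files. The function name estimate_blob_count is made up for the example; the two inputs are the data sets mentioned in this thread.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4

def estimate_blob_count(total_bytes, file_count, avg_chunk_bytes=MIB):
    """Rough data-blob estimate for a data set.

    Assumptions (not measured): every file needs at least one blob, and
    large files are split into chunks of about avg_chunk_bytes on average.
    Deduplication and tree blobs are ignored.
    """
    return max(file_count, total_bytes // avg_chunk_bytes)

# Many small files: about 1.6 TB in about 10M files.
small_files = estimate_blob_count(int(1.6e12), 10_000_000)

# Lots of data in fewer files: 8.319 TiB in 1547475 files.
big_data = estimate_blob_count(int(8.319 * TIB), 1_547_475)

print(f"many small files: ~{small_files:,} blobs")
print(f"large data set:   ~{big_data:,} blobs")
```

Under these assumptions the small-file collection is dominated by the file count and the large data set by its total size, yet both land in the same order of magnitude of blobs that the index has to track.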
As I said in the GitHub issue, I unfortunately don’t have a means to reproduce the memory issue locally, but I’m in the process of building a large repo here so that I can run some tests.
Gotcha. Thanks for looking into this. Let me know if I can help with providing millions of small files or something else like that.