I’ve been working on deploying restic across a ~2000-node environment. Although it’s not used as “the” backup solution yet, it is very promising so far.
So I’d like to share some notes from this, hoping they help others understand the bottlenecks they might struggle with. I am sure plenty of people have written similar things and I am repeating them by not reading enough, like the saying:
> A couple of months in the laboratory can frequently save a couple of hours in the library.
But here it is.
Also, a mandatory note: take this feedback with a grain of salt, since:
- Restic is still the best backup tool I can see around.
- Some of the issues mentioned below are specific to our use case, storage selection, etc.
- The community around the project is amazing.
- Storage backend is Minio (one bucket per project, with a widely varying number of hosts per project).
- Using a handmade scheduler which triggers backup operations at host-specific, predefined “right” hours (see the sketch after this list).
- Currently using a patched binary, due to reasons you’ll see below. I’d love to use the official binary, but that doesn’t seem possible before some of the PRs get merged.
- Running restic via a wrapper Python script.
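To make the scheduler idea concrete, here is a minimal sketch of the hour check. The host-to-hours map is hypothetical and hard-coded here; in our setup it comes from inventory data:

```python
from datetime import datetime

# Hypothetical host -> allowed backup hours map; in reality this comes
# from inventory data rather than a hard-coded dict.
BACKUP_HOURS = {
    "web-01.example.com": {2, 3},  # quiet hours for this host
    "db-01.example.com": {4},
}

def should_backup_now(host: str) -> bool:
    """True if the current hour is one of the host's predefined "right" hours."""
    return datetime.now().hour in BACKUP_HOURS.get(host, set())
```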
It’s safe to say I wasn’t expecting the prune operation to be problematic. I even postponed running forget & prune for a long time, then realized I have to run them often.
More snapshots & data mean slower operations; even listing the snapshots started to take 2 minutes at some point.
Currently running prune every 2 days.
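For reference, our forget & prune step boils down to something like the following sketch; the retention values are made up, but the flags are standard restic options:

```python
import subprocess

# Hypothetical retention policy; adjust the --keep-* values to taste.
FORGET_ARGS = [
    "forget",
    "--keep-daily", "7",
    "--keep-weekly", "4",
    "--keep-monthly", "6",
    "--prune",  # remove unreferenced data in the same run
]

def forget_and_prune(repo: str, password_file: str) -> None:
    """Apply the retention policy and prune the repository."""
    subprocess.run(
        ["restic", "-r", repo, "--password-file", password_file, *FORGET_ARGS],
        check=True,
    )
```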
If your repository is big enough, a few hours of prune time is not surprising. With that many hosts, you either need to invent something that dynamically chooses the best possible time to prune, or improvise. Otherwise you’ll need to “skip backup” (bad) or “back up at the wrong time” (worse) for a lot of clients.
Currently we have one failover repository for every project. When a backup fails due to the exclusive lock prune needs, the wrapper script sends the snapshot to this failover repository instead. After the prune operation finishes, snapshots are copied from the failover repository to the main one (sketched below).
Not the optimal solution, but seems to work.
We use the patch adding a copy command from this branch: https://github.com/middelink/restic/tree/fix-323
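A condensed sketch of that failover logic, assuming hypothetical repository URLs and that restic’s “repository is already locked” error text is stable (check the exact wording for your version):

```python
import subprocess

def run_restic(repo: str, args: list, password_file: str) -> subprocess.CompletedProcess:
    """Run a restic command against the given repository."""
    cmd = ["restic", "-r", repo, "--password-file", password_file, *args]
    return subprocess.run(cmd, capture_output=True, text=True)

def backup_with_failover(primary: str, failover: str, paths: list, password_file: str) -> str:
    # Try the primary repository first.
    result = run_restic(primary, ["backup", *paths], password_file)
    if result.returncode == 0:
        return "primary"
    # Prune holds an exclusive lock; if that's why we failed, use the failover repo.
    # NOTE: matching on the error text is an assumption; it may vary between versions.
    if "repository is already locked" in result.stderr:
        result = run_restic(failover, ["backup", *paths], password_file)
        if result.returncode == 0:
            return "failover"
    raise RuntimeError(f"backup failed: {result.stderr.strip()}")
```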
Memory usage is the current haunting issue for my setup. For big repositories, restic easily grabs more than a GB of memory, which is not acceptable in some situations.
If I catch a restic instance getting killed by the kernel due to memory pressure, I immediately stop all operations. But even that is not enough, as you can imagine.
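For what it’s worth, a wrapper can detect this case cheaply. A sketch, relying only on the fact that the OOM killer delivers SIGKILL, which subprocess reports as a negative return code:

```python
import signal
import subprocess

def was_oom_killed(result: subprocess.CompletedProcess) -> bool:
    """Heuristic OOM check: a negative return code means the process died
    from a signal, and the kernel OOM killer sends SIGKILL. This can't
    distinguish an OOM kill from a manual `kill -9`, so treat it as a hint."""
    return result.returncode == -signal.SIGKILL
```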
But there seems to be light at the end of the tunnel: I just tested a patch which brings a visible improvement.
Restore speed is another thing that needs to be solved for s3 backends.
I am currently using patches for both the prune and restore operations (which brought a massive improvement). The released binary is sadly not so usable for big deployments in this regard.