I’ve been working on deploying restic across a ~2000-node environment. Although it’s not used as “the” backup solution yet, it is very promising so far.
So I’d like to share some notes from this, hoping they help others understand the bottlenecks they might struggle with. I am sure plenty of people have written similar things and I am repeating them by not reading enough, like the saying:
> A couple of months in the laboratory can frequently save a couple of hours in the library.
But here it is.
Also, a mandatory note: take this feedback with a grain of salt, since:
- Restic is still the best backup tool I can see around.
- Some of the issues mentioned below are specific to our use case, storage selection, etc.
- The community around the project is amazing.
- Storage backend is Minio (one bucket per project, with a widely varying number of hosts per project).
- Using a handmade scheduler which triggers backup operations at host-specific, predefined “right” hours (see the sketch after this list).
- Currently using a patched binary, due to reasons you’ll see below. I’d love to use the official binary, but that doesn’t seem possible before some of the PRs get merged.
- Running restic via a wrapper Python script.
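To make the scheduler idea concrete, here is a minimal sketch of the hour check. The host-to-hours map is hypothetical and hard-coded here; in our setup it comes from inventory data:

```python
from datetime import datetime

# Hypothetical host -> allowed backup hours map; in reality this comes
# from inventory data rather than a hard-coded dict.
BACKUP_HOURS = {
    "web-01.example.com": {2, 3},  # quiet hours for this host
    "db-01.example.com": {4},
}

def should_backup_now(host: str) -> bool:
    """True if the current hour is one of the host's predefined "right" hours."""
    return datetime.now().hour in BACKUP_HOURS.get(host, set())
```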
It’s safe to say I wasn’t expecting the prune operation to be problematic. I even postponed running forget & prune for a long time, then realized I have to run them often.
More snapshots & data mean slower operations; even listing the snapshots started to take 2 minutes at some point.
Currently running prune every 2 days.
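For reference, our forget & prune step boils down to something like the following sketch; the retention values are made up, but the flags are standard restic options:

```python
import subprocess

# Hypothetical retention policy; adjust the --keep-* values to taste.
FORGET_ARGS = [
    "forget",
    "--keep-daily", "7",
    "--keep-weekly", "4",
    "--keep-monthly", "6",
    "--prune",  # remove unreferenced data in the same run
]

def forget_and_prune(repo: str, password_file: str) -> None:
    """Apply the retention policy and prune the repository."""
    subprocess.run(
        ["restic", "-r", repo, "--password-file", password_file, *FORGET_ARGS],
        check=True,
    )
```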
If your repository is big enough, a few hours of prune time is not surprising. With that many hosts, you either need to invent something that dynamically chooses the best possible time to prune, or improvise. Otherwise you’ll need to “skip backup” (bad) or “back up at the wrong time” (worse) for a lot of clients.
Currently we have one failover repository for every project. When a backup fails due to the exclusive lock prune needs, the wrapper script sends the snapshot to this failover repository instead. After the prune operation finishes, snapshots are copied from the failover repository to the main one (sketched below).
Not the optimal solution, but seems to work.
We use the patch adding a copy command from this branch: https://github.com/middelink/restic/tree/fix-323
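A condensed sketch of that failover logic, assuming hypothetical repository URLs and that restic’s “repository is already locked” error text is stable (check the exact wording for your version):

```python
import subprocess

def run_restic(repo: str, args: list, password_file: str) -> subprocess.CompletedProcess:
    """Run a restic command against the given repository."""
    cmd = ["restic", "-r", repo, "--password-file", password_file, *args]
    return subprocess.run(cmd, capture_output=True, text=True)

def backup_with_failover(primary: str, failover: str, paths: list, password_file: str) -> str:
    # Try the primary repository first.
    result = run_restic(primary, ["backup", *paths], password_file)
    if result.returncode == 0:
        return "primary"
    # Prune holds an exclusive lock; if that's why we failed, use the failover repo.
    # NOTE: matching on the error text is an assumption; it may vary between versions.
    if "repository is already locked" in result.stderr:
        result = run_restic(failover, ["backup", *paths], password_file)
        if result.returncode == 0:
            return "failover"
    raise RuntimeError(f"backup failed: {result.stderr.strip()}")
```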
Memory usage is the current haunting issue for my setup. For big repositories, restic easily grabs more than a GB of memory, which is not acceptable in some situations.
If I catch a restic instance getting killed by the kernel due to memory pressure, I immediately stop all operations. But even that is not enough, as you can imagine.
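For what it’s worth, a wrapper can detect this case cheaply. A sketch, relying only on the fact that the OOM killer delivers SIGKILL, which subprocess reports as a negative return code:

```python
import signal
import subprocess

def was_oom_killed(result: subprocess.CompletedProcess) -> bool:
    """Heuristic OOM check: a negative return code means the process died
    from a signal, and the kernel OOM killer sends SIGKILL. This can't
    distinguish an OOM kill from a manual `kill -9`, so treat it as a hint."""
    return result.returncode == -signal.SIGKILL
```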
But there seems to be light at the end of the tunnel: I just tested a patch which brings a visible improvement.
Restore speed is another thing that needs to be solved for s3 backends.
I am currently using patches for both the prune and restore operations (which brought a massive improvement). The released binary is sadly not so usable for big deployments in this regard.