We have been running into a few performance issues with restic, namely long-running jobs and server crashes. A couple of examples of what we have seen:
restic check consuming lots of CPU and Memory leading to server crashes or OOMs.
restic backup taking more than 4 hours on a 20GB repository
restic prune taking upwards of 5 hours on a 20GB repository
It’s possible that we’re running into some edge cases as we’re running restic across hundreds of production virtual machines ranging widely in resources (2 – 4GB RAM and 1 – 2 CPU Cores) and repository sizes (20GB – 400GB), but either way this is problematic for us.
One trend we have noticed is things really slow down on servers with lots of files. The 20GB repository mentioned above has over a million files, and I suspect restic is running into issues when it’s trying to diff or compare hashes to track what files have changed.
As such, one idea we had to optimise things was to give restic a pre-generated list of the files that had changed, so it doesn’t have to check every file. We considered using auditd to generate this list and then passing it to restic with the --files-from flag.
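To make the idea concrete, here is a minimal sketch of building such a list and handing it to restic. It is only an illustration: the function names are made up for this example, and it uses a modification-time walk as a stand-in for auditd, so it still stats every file and does not by itself remove that cost. It also assumes one plain path per line is acceptable input for --files-from.

```python
import os


def changed_files(root, since):
    """Return sorted paths under root modified after `since` (epoch seconds)."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime > since:
                    changed.append(path)
            except FileNotFoundError:
                # The file vanished between listing and stat; skip it.
                continue
    return sorted(changed)


def write_file_list(paths, list_path):
    """Write one path per line (assumed format for --files-from)."""
    with open(list_path, "w", encoding="utf-8") as fh:
        for path in paths:
            fh.write(path + "\n")


def restic_command(list_path):
    """Build, but do not run, the restic invocation."""
    return ["restic", "backup", "--files-from", list_path]
```

The returned command could be passed to `subprocess.run` once the repository and password are configured in the environment.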
This gets us close but misses one key thing: we still want the full repository present in each snapshot. For example, if we have a repository with 100 files, we would like this:
Hopefully that makes sense: it gives us a full snapshot each day to browse, but reduces the load on restic because it only has to deal with the 20 changed files. We don’t think this is currently possible in restic (happy to be corrected), but does anyone else think this would be useful to them?
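The behaviour being asked for can be sketched as "previous snapshot plus a delta". The snippet below is a toy model, not restic's data structures: a snapshot is reduced to a hypothetical mapping from path to content ID, and `next_snapshot` is an invented helper.

```python
def next_snapshot(previous, changed, deleted=()):
    """Return a full path -> content-id mapping: previous with the delta applied."""
    snapshot = dict(previous)   # unchanged entries are carried over untouched
    snapshot.update(changed)    # only changed files need to be read and stored
    for path in deleted:
        snapshot.pop(path, None)
    return snapshot


# Day 1: 100 files. Day 2: 20 of them changed.
day1 = {f"/data/file{i:03d}": f"id-{i}-v1" for i in range(100)}
changes = {f"/data/file{i:03d}": f"id-{i}-v2" for i in range(20)}
day2 = next_snapshot(day1, changes)

print(len(day2))                                   # 100
print(sum(1 for p in day2 if day2[p] != day1[p]))  # 20
```

The day 2 snapshot is still browsable as a full set of 100 files, but only the 20 changed entries required any work.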
If we could be pointed in the right direction and to the areas of the Go code that might be involved, we may be willing to investigate whether this is something we could build and contribute. We’re also open to any other suggestions on how to improve performance in these sorts of scenarios.
Finally thank you to fd0 and everyone else who contributes to this project.
While restic’s resource usage can be pretty high and prune+check can be quite time-consuming, the backup of a 20GB repository (even with 1 million files) should not take that long. This sounds like a network and/or remote bottleneck.
Just to give you an idea of how fast a backup can be:
This is exactly the problem, and I don’t think it’s due to trying to “compare hashes,” I suspect that it just takes this long to read the indexes into a lookup structure in memory. This is also likely why memory usage is so high.
This backup was made on a host with ~300MB/sec read speed and the repository is accessed via SFTP.
Would you mind sharing what sort of resources the server running the backup had? Also how long does a check or prune take on that box?
I know we’re not network constrained and suspect our storage back-end is fine as well. It’s likely the actual VM doing the backup is the culprit, as it’s handling other workloads at the same time.
In general I think we’re running into issues in low-resource environments like the ones I mentioned above, which is why we’re looking into ways to optimise restic in scenarios where we can’t throw more resources at it.
Personally, I would love to see a client/server split operation mode in restic where the server has the indexes hot in memory all the time, and the client is exceptionally dumb, telling the server what content it wants to store and the server replying with “I have that” or “I don’t have that, please send it.”
The client would use nearly no resources, and the server could service multiple running clients at the same time.
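A rough sketch of that exchange, purely to illustrate the "I have that" / "please send it" protocol. Everything here is hypothetical (`IndexServer`, `client_backup`), runs in one process with no network, and ignores chunking, encryption and pack files. The only detail borrowed from restic is that blobs are identified by their SHA-256 hash.

```python
import hashlib


class IndexServer:
    """Keeps the blob index hot in memory and answers lookups for clients."""

    def __init__(self):
        self._blobs = {}  # blob id -> data; stands in for index plus storage

    def has(self, blob_id):
        return blob_id in self._blobs

    def store(self, blob_id, data):
        # Verify the client-supplied ID before accepting the blob.
        if hashlib.sha256(data).hexdigest() != blob_id:
            raise ValueError("blob id does not match content")
        self._blobs[blob_id] = data


def client_backup(server, chunks):
    """Send only the chunks the server lacks; return how many were uploaded."""
    uploaded = 0
    for data in chunks:
        blob_id = hashlib.sha256(data).hexdigest()
        if not server.has(blob_id):   # "I don't have that, please send it."
            server.store(blob_id, data)
            uploaded += 1
    return uploaded
```

The client holds no index at all, and several clients can share one `IndexServer`, so a chunk already stored by one of them is never uploaded again by another.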
In this case it’s a VM (hosted at Hetzner in Germany) with 2 vCPUs (2-3% average CPU usage) and 8GB RAM (~2GB active). The last check took 41min. I don’t know how long pruning takes but will take a look for you as soon as possible.
What OS, and what other software is running? I had a problem on Windows 10 with Restic being VERY slow, which I eventually identified as a resident virus checker scanning every single file that Restic looked at the metadata for, even if it wasn’t backing up the file. I worked this out using the resource manager.
Resource Monitor. When Restic was running I watched the “Disk” tab, and I could see that the antivirus process “Avira” was hitting the disk a LOT, whereas Restic wasn’t. That and the rest of the information in the tab told me that every file that Restic requested metadata on (eg modification date) was being scanned by Avira. I added the restic process to the ignore list, it went a LOT faster after that, an order of magnitude or more.
counting files in repo
building new index for repo
[1:12:52] 100.00% 21759 / 21759 packs
repository contains 21759 packs (520825 blobs) with 104.011 GiB
processed 520825 blobs: 0 duplicate blobs, 0 B duplicate
load all snapshots
find data that is still in use for 323 snapshots
[18:59] 100.00% 323 / 323 snapshots
found 508137 of 520825 data blobs still in use, removing 12688 blobs
will remove 0 invalid files
will delete 698 packs and rewrite 1740 packs, this frees 8.562 GiB
[38:58] 100.00% 1740 / 1740 packs rewritten
counting files in repo
[14:19] 100.00% 20122 / 20122 packs
finding old index files
saved new indexes as [b7c9dc97 313f5321 d587cc9e ef693485 c242e921 37926a95 380be60a]
remove 259 old index files
[1:49] 100.00% 2438 / 2438 packs deleted
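For a rough total, the bracketed timers in the prune output above can be added up. This is a small sketch that only counts the phases that print a timer, so the real wall-clock time is somewhat higher.

```python
import re

# Timed lines copied from the prune output above.
LOG = """\
[1:12:52] 100.00% 21759 / 21759 packs
[18:59] 100.00% 323 / 323 snapshots
[38:58] 100.00% 1740 / 1740 packs rewritten
[14:19] 100.00% 20122 / 20122 packs
[1:49] 100.00% 2438 / 2438 packs deleted
"""


def to_seconds(stamp):
    """Convert 'H:MM:SS' or 'MM:SS' to seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def total_seconds(log):
    """Sum every leading [timer] found in the log."""
    stamps = re.findall(r"^\[([\d:]+)\]", log, flags=re.MULTILINE)
    return sum(to_seconds(s) for s in stamps)


total = total_seconds(LOG)
hours, rem = divmod(total, 3600)
minutes, seconds = divmod(rem, 60)
print(f"{hours}:{minutes:02d}:{seconds:02d}")  # 2:26:57
```

So the timed phases of this prune add up to just under two and a half hours, with the initial index rebuild (1:12:52) accounting for about half of it.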