We’re using restic 0.12.0. I see 0.12.1 was released last month, so we will update. We are backing up to B2 with up to ~180ms network latency.
Thank you for those optimization ideas. They sound good, because storage is often cheaper than time. What we’ve done to optimize it ourselves is split back-ups into silos, with each backup target having its own server-side repo and client-side cache. That means we can break up prunes into tasks that take no more than about 5 hours each.
The forget/prune time is heavily dominated by deleting things (>95% of total forget/prune time), which doesn’t seem to be parallelized at all? Although finding used snapshots and repacking take more time per operation, there is not very much of that to do. However, if you are deleting 100,000 things at ~0.17 seconds per delete, that prune is going to take about 5 hours. With our silo approach we can break up the 24-48 hours into silos that take 5 hours or less each and that we can run in parallel.
Here are the average restic B2 operation times we observe with network latency of ~180ms:
Deleting snapshots for forget: 0.18s / delete
Deleting obsolete indexes: 0.16s / delete
Removing old packs: 0.16s / delete
Finding in use data: 0.35s / snapshot
Repacking pack: 0.6s / repack
The time per delete operation seems to be basically the same as the network latency. If so, then if your latency is 10ms you can delete ~100 things/second, if latency is 100ms you can delete ~10 things/second, and at our ~180ms latency we can delete ~5 things/second. Bandwidth and CPU cores are not relevant here; everything is latency-bound. If we could run 20 delete operations in parallel we could probably reduce forget/prune time from ~5 hours to ~15 minutes.
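To make the arithmetic above easy to check, here is a rough back-of-the-envelope sketch in Python. It is not anything restic provides: the function name `estimate_delete_hours` and the `workers` parameter are made up for illustration, and it assumes every delete costs the same fixed time and that parallel deletes scale perfectly, which a real backend may not allow.

```python
def estimate_delete_hours(num_objects: int, seconds_per_delete: float, workers: int = 1) -> float:
    """Estimate wall-clock hours for purely latency-bound deletes.

    Assumes each delete takes a fixed `seconds_per_delete` and that
    `workers` deletes can be in flight at once with no slowdown
    (an idealized assumption, not a measured restic behaviour).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Each worker processes its share sequentially; round the share up.
    deletes_per_worker = -(-num_objects // workers)
    return deletes_per_worker * seconds_per_delete / 3600


# Figures from this post: 100,000 deletes at ~0.17s each.
serial_hours = estimate_delete_hours(100_000, 0.17)
parallel_hours = estimate_delete_hours(100_000, 0.17, workers=20)
print(f"serial: {serial_hours:.1f} h, 20 workers: {parallel_hours * 60:.0f} min")
```

With the numbers from this post the serial estimate comes out a little under 5 hours, and 20 ideal parallel workers bring it down to roughly a quarter of an hour.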
There is also a trade-off between frequent, smaller forget/prunes and doing them less often. Because we have to suspend back-ups to run prunes, we tend to do them less often, usually monthly.
Back-ups themselves are incremental with retained client-side caches and run frequently (down to every 15 mins for production). They’re no problem; most take less than one minute to run since they only upload a couple of new files. Restores with the new improved restore are pretty excellent and about as fast as you could hope for, I think. The forget/prune is slow, which is a problem because it needs an exclusive lock the whole time. Slow and non-exclusive would be no issue. Or exclusive and faster.
I’ll add your suggested optimization options for the next forget/prune run and see what the impact is.