-o b2.connections=N seems to have no impact

It was just an example where connections can have both high latency and high bandwidth. In those situations restic is unable to use even a tiny fraction of the bandwidth. It gets throttled by latency because it has little to no mechanism to handle it (for some operations).

Regardless of how slow your endpoint upload is, if your latency is high this will impact you, especially for operations like deleting blocks. You would also still hit this problem with any cloud → cloud backup, since cloud bandwidth keeps increasing, even though the latency cannot improve. As bandwidth improves, restic’s relative performance gets worse and worse. That’s why I flagged latency handling as an important modernization. Parallel and/or asynchronous operations are probably the low-hanging fruit to address this.

The new restore is massively more performant on high-latency connections; it really addressed the problem :+1: :heart_eyes: :partying_face:

It is operations like forget and prune that are the bottlenecks now. The easiest workaround right now is to keep backing up without forget/check/prune to one B2/S3 bucket for a while, then start over with a new bucket repo, then later delete the whole first bucket. Otherwise we have to suspend backups for a couple of days to allow a forget/prune pass to complete (since it requires a lock, and a prune on B2 can take 24-48 hours to complete for us).
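As a sketch of that workaround (bucket and repo names are made up; credentials come from the usual B2_ACCOUNT_ID / B2_ACCOUNT_KEY environment variables):

```
# Start a fresh repo in a new bucket and point backups at it
restic -r b2:backups-2021q3:repo init
restic -r b2:backups-2021q3:repo backup /srv/data

# Keep backing up here without forget/check/prune; later, delete the old
# bucket wholesale (B2 web UI or b2 CLI) instead of ever pruning it.
```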

Thanks for the tip about backup limiting to core count. It would be great to un-shackle backup parallelism from the CPU core count. Not sure how CPU could ever be the bottleneck except in the use case of local-to-local NAS/SAN backup?

Huh, did you check out the new options for prune? The option --max-unused unlimited will tell restic to only remove files which are completely unused and just keep the rest (even if they contain both data that’s still in use and unused data). If you don’t want unused data lying around, you can instead limit the amount of data that has to be re-uploaded via --max-repack-size 500M, which only limits the repacking step.
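In other words, something along these lines (repository spec omitted):

```
restic prune --max-unused unlimited    # tolerate unused data, so mixed packs are not repacked
restic prune --max-repack-size 500M    # cap how much data the repack step rewrites/re-uploads
```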

If you try that then please report back!

Just out of curiosity: which version of restic do you use? Which step in the prune process takes so long?


We’re using restic 0.12.0. I see 0.12.1 was released last month, so we will update. Backing up to B2 with up to ~180ms network latency.

Thank you for those optimization ideas, they sound good because storage is often cheaper than time. What we’ve done to optimize it ourselves is split back-ups into silos, with each backup target having its own server-side repo and client-side cache. That means we can break up prunes into tasks that are no more than about 5 hours each.
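Roughly like this (names and retention policy are just for illustration; the point is that each silo gets its own repo and --cache-dir, so prunes can run per silo, and in parallel, since each repo has its own lock):

```
# Silo A
restic -r b2:backups-silo-a:repo --cache-dir /var/cache/restic/silo-a backup /srv/silo-a
restic -r b2:backups-silo-a:repo --cache-dir /var/cache/restic/silo-a forget --prune --keep-daily 7

# Silo B
restic -r b2:backups-silo-b:repo --cache-dir /var/cache/restic/silo-b backup /srv/silo-b
restic -r b2:backups-silo-b:repo --cache-dir /var/cache/restic/silo-b forget --prune --keep-daily 7
```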

The forget/prune time is heavily dominated by deleting things (>95% of total forget/prune time), which doesn’t seem to be parallel at all? Although finding in-use data and repacking take more time per operation, there is not very much of that to do. However, if you are deleting 100,000 things at ~0.17 seconds per delete, that prune is going to take about 5 hours. With our silo approach we can break up the 24-48 hours into silos that take 5 hours or less and can be run in parallel.

Here are the average restic B2 operation times we observe with ~180ms network latency:

Deleting snapshots for forget: 0.18s / delete
Deleting obsolete indexes: 0.16s / delete
Removing old packs: 0.16s / delete

Finding in use data: 0.35s / snapshot
Repacking pack: 0.6s / repack

The delete operation seems to take essentially the network latency time. If so, then if your latency is 10ms you can delete ~100 things/second, if your latency is 100ms you can delete ~10 things/second, and at our ~180ms latency we can delete ~5 things/second. Obviously bandwidth and CPU cores are not relevant here; everything is latency-bound. If we could run 20 delete operations in parallel we could probably reduce forget/prune time from ~5 hours to ~15 minutes.
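As a rough back-of-the-envelope model (assuming each delete is a single blocking round trip and nothing overlaps):

```
prune delete time ≈ objects × latency ÷ parallel deletes
100,000 × 0.18 s ÷ 1  ≈ 18,000 s ≈ 5 hours
100,000 × 0.18 s ÷ 20 ≈    900 s ≈ 15 minutes
```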

There is also a trade-off between frequent smaller forget/prunes and doing them less often. Because we have to suspend back-ups to run prunes, we tend to do it less often, usually monthly.

Back-ups themselves are incremental with retained client-side caches and run frequently (down to every 15 minutes for production). They’re no problem; most take less than one minute to run since they only upload a couple of new files. Restores with the new improved restore are pretty excellent and about as fast as you could hope for, I think. The forget/prune is slow, which is a problem because it needs an exclusive lock the whole time. Slow and non-exclusive would be no issue. Or exclusive and faster.

I’ll add your suggested optimization options for the next forget/prune run and see what the impact is :eyes:

Using a VPS located closer to B2 is not an option for you?


The number of parallel delete operations is currently limited to 8, see

You could also try setting prune -o b2.connections=8 to get the full parallelism. The default connections limit is 5. This makes me wonder whether the operation times have to be multiplied by five, which would mean 0.8s/delete.


Thanks, I had the impression from this article and others that -o b2.connections=8 went to the B2 library but was basically ignored by restic operations. I’ll definitely try -o b2.connections=8 to see if it helps. If we can uncap that hard-coded delete worker limit, maybe we can make a dent in the times!

@764287 I considered running containers closer to the B2 data center just for the forget/prunes. But the restic cache is necessarily in the countries where the data being backed up is. And we’d have to coordinate suspending the backups in those locations with the prune starting near B2. So possible, but a lot of stuff to make it happen.

Last month I tested 0.12.0 with high-latency (180ms) access to B2. Backup and restore performance is great, but forget/prune performance is poor. This is due to how slowly restic can delete objects, and deleting objects is ~95% of the total time a forget/prune takes.

From other comments, the problem is the small cap on the number of allowed delete connections/workers. It appears that currently every connection requires a dedicated worker, and the number of workers defaults to 5 and has a hard-coded maximum of only 8.

The result is that restic can only delete about 6 objects/second by default (at 180ms latency), so a prune that needs to delete 100,000 objects takes ~5 hours. The original results are quoted below:

I have done some further testing with 0.12.1 using the new options the community suggested:

--max-unused unlimited
-o b2.connections=8
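
Combined, that corresponds to a prune invocation roughly along these lines (repository spec omitted):

```
restic prune --max-unused unlimited -o b2.connections=8
```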

Here are the average restic B2 operation times we observed with ~180ms network latency:

Deleting snapshots for forget: 0.1s / delete
Deleting obsolete indexes: 0.09s / delete
Removing old packs: 0.1s / delete

Finding in use data: 0.16s / snapshot
Repacking pack: 1.3s / repack

Being able to increase the number of delete workers from 5 to 8 almost doubled the speed at which restic can delete objects! I saw an approximately linear speed-up with the number of connections/workers. This almost halves the total forget/prune times.

I would love to test with 16 or 32 connections/workers, but unfortunately the hardcoded limit of 8 workers also places a hard limit on the maximum speed of forget/prune operations. It would be great if this limit could be moved out of the code to a configuration setting. Or at least, in the first instance, to a build configuration.

It is not clear what effect --max-unused unlimited had, as the starting state wasn’t identical to the previous test. It probably reduced the number of objects that needed repacking and deleting, but the effect wasn’t dramatic. This was a smaller forget/prune, deleting ~40K objects, and only ~100 repacks were required with --max-unused unlimited.