Restic 'snapshots', 'ls', 'restore' speeds on B2 (share yours!)

Now that I have some active repositories on B2, I am doing some testing. If you are using B2, please add your timings and experiences here.

The first repository has 17,500 snapshots of one file each (--stdin database backups).

Starting with no cache, from the US east coast at two different locations with ~1Gbps (symmetric) connectivity, restic snapshots took 4-6 hours and built up a cache of about 70MB. A repeat restic snapshots with the cache in place then took 20-40 seconds. A restic ls of one snapshot (which contains a single file) took ~50 minutes the first time, then 10 seconds when repeated for the same snapshot.

The second repository has 2,800 snapshots of 110k files.

Starting with no cache, from the same two US east locations with ~1Gbps (symmetric), restic snapshots took ~5 hours and built up a cache of about 600MB. A restic ls of one snapshot then took 17-25 minutes the first time, and 17 seconds when repeated for the same snapshot.

I’ll keep testing. So far I am surprised that the time for restic ls depends more on the number of snapshots in the repository than on the number of files in the target snapshot. I expected it to be the other way around.
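
For concreteness, these are the shapes of the commands being timed (the repository URL and snapshot ID are placeholders):

$ time restic -r b2:bucket:repo snapshots
$ time restic -r b2:bucket:repo ls 1a2b3c4d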

Question: does the b2.connections option have any impact on restic snapshots or restic ls?
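
For reference, the option is passed per invocation; something like this (the repository and the value 20 are just illustrative):

$ restic -r b2:bucket:repo -o b2.connections=20 snapshots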

Currently every snapshot means an additional small file to download, so basically you’re measuring latency.

See the following issue for details:

Thanks @dionorgua. For restic snapshots I agree we are measuring latency; that and a lack of parallelism. If there are 10,000 snapshots, the latency would have less impact if restic were downloading the small files for, say, 100 snapshots at once. That’s why I asked whether increasing b2.connections would increase the parallelism of snapshots and thus combat the latency.

For restic ls though, how does the number of snapshots impact that? However many snapshots there are, I am only listing one of them, by its identifier. So why can’t restic go straight to that snapshot?

In the fast case there are many files and parent snapshots to work through to perform the ls, yet it is faster than the smaller, simpler ls in the repository with more snapshots present.

In the slow case there is only one file in one snapshot and no parent snapshots to consider, yet it is much slower than an ls of the much larger, more complex snapshot in the repository with fewer snapshots.

In both cases we ran restic snapshots first, so all the snapshot files were already downloaded before starting the ls; intuitively, then, it shouldn’t be the number of snapshots that impacts ls. But maybe there is a reason I don’t understand, not being familiar with the innards! :blush:

I don’t think latency fully explains the ls behavior?

Just tested a restore with restic 0.9.2 of 74GB from B2, with a round trip to B2 of 76ms. It consisted of 87,500 files, so an average file size of about 800KB. I used -o b2.connections=10, though I am not clear whether that does anything.
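
The invocation was along these lines (repository, snapshot ID and target are placeholders):

$ restic -r b2:bucket:repo -o b2.connections=10 restore 1a2b3c4d --target /mnt/test/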

It took exactly 14 hours, a rate of about 1.7 files/second and 1.4MB/s. The bandwidth available is about 3000MB/s, so that is 0.04% saturation of the available bandwidth. The memory and CPU use by restic was minimal and fairly static.

This is obviously terrible, but I assume it is down to the single-threaded restore and the latency.

@fd0 When there is a build of the multi-threaded restore, please put up an alpha release on GitHub; I’ll download it and run the same test.

Just restored some single-file database snapshots from a repo with 17,500 snapshots. After running restic snapshots, each restore of one 4MB file took about 1 minute.
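
Each restore was along these lines, using latest with host and path filters to pick out the database snapshot (all names here are placeholders):

$ restic -r b2:bucket:repo restore latest --host dbhost --path /db/dump.sql --target /tmp/restore/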

@fd0 which restic operations lock the repository and which are blocked by locks? I don’t think the documentation site covers this.

This is probably all caused by restic listing and downloading the snapshot files from a high-latency remote like B2. Currently, all operations which download the snapshot files do that sequentially, so the runtime increases linearly with the number of snapshots. Some operations (like backup) list and download the snapshot files implicitly, in this case to find a previous snapshot. You can disable that with restic backup --force (maybe we should automatically turn this on for --stdin…).
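
So a latency-insensitive stdin backup would look something like this (the repository, and pg_dump as the dump source, are placeholders):

# --force skips listing the snapshots to find a parent, avoiding the
# sequential per-snapshot downloads on a high-latency backend
$ pg_dump mydb | restic -r b2:bucket:repo backup --stdin --stdin-filename mydb.sql --force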

All operations create (and regularly refresh/recreate) a lock file. The operations which delete data from the repo (forget, rebuild-index, prune) need an exclusive lock, so that fails when any other operation is still running. Other operations like backup only need a non-exclusive lock, so unless one of the exclusive operations is running, multiple instances of restic can backup to the same repo.
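
In practice that means something like this (the repository path is a placeholder):

# non-exclusive locks: several backups can run against one repo at once
restic -r b2:bucket:repo backup /srv/app1 &
restic -r b2:bucket:repo backup /srv/app2 &
wait

# exclusive lock: prune refuses to start while any other lock is held
restic -r b2:bucket:repo prune

# a stale lock left by a crashed client can be removed with
restic -r b2:bucket:repo unlock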

Thanks for the tip about --force. I’ll try that with the --stdin database backups.

If I use --stdin with the same host, tags and path, then there might be some value in finding the parent? But yeah, the rest of the time --force sounds sensible.

Prune of repository with 17854 snapshots (with nothing forgotten)

This prune was run without forgetting anything, so in that sense it was a null operation. Starting with a local cache of all snapshots, the prune took 7.6 hours. The early stages took around 3 hours; the rest of the time was spent consolidating the snapshot indices. After the prune, the whole repository is about 60GB and 58,000 files.

$ time restic prune
repository 88b2b1b2 opened successfully, password is correct
counting files in repo
building new index for repo
[1:45:29] 100.00%  39377 / 39377 packs
repository contains 39377 packs (67853 blobs) with 54.278 GiB
processed 67853 blobs: 0 duplicate blobs, 0B duplicate
load all snapshots
find data that is still in use for 17854 snapshots
[4:03] 100.00%  17854 / 17854 snapshots
found 67853 of 67853 data blobs still in use, removing 0 blobs
will remove 0 invalid files
will delete 0 packs and rewrite 0 packs, this frees 0B
counting files in repo
[52:29] 100.00%  39377 / 39377 packs
finding old index files
saved new indexes as [a6fcda4b 785b9556 82079593 0476dd31 23124853 cd0a2e43 513b77f1 7ce9435c e9618549 dd7a34d0 d2ba118f f90a4bb8 fd2c591a 97705eb2]
remove 17854 old index files
done

real	461m19.283s
user	4m34.314s
sys	0m44.614s

A restore with restic 0.9.2 of 100GB (110,000 files) from B2, with 70-80ms RTT latency, over 3Gbps (at my end), took about 20.5 hours. That gives a data rate of 1.3 MB/s, which is pretty consistent with my other tests: 13Mbps is about peak restore performance (for regular-sized files).

Looking forward to testing a multi-threaded restore!

repository 541ced12 opened successfully, password is correct
restoring <Snapshot 21833f8b of [/foo] at 2018-08-31 16:16:46.450291073 +1200 +1200 by @bar> to /mnt/test/

real	1224m48.702s
user	79m45.756s
sys	10m8.492s

Hm, which stage is that? I can see from what you pasted that the stages which usually take long only sum up to ~2h45m or so, but the complete runtime is more like 8 hours. This looks odd to me, I’d like to find out what’s taking so long (without a status update)…

It was the remove 17854 old index files step that took the balance of the time. I assume it is deleting 18k files synchronously, with RTT and API latency, at approximately 1 file/second? (The 461-minute total minus the ~2h45m of reported stages leaves roughly 5 hours for 17,854 deletions, which works out to about 1 second per file.)

Is that what it is doing? Ideally, any time restic knows the list of file names it needs to copy or delete in advance, it would be nice if it had a multi-threaded or asynchronous task handler to tackle that, eliminating the latency delays.
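
To illustrate the kind of fan-out I mean in shell terms: given the list of object names up front and some per-object delete command (both hypothetical here), spreading the calls across workers lets the per-call latency overlap:

# to-delete.txt and delete-object stand in for the backend’s object list
# and per-object delete call; -P 16 runs 16 deletions concurrently
xargs -P 16 -n 1 delete-object < to-delete.txt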

Oh, ok, that sounds plausible. Yes, deleting many files is (at the moment) a sequential operation. That’s not great, I agree.

Without doing anything else I ran prune again. This time the indexes were already consolidated, so that last part didn’t happen; the earlier stages were a bit faster too, and the total time was under 2 hours.

$ time restic prune
repository 88b2b1b2 opened successfully, password is correct
counting files in repo
building new index for repo
[54:51] 100.00%  39377 / 39377 packs
repository contains 39377 packs (67853 blobs) with 54.278 GiB
processed 67853 blobs: 0 duplicate blobs, 0B duplicate
load all snapshots
find data that is still in use for 17854 snapshots
[0:01] 100.00%  17854 / 17854 snapshots
found 67853 of 67853 data blobs still in use, removing 0 blobs
will remove 0 invalid files
will delete 0 packs and rewrite 0 packs, this frees 0B
counting files in repo
[48:12] 100.00%  39377 / 39377 packs
finding old index files
saved new indexes as [a56f19a8 3f6f1941 db34c086 95c0da70 eca2c694 3148ad84 0703588f 54510e21 37a7c345 0c56565e 7faec841 291766f8 8c4e219e 7338cf44]
remove 14 old index files
done

real	104m5.603s
user	0m30.862s
sys	0m8.363s

A ‘check’ of the repo with 17,854 snapshots takes almost exactly the same time as a re-run of ‘prune’ when nothing needs to be pruned.

$ time restic check
using temporary cache in /tmp/restic-check-cache-692616028
repository 88b2b1b2 opened successfully, password is correct
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
no errors were found

real	102m30.515s
user	0m47.291s
sys	0m19.290s

Doing some testing with restic 0.9.4.

Starting with no local cache, a ‘restic snapshots’ over 3,850 snapshots on B2 took 75 minutes, roughly 1 second per snapshot. This reflects that snapshots are separate files and that no parallelism is employed to fetch them. With a warm cache it takes ~80 seconds. So it is still important to always have a warmed-up cache in your restore environment.
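
One way to keep the cache warm is to list snapshots periodically from the restore host, e.g. from cron (repository and cache path are placeholders):

# listing snapshots refreshes the local metadata cache; run it on a timer
restic -r b2:bucket:repo --cache-dir /var/cache/restic snapshots >/dev/null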

To warm up an empty cache I ran restic snapshots over 20 B2 repos with an average of 3,000 snapshots each, using default options. That took almost exactly 12 hours, about 1.3 snapshots per second. Of course, with lots of smaller repos I can run restic in parallel, and that is probably the answer for now: keep lots of small repos.
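
A sketch of warming several repos concurrently (repo names are placeholders, and each repo would also need its password supplied, e.g. via RESTIC_PASSWORD_FILE):

# one restic process per repo, so the per-snapshot latency overlaps
for repo in repo1 repo2 repo3 repo4; do
  restic -r "b2:bucket:$repo" snapshots >/dev/null &
done
wait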

Working on a restore test of a ~450GB snapshot from B2, over a long distance (~200ms), with default settings.

So far, in 1.5 hours it has restored 73k files totalling 27GB. So about 50k files/hour at 5MB/s.

Is b2.connections still 5 by default? Maybe I should try again with larger values.

Lifting b2.connections from the default 5 to 200 helped a lot, but not as much as I might have hoped.

In 1.5 hours, it restored 93k files totalling 42GB. So 65k files/hour at 7-8MB/s.

That’s only a lift of about 50%. I think we are hitting the hard-coded limit of 8 restore workers: with the default settings we were limited by the B2 connection limit of 5, and now we are limited by the restore worker limit of 8.

Unfortunately the restore worker limit is hard-coded at 8; there is no --option for it, so I can’t easily test that theory. @fd0 are there any other configurable tuning settings I can try?

I did a restore test of 130k files totalling 130GB, from a 500GB repo on B2 at ~200ms, with default options. The restore took 6 hours and 4 minutes. That is a little over 6 files/second, or 6MB/s.

I’ll add more tests here with more B2 connections, with restic running in parallel, and - when it is configurable - with more restore workers.

For comparison, do you have an idea of how long a similar operation would take restoring from a local repository on the same hardware?