I’m testing restic 0.9.4 as a possible solution for backing up a relatively large collection of files (users’ home directories and research data files on a server) to remote storage via the sftp backend. The data size of both the remote restic repository and the original collection is about 250 GB. There are ~50K files in the repository and ~780K in the original collection. Both my server and the remote storage site run CentOS 7.
I’m fully satisfied with the backup process (really decent performance!). However, restoring data raises questions. In my first test I tried to restore a 3.3 GB, ~400-file (sub-)directory tree (a small subset of the original collection):
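For reference, the restore invocation was presumably of the following form (mirroring the command quoted later in the thread; `remote_host`, `path_to_repo` and `path_to_subtree` are placeholders):

```
$ time restic -r sftp:remote_host:path_to_repo restore latest --include path_to_subtree --verbose --target /scratch/
```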
It took 24 min, so the average data transfer rate was only 2.2 MB/s. The network link between the local server and the remote storage site sustains 30-40 MB/s (from routinely using it), and the test was repeated multiple times, so 2.2 MB/s is not a random drop in link performance.
In follow-up tests I moved the repository around: I put it on another local server and on 2 more remote sites at different distances (so network latency varied from 0.05 ms to 50 ms). Then I restored the same subtree onto the same server with the command above, varying only the remote repository and cleaning up the /scratch directory before each test. Here are my results:
| latency to the repository host | test (restore) time | avg. data transfer rate |
|---|---|---|
| 0.05 ms (local host) | 53 s | 60 MB/s |
| 1 ms | 130 s | 25 MB/s |
| 50 ms (orig. test) | 1450 s | 2.2 MB/s |
| 50 ms (anoth. site) | 1490 s | 2.2 MB/s |
Each test was repeated multiple times to make sure the results are reproducible. So it looks like it’s a feature of restic to slow down with the distance (in terms of network lag) to the repository location. Is that right? Or should I look for other reasons for the poor restic restore performance from the far (50 ms) sites?
Poor restore performance is a known issue and has been discussed in GitHub issue #2074. @ifedorenko opened pull request #2195, which should greatly improve restore performance. Unfortunately this pull request hasn’t been merged yet, so you either need to build restic yourself or wait for @fd0 to find some time to review it.
@d_v_v do you think you can test PR #2195, ideally with different numbers of restorer workers? I only tested with http-based repositories and don’t really know what to expect from sftp.
Somehow, reading the forum, I got the impression that improvements to the restore function had already been implemented in the 0.9.4 release. And I did observe that files were restored in parallel.
This is correct, the 0.9.4 restore is expected to work fast in most cases. The only known exception is the restore of very large files (hundreds of megabytes), which PR #2195 is meant to improve.
Hm! Then it’s probably not going to help. In my restore sample the average file size is 3300 MB / 400 ≈ 8 MB, and the maximum file size is ~200 MB (there are very few of those).
I’ll try it anyway.
My impression was that during the restore operation restic decides which chunks/packs to download next based on the recently downloaded ones. This would explain why the transfer speed drops with network latency. Roughly speaking, restic (my guess) has to download a group of packs, examine them, and only then does it know which packs to download next. At each such step the network lag introduces an extra delay.
For tree objects, yeah, this is the case. The trees list out directory contents, which include tree IDs for subdirectories as well as blob IDs for files. If your backups have a lot of directories, then there are more round trips to fetch them.
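To make that intuition concrete, here is a toy back-of-the-envelope model (my own illustration, not restic’s actual restore logic; the request counts are made up): if a restore issues N serialized requests, each one pays a full round trip before the next can start, so the total time is roughly N × RTT plus the raw transfer time.

```go
package main

// Toy latency model: a restore that issues N serialized requests pays one
// full round trip per request, so total time ≈ N*RTT + dataSize/bandwidth.
import (
	"fmt"
	"time"
)

func main() {
	dataMB := 3300.0             // size of the restored subtree
	linkMBps := 35.0             // raw link throughput
	rtt := 50 * time.Millisecond // latency to the far repository

	transferTime := time.Duration(dataMB / linkMBps * float64(time.Second))
	for _, serialReqs := range []int{10, 1000, 10000} {
		total := time.Duration(serialReqs)*rtt + transferTime
		fmt.Printf("serial requests=%-5d total≈%v  effective rate≈%.1f MB/s\n",
			serialReqs, total.Round(time.Second), dataMB/total.Seconds())
	}
}
```

At 50 ms, even a few thousand serialized requests swamp the ~94 s it takes to move 3.3 GB over a 35 MB/s link, which is why fetching packs in parallel and out of order helps so much.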
Here are new results. I built the PR branch:

```
$ git clone https://github.com/ifedorenko/restic.git ./restic2195
$ cd restic2195
$ git checkout out-of-order-restore-no-progress
$ go run -mod=vendor build.go
```

and repeated the restore with different repos:

```
$ time ./restic -r sftp:remote_host:path_to_repo restore latest --include path_to_subtree --verbose --target /scratch/
```
| latency to the repository host | test (restore) time | avg. data transfer rate |
|---|---|---|
| 1 ms | 101 s | 33 MB/s (was 25 MB/s) |
| 50 ms (orig. test) | 144 s | 23 MB/s (was 2.2 MB/s) |
| 50 ms (anoth. site) | 107 s | 31 MB/s (was 2.2 MB/s) |
So, the change is impressive! I haven’t tried different numbers of workers. As far as my concerns go, it already works great.