I’m testing restic 0.9.4 as a possible solution for backing up a relatively large collection of files (users’ home directories and research data files on a server) to remote storage via the sftp backend. The data size of both the remote restic repository and the original collection is about 250 GB. There are ~50K files in the repository and ~780K in the original collection. Both my server and the remote storage site run CentOS 7.
I’m fully satisfied with the backup process (really decent performance!). However, restoring data raises questions. In my first test I tried to restore a 3.3 GB, ~400-file (sub-)directory tree (a small subset of the original collection):
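For reference, the restore invocation was presumably of the following form (mirroring the command quoted later in the thread; `remote_host`, `path_to_repo` and `path_to_subtree` are placeholders):

```
$ time restic -r sftp:remote_host:path_to_repo restore latest --include path_to_subtree --verbose --target /scratch/
```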
It took 24 min, so the average data transfer rate was only 2.2 MB/s. The network link between the local server and the remote storage site sustains 30-40 MB/s (from routinely using it), and the test was repeated multiple times, so 2.2 MB/s is not a random drop in link performance.
In follow-up tests I moved the repository around: I put it on another local server and on 2 more remote sites at different distances (so network latency varied from 0.05 ms to 50 ms). Then I restored the same subtree onto the same server with the command above, varying only the remote repository and cleaning up the /scratch directory before each test. Here are my results:
| latency to the repository host | test (restore) time | avg. data transfer rate |
|---|---|---|
| 0.05 ms (local host) | 53 s | 60 MB/s |
| 1 ms | 130 s | 25 MB/s |
| 50 ms (orig. test) | 1450 s | 2.2 MB/s |
| 50 ms (anoth. site) | 1490 s | 2.2 MB/s |
Each test was repeated multiple times to make sure the results are reproducible. So it looks like it’s a feature of restic to slow down with the distance (in terms of network lag) to the repository location. Is that right? Or should I look for other reasons for the poor restic restore performance from the far (50 ms) sites?
Poor restore performance is a known issue and has been discussed in GitHub issue #2074. @ifedorenko opened pull request #2195, which should greatly improve restore performance. Unfortunately this pull request hasn’t been merged yet, so you either need to build restic yourself or wait for @fd0 to find some time to review it.
@d_v_v do you think you can test PR #2195, ideally with different numbers of restorer workers? I only tested with http-based repositories and don’t really know what to expect from sftp.
Somehow, reading the forum, I got the impression that improvements to the restore function had already been implemented in the 0.9.4 release. And I did observe that files were restored in parallel.
This is correct, the 0.9.4 restore is expected to work fast in most cases. The only known exception is the restore of very large files (hundreds of megabytes), which PR #2195 is meant to improve.
Hm! Then it’s probably not going to help. In my restore sample the average file size is 3300 MB / 400 ≈ 8 MB, and the maximum file size is ~200 MB (there are very few of those).
I’ll try it anyway.
My impression was that during the restore operation restic decides which chunks/packs to download next based on the recently downloaded ones. This would explain why the transfer speed drops with network latency. Roughly speaking, restic (my guess) has to download a group of packs, examine them, and only then does it know which packs to download next. At each such step the network lag introduces an extra delay.
For tree objects, yeah, this is the case. The trees list out directory contents, which include tree IDs for subdirectories as well as blob IDs for files. If your backups have a lot of directories, then there are more round trips to fetch them.
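To make that intuition concrete, here is a toy back-of-the-envelope model (my own illustration, not restic’s actual restore logic; the request counts are made up): if a restore issues N serialized requests, each one pays a full round trip before the next can start, so the total time is roughly N × RTT plus the raw transfer time.

```go
package main

// Toy latency model: a restore that issues N serialized requests pays one
// full round trip per request, so total time ≈ N*RTT + dataSize/bandwidth.
import (
	"fmt"
	"time"
)

func main() {
	dataMB := 3300.0             // size of the restored subtree
	linkMBps := 35.0             // raw link throughput
	rtt := 50 * time.Millisecond // latency to the far repository

	transferTime := time.Duration(dataMB / linkMBps * float64(time.Second))
	for _, serialReqs := range []int{10, 1000, 10000} {
		total := time.Duration(serialReqs)*rtt + transferTime
		fmt.Printf("serial requests=%-5d total≈%v  effective rate≈%.1f MB/s\n",
			serialReqs, total.Round(time.Second), dataMB/total.Seconds())
	}
}
```

At 50 ms, even a few thousand serialized requests swamp the ~94 s it takes to move 3.3 GB over a 35 MB/s link, which is why fetching packs in parallel and out of order helps so much.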
Here are new results. I built the PR branch:

```
$ git clone https://github.com/ifedorenko/restic.git ./restic2195
$ cd restic2195
$ git checkout out-of-order-restore-no-progress
$ go run -mod=vendor build.go
```

and repeated the restore with different repos:

```
$ time ./restic -r sftp:remote_host:path_to_repo restore latest --include path_to_subtree --verbose --target /scratch/
```
| latency to the repository host | test (restore) time | avg. data transfer rate |
|---|---|---|
| 1 ms | 101 s | 33 MB/s (was 25 MB/s) |
| 50 ms (orig. test) | 144 s | 23 MB/s (was 2.2 MB/s) |
| 50 ms (anoth. site) | 107 s | 31 MB/s (was 2.2 MB/s) |
So, the change is impressive! I haven’t tried different numbers of workers. As far as my concerns go, it already works great.