Restore strategy

Sina · November 23, 2019, 6:01am

StrikeLines’ post about his restore case made me think about certain things…

What might be best practice for disaster restore with restic?
Is it possible to forecast memory usage of certain operations? (particularly mount, restore and stats)
From an overall perspective, are per-folder restore operations (one by one, starting with the most essential ones) significantly slower than a full restore from the start? (I guess yes.)
What is best practice for forecasting the time left until a restore finishes?
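On the last question, one rough approach is to sample how many bytes have been written to the restore target at intervals and extrapolate from the average rate. The sketch below assumes you already know the total size to restore and can collect the samples yourself (for example by measuring the target directory); the function name and sample format are illustrative, not part of restic. Restore speed is rarely constant, so treat the result as a ballpark figure.

```python
def estimate_seconds_left(total_bytes, samples):
    """Estimate remaining restore time.

    samples is a list of (elapsed_seconds, bytes_restored) tuples in
    chronological order. The average rate between the first and last
    sample is used. Returns None if no progress can be measured yet.
    """
    if len(samples) < 2:
        return None
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0 or b1 <= b0:
        return None
    rate = (b1 - b0) / (t1 - t0)  # bytes per second
    remaining = max(total_bytes - b1, 0)
    return remaining / rate


# Example: 1 TB total, 100 GB restored in the first hour.
samples = [(0, 0), (3600, 100 * 10**9)]
hours_left = estimate_seconds_left(10**12, samples) / 3600
print(f"about {hours_left:.1f} hours left")
```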

Backup strategy: Would it be beneficial (for restore speed) to split the data to be backed up into groups of similar importance? (I guess yes, at least when using a dedicated restic repository for each importance class, at the cost of deduplication benefit. I’m unsure whether splitting into different snapshots might also help a bit.)

The best way to find out would of course be to test disaster restore scenarios, which is always advisable but often quite time- and resource-consuming. Has anyone tried different approaches and would like to share their findings?

From a theoretical perspective: Could anyone with deeper knowledge of how restic works give some advice?

cfbao · November 23, 2019, 4:14pm

I don’t think there’s a solid way to estimate restore memory usage other than heuristics from past case studies.

On backup strategy: if you have multiple data sets that don’t overlap much, back them up to separate smaller repos, not one large repo. In particular, have your most important data backed up to a dedicated repo.

StrikeLines · November 23, 2019, 5:05pm

Thanks for making this post. I was going to post a followup to ask for advice on this very topic.

We absolutely must get this data restored before Black Friday. Unfortunately it’s all buried in a single 50TB backup repository, but there’s no fixing that now.

My current plan is to go in this weekend and convert one or two of our high-performance data processing servers into dedicated restic restore machines. These servers have two physical CPUs, 32 cores, 130 GB of RAM, and 20 Gb network connections.

Our backup server containing the restic repository has a six-core processor, 32 GB of RAM and a 10 Gb network connection. The backup repository is shared to the local network with Samba.

The target for the restored files is a new enterprise FreeNAS ZFS server that we built to replace the RAID-5 array that failed. This server also has a 20 Gb network connection.

My strategy is to organize the repository into 5-6 subsets of about 10 TB each. Each subset will be restored using the --include flag on a separate restic restore instance running on the high-performance servers.
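A minimal sketch of how such a plan could be scripted: it builds one `restic restore` command line per subset, each with its own `--include` path. The repository location, snapshot ID, target directory and subset paths below are hypothetical placeholders, and the script only prints the commands rather than running them.

```python
import shlex


def build_restore_commands(repo, snapshot, target, subsets):
    """Return one restic restore argv list per subset path."""
    commands = []
    for include_path in subsets:
        commands.append([
            "restic", "-r", repo,
            "restore", snapshot,
            "--target", target,
            "--include", include_path,
        ])
    return commands


# Hypothetical values, for illustration only.
commands = build_restore_commands(
    repo="/mnt/backup/restic-repo",
    snapshot="latest",
    target="/mnt/freenas/restore",
    subsets=["/data/projects", "/data/imagery", "/data/archive"],
)
for argv in commands:
    print(shlex.join(argv))
```

Each printed line could then be started in its own terminal or job on one of the restore machines. Whether running them in parallel is actually faster than running them one after another is exactly the open question below.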

Questions:

1. Rather than restoring across the network, could I get better performance by cloning the repository to the FreeNAS server, then restoring it in place? (It would take about 15 hours to copy the repository across our office network.)
2. How much memory and processing resources will a single instance of restic restore use? All available, or is there a cap? Would it ultimately be faster to restore each subset sequentially, rather than simultaneously, on our high-performance servers?

2b. Where is the performance bottleneck during restores? Disk read speed? Network speed? Processor cores? RAM?

3. Is there a better way to do this? Should we avoid restic restore, and try to mount the repository and manually copy files off? Is a mount and rsync potentially faster?
4. Does --exclude behave the same for restores as it does for backups? Can you use wildcards to exclude certain filetypes from the restore? For example: --exclude="*.tiff" to skip geotiff files during the restore process. It’s unclear from the documentation if this will work during a restore.
5. Are there any unofficial forks of restic that are optimized for faster restores?
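As a rough sanity check on the 15-hour copy estimate above, the arithmetic can be sketched as follows. It assumes 1 TB = 10^12 bytes and that the backup server's 10 Gb link is the limiting factor; disk speed and Samba overhead are ignored.

```python
def transfer_hours(size_tb, link_gbit_per_s):
    """Hours needed to move size_tb terabytes at the full link rate."""
    size_bits = size_tb * 1e12 * 8
    return size_bits / (link_gbit_per_s * 1e9) / 3600


def implied_gbit_per_s(size_tb, hours):
    """Average throughput implied by moving size_tb terabytes in the given hours."""
    return size_tb * 1e12 * 8 / (hours * 3600) / 1e9


print(f"{transfer_hours(50, 10):.1f} h at full 10 Gb line rate")
print(f"{implied_gbit_per_s(50, 15):.1f} Gbit/s needed to finish in 15 h")
```

Under these assumptions the 15-hour figure corresponds to sustaining roughly three quarters of the 10 Gb link for the whole copy.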

Thanks for the advice. We desperately need to get this data restored.

robert · November 23, 2019, 6:12pm

If the repository is on another machine, network latency would probably be the bottleneck.

I would suggest that you try the branch of pull-request #2195 [1]. In my tests the speedup was huge: I got around 1 MiB/s with the current restore implementation, compared to 28 MiB/s with the branch, which nearly maxes out the 250 Mbit/s connection. This was between two different datacenters (OVH Gravelines -> Hetzner Nürnberg).
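To put those figures side by side, the unit conversion can be sketched like this. It assumes the usual definitions (1 Mbit = 10^6 bits, 1 MiB = 2^20 bytes) and ignores protocol overhead.

```python
MIB = 1024 * 1024  # bytes per MiB


def mbit_to_mib_per_s(mbit_per_s):
    """Convert a link speed in Mbit/s to MiB/s."""
    return mbit_per_s * 1_000_000 / 8 / MIB


link_mib = mbit_to_mib_per_s(250)
print(f"250 Mbit/s is about {link_mib:.1f} MiB/s")
print(f"28 MiB/s uses about {28 / link_mib:.0%} of the link")
print(f"1 MiB/s uses about {1 / link_mib:.0%} of the link")
```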

Even if the repository is on local storage, I would assume that the “out-of-order restorer” is much faster than the very conservative mainline implementation.

A few links that may help you to understand the current state better:
[1] https://github.com/restic/restic/pull/2195
[2] https://github.com/restic/restic/issues/2074
[3] Degraded restore performance (S3 backend)
[4] Slow restoring speed

StrikeLines · November 23, 2019, 11:51pm

Thank you for the info on PR#2195. I’ll compile that branch on our processing server and give it a shot. The discussion on that branch seems very promising. Cheers!

ifedorenko · November 29, 2019, 1:21pm

Restore memory footprint should be more or less stable after it has loaded the repository index and calculated the restore plan. I guess this is a little late now, but you should get a good idea of memory usage by running restore until it starts writing files.
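One way to watch for that plateau on Linux is to sample the process's resident memory and check when it stops growing. The sketch below is an assumption about how you might do this, not something restic provides: it parses the `VmRSS` line from the text of `/proc/<pid>/status` and uses an arbitrary 5% tolerance over the last five samples.

```python
import re


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (reported in kB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def has_plateaued(samples_kib, window=5, tolerance=0.05):
    """True if the last `window` samples vary by at most `tolerance` (relative to the max)."""
    if len(samples_kib) < window:
        return False
    recent = samples_kib[-window:]
    low, high = min(recent), max(recent)
    return high > 0 and (high - low) / high <= tolerance


# Example with made-up samples: memory grows while the index loads, then levels off.
samples = [100, 200, 300, 1000, 1010, 1005, 1000, 1002]
print(has_plateaued(samples))
```

In practice you would read `/proc/<pid>/status` for the running restic process every few seconds, append the parsed value to the list, and note the level at which the check first succeeds.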