22TB restore: Mistakes were made. Lessons learned

Restoring from a mounted repo stored on both Google Cloud Storage and S3 was running at 3.5 to 7.5 MB/s on a 1 Gbit line. The destination was a RAID array capable of writing at nearly 10 GB/s. I tried a lot of tricks to speed this up. Most of them didn't work. I'm documenting some of them here in the hope that someone else might avoid these costly experiments.

Convert coldline to standard
The first thing I suspected of slowing the process down was the storage class: the blobs were all nearline or coldline. It turned out the rewrite operation on GCS wasn't working, so the workaround was to clone the entire repo into another bucket with the standard storage class. This was expensive and did not speed up the restore. Would not recommend.
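Roughly, the clone amounted to something like the following (bucket names are placeholders; gsutil rewrite -s standard would normally be the more direct route, had it worked here):

```
# Create a new bucket with Standard storage class and copy the repo into it.
gsutil mb -c standard gs://restic-repo-standard
gsutil -m cp -r gs://restic-repo-coldline/* gs://restic-repo-standard/
```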

Proxy data on a fast cloud box
I provisioned a compute instance with a high-performance SSD attached, large enough to hold the entire repo, and copied the repo over. I then tried the restore from that location. This was also expensive and did not improve performance. Would not recommend.

Copy the repo locally
There was enough extra room on the RAID array that I figured: why not copy the whole kit and caboodle down and restore from there? Not only was this a very, very slow operation, it also took a lot of retries for individual files. This was the most disappointing effort because of all the extra time it added to the operation. In the end, mounted restore speeds were only marginally faster, and that's when the multiple concurrent rsync operations didn't all hang because the RAID server was both the source and the destination for the restore. I'd often find all the rsyncs hung because one of them was missing a single blob file. Would only recommend if the source and destination are different filesystems and the source has fast read access.

Multiple concurrent rsyncs from cloud
In the end I went back to basics. I had already broken the restore down into several scripts that would run rsyncs on different parts of the restored filesystem. I could run them all in parallel, and each ran at the original rates from above. The individual restore speed is no faster, but the aggregate restore is much better. Mounting directly from the cloud proved to be the most effective approach, because pulling multiple blobs at random is no sweat for the cloud. This is the best method I could find. If I had to do this over again, I'd write a script that would carve up the final filesystem and automatically generate any number of individual rsync scripts to run in parallel, along the lines of the sketch below.
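A minimal sketch of that idea, assuming a restic mount at /mnt/restic and GNU find/xargs (paths and the parallelism count are placeholders):

```
#!/usr/bin/env bash
# Carve the restore into one rsync per top-level directory of the mounted
# snapshot and run a handful of them in parallel.
MOUNT=/mnt/restic/snapshots/latest   # restic mount point (latest snapshot)
DEST=/raid/restore
JOBS=6                               # concurrent rsyncs

find "$MOUNT" -mindepth 1 -maxdepth 1 -type d -printf '%f\0' |
  xargs -0 -P "$JOBS" -I{} rsync -a --partial "$MOUNT/{}/" "$DEST/{}/"
```

Each rsync is independently resumable, so an interrupted directory can simply be re-run.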

Hope this helps someone down the line.


Thank you so much for sharing.
My English is not perfect; did I understand correctly that you were restoring using restic mount and then rsync from that?
I can't recommend this in general, because using restic restore is much faster.

Thanks for the clarifying question! All test cases were performed with restic mount and rsync, because we knew there would be interruptions during an operation of this size. As far as I know, restic restore is still a monolithic operation, so the only apparent option was mount + rsync.
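For reference, the basic shape of each test was roughly the following (a minimal sketch; the repository URL and paths are placeholders):

```
# Terminal 1: mount the repository (stays in the foreground until unmounted)
restic -r gs:my-bucket:/ mount /mnt/restic

# Terminal 2: pull from the mounted snapshot, retrying after interruptions.
# rsync only transfers what is missing or incomplete at the destination.
until rsync -a --partial /mnt/restic/snapshots/latest/ /raid/restore/; do
  echo "rsync interrupted, retrying..." >&2
  sleep 10
done
```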


Good point. I hadn't thought about that until now, because my largest restore so far was 150 GB and restic restore did that more or less while I had lunch.
So you are right: rsyncing from a mounted repo seems to be the only way to perform multiple operations and/or resume after an interruption.

@tiz.io Can you give some information about which restic version you were using? There have been quite a few improvements to mount in recent versions. But restore is of course still faster.

All restore operations were performed with 0.12.0. FWIW, the repo was created over time with all versions from 0.10.0 on.

If you're already scripting, why not make the script run restic restore on parts of the file tree instead, and if there's a problem restoring one part, retry it later? I wouldn't even think of using the mount for restoring 22 TB :slight_smile:

I’ve never considered the difficulties in restoring large repos (particularly remote and “partially-different” ones), having only run smaller test restorations. This sounds like pretty core functionality for software like restic; having to resort so quickly to scripting doesn’t seem ideal.

Given the limitations with using the mount option, would it be useful for restic restore to support difference-only restoration (optionally deleting files in the restore destination which are not present in the repo)?

I'm not a developer, so I don't know how big an effort this would be, but I naively assume a lot of the needed comparison functionality must already be present, as used by restic backup?

There is already

You are right, it is not much programming to solve, but the technical details are a bit involved: restore is pretty fast because it does not restore file-by-file (as the mount solution would). Instead it reads complete pack files and writes the contained blobs to all file parts where they are needed.

It would be easy to add a check that skips a file entirely if it is already completely present (e.g. from a previously aborted restore). It would, however, be much more helpful to skip all already-correct blobs within a file and restore only the still-missing blobs.

I can try to work on an implementation if I can spare enough time (unfortunately I'm quite busy ATM).


@rawtaz That could work well in another case, but this particular filesystem gets very wide very quickly. In a nutshell, I could easily break this into two, eight, or ~500 discrete rsyncs without resorting to long lists of include/exclude statements. (Love the avatar BTW. I use that gif whenever someone asks what misophonia is.) But as long as I was programming anyway, I was quite tempted to modify restic…

@alexweiss This looks like it's describing the ideal implementation. FWIW, I would have hacked together a mod that looked for existing files (ideally matched by size, if feasible) and, on a hit, either skipped that file or rewrote the destination to /dev/null. In the end, modifying restic was out of scope for this project, since one of my objectives was to generate a DR playbook others could easily emulate.

Ah, I was thinking more along the lines of first grabbing a list of the directories in a snapshot and then restoring parts of that list, selected by whatever means is appropriate in your case, using restic restore --include for each iteration. No need for a long list of includes/excludes, assuming you script it. Regardless, the goal would be to not have to restart more than small pieces of the restore in case it's interrupted somehow. But indeed, if you have a way to get a smarter restore instead, that's better.
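A rough sketch of that idea (the repository URL, target path, and directory list are placeholders):

```
# Restore a snapshot in resumable pieces: one restic restore per top-level
# directory, retrying any piece that fails.
DIRS=(projects media databases home)   # illustrative directory names

for d in "${DIRS[@]}"; do
  until restic -r gs:my-bucket:/ restore latest \
        --target /raid/restore --include "/data/$d"; do
    echo "restore of /data/$d failed, retrying..." >&2
    sleep 10
  done
done
```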

Thanks! I don’t know how flattered I should feel by being illustrated as a chewing gum annoyance, but oh well! :slight_smile:


I created this PR:

Note that this is still WIP and needs more testing.

@tiz.io If you are willing to test alpha code, I would appreciate your feedback on this change!


Does the 22TB restore contain lots of small files or mostly larger files (>1MB)?

The fundamental performance problem of a mounted repository is that files (and also all bytes within a file) are copied one after another. That is, the content of a file is only requested when the file is read, so throughput depends entirely on the latency to the storage backend. Small files are particularly bad, as each file requires a separate backend request. For larger files there is some caching within restic and also the readahead mechanism in the kernel, which prefetches the next part of a file. That performance problem is largely caused by the filesystem interface, which doesn't allow restic to plan ahead and return file parts in an optimized way.

I've noticed that restic apparently did not enable readahead in fuse, causing even reads from a local repository on an SSD to max out below 150 MB/s. This is fixed by the following PR: mount: enable fuse readahead by MichaelEischer · Pull Request #3426 · restic/restic · GitHub. However, in my testing that optimization only helps with low-latency backends, so I'm not really sure how relevant that change is for remote access to GCS / S3.

Using rclone instead of rsync might also be an option. rclone supports transferring multiple files in parallel with rclone copy --transfers=16 src dst, which could hide quite a lot of latency.
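For example, something along these lines against the mounted snapshot (paths and the transfer count are just illustrative):

```
# Copy from the restic mount with several parallel transfers to hide latency.
rclone copy --transfers=16 /mnt/restic/snapshots/latest/ /raid/restore/
```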


@rawtaz
Well, you got me thinking about what else I could do to stretch the limits and squeeze as much performance as possible out of a multi-threaded approach. I wrote a bit of automation that uses a combination of find, df, and xargs to generate large files containing hundreds of semi-balanced individual rsync commands, and then uses xjobs to execute a limited number of lines from those files in parallel.
I found that on my machine (which is not typical: 48 vCPUs, 256 GB RAM) I can maximize traffic from a single restic mount with 6 concurrent rsync pulls. This peaks at about 200 Mbps.

Since going parallel seemed to be beneficial, I decided to take it to the next level and create four separate restic mount points, all pointing at the same GCS repo. I redistributed the jobs file to use all four mounts and ran 24 rsyncs in parallel. With this approach I was able to max out at 850 Mbps. It's absurd, but in a pinch it would definitely provide a resumable restore for very large repos. In rough outline it looks like the sketch below.
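A simplified sketch of the setup (mount points, paths, and counts are illustrative; the real scripts also balance the jobs by size using df/du):

```
# One mount per terminal, all pointing at the same repo:
#   restic -r gs:my-bucket:/ mount /mnt/restic1   # ...repeat for restic2..restic4

# Build a job file: one rsync per top-level directory, round-robined across
# the mounts (assumes directory names without spaces).
n=0
for d in /mnt/restic1/snapshots/latest/*/; do
  name=$(basename "$d")
  m=$(( n % 4 + 1 )); n=$(( n + 1 ))
  echo "rsync -a --partial /mnt/restic$m/snapshots/latest/$name/ /raid/restore/$name/"
done > rsync-jobs.txt

# Run 24 jobs at a time; xjobs reads one command per line from stdin.
xjobs -j 24 < rsync-jobs.txt
```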

All that said, it looks like @alexweiss has come to the rescue with a PR! Alex, I’m happy to run a test! I’ll get back to you in a bit with some results.

That's great background info. This test repo is kind of the worst of both worlds: there are hundreds of files > 20 GB and also many directories with thousands of < 100 KB files. In my experience you're 100% right that the tiny files are by far the slowest going; average transfer rates, even when highly parallelized, are a small fraction of those for the big files.

I compiled this PR and ran restic restore latest --verbose 3 against a 2 TB repo that had been about 40% restored. It has spent the last two hours processing, perplexingly downloading data from the destination NAS at about 1 Gbit while only reading from the cloud repo occasionally, at about 5 Mbit on average. As far as I can tell nothing has been restored yet. I can't say what restic is up to, because it hasn't printed any messages since the initial snapshot ID despite the --verbose flag.

I haven’t used restic restore previously so I can’t tell if this is typical behavior, but I am concerned that restic has copied more data from the NAS than the restore target actually contains.

The PR initially checks every existing file for whether it already contains the expected contents. For that check it has to read and hash the complete file first. Based on a 2 TB repository that was 40% restored and is being verified at 125 MB/s, I'd expect this to take between 2 and 5 hours.

The restore command is in general very quiet; even --verbose doesn't help ATM.


As @MichaelEischer correctly mentioned, with my PR restic restore first reads the already-restored files to verify them. Only then are the correctly verified parts omitted from the restore.

Did this PR help you speed up your restore?

Could you elaborate on this a bit? Will restic request the blob and not get it, or does restic have a function to manage storage tiering in that case? Any idea why the mount worked despite this? Is it "only" because of better retrying?

Maybe make this a two-step process: first restore only missing (lost+found) files and files with a different size, and only with a special flag also do a deep compare at lower priority.