Optimizing backup for slow disks

akaFTS · October 9, 2022, 5:05pm

I have two machines with a similar amount of data (~600GB) uploading backups to Google Drive. Both use the same settings: 32MB packsize and 8 concurrent file reads.

However, one of them has a simple HDD and the other has two fast SSDs running in RAID 0. The difference in backup time is very significant: the former takes 3:30 hours and the latter 40 minutes. Running the Linux top command shows that the slower machine indeed has its CPUs waiting for I/O most of the time while the faster one barely ever waits.

I don’t expect the HDD machine’s backup to ever be as fast as the SSD one, for obvious reasons. Still, are there any tweaks in restic that might improve speed when the disk is the bottleneck?

nicnab · October 10, 2022, 8:39pm

Hmm I’m by no means pro enough to be able to help here but can you elaborate some more? restic version and command, system details and how do you see your CPUs are waiting for I/O?

From a layman’s perspective, ~50 megs/s generally doesn’t sound too bad.

akaFTS · October 10, 2022, 8:53pm

I actually expressed myself wrong. The disks do have ~600GB each but only around 2 GB of data is modified at each backup.

I’m using restic backup with restic 0.14 compiled straight from the master branch in order to use the --read-concurrency flag which is not in the stable release yet.

When you run the top command on Linux and then press 1, you can see a breakdown of how each core is being used. There is a column called “wa”, which stands for waiting for I/O; this column hits around 80% on the slower machine (meaning that the CPU is waiting for the disk 80% of the time) and around 1% on the faster one.
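For anyone who wants the same number without watching top interactively, here is a minimal sketch that derives a per-core iowait percentage from two samples of /proc/stat. It assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up for this example.

```python
import time


def parse_cpu_lines(stat_text):
    """Return {core_name: [counter, ...]} for the per-core lines of /proc/stat."""
    cores = {}
    for line in stat_text.splitlines():
        parts = line.split()
        # Per-core lines look like "cpu0 user nice system idle iowait irq ...".
        # The aggregate line is named just "cpu" and is skipped here.
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            cores[parts[0]] = [int(x) for x in parts[1:]]
    return cores


def iowait_percent(before, after):
    """Percentage of time each core spent waiting for I/O between two samples."""
    first, second = parse_cpu_lines(before), parse_cpu_lines(after)
    result = {}
    for name, old in first.items():
        if name not in second:
            continue
        # Only the first 8 counters are summed: guest time is already
        # included in user time, so adding it would count it twice.
        deltas = [b - a for a, b in zip(old[:8], second[name][:8])]
        total = sum(deltas)
        result[name] = 100.0 * deltas[4] / total if total else 0.0
    return result


def sample(interval=1.0):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = f.read()
    return iowait_percent(before, after)
```

Calling `sample(5)` while the backup is running should give values comparable to the “wa” column, averaged over five seconds.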

nicnab · October 10, 2022, 9:03pm

Thanks! Sorry for asking stupid questions while the experts are probably still brewing their morning coffee, but I understand “waiting for I/O” to mean that your HDD can’t deliver the data fast enough, correct? If you read 8 concurrent streams of data from a slow HDD, that should probably be expected, as the HDD will be mostly busy moving heads from one place to another and will rarely get to read the files that are there.

Have you tried the standard restic settings, or 1 concurrent read, to see if that’s faster on this system?
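A rough way to check this outside of restic is to time reading the same set of files with different numbers of parallel readers. The sketch below is only an illustration with invented function names; it is not how restic reads files, and the numbers are only meaningful if the files are not already in the page cache (for example, use a different set of files for each run).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path, block_size=1 << 20):
    """Read one file to the end and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += len(block)
    return total


def measure(paths, workers):
    """Read all files using `workers` threads; return (bytes_read, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(read_file, paths))
    return total, time.perf_counter() - start


def compare(paths, worker_counts=(1, 2, 4, 8)):
    """Print the throughput in MB/s for each level of concurrency."""
    for workers in worker_counts:
        total, seconds = measure(paths, workers)
        rate = total / seconds / 1e6 if seconds > 0 else float("inf")
        print(f"{workers} readers: {rate:.1f} MB/s")
```

On a single spinning disk, throughput can drop as the number of readers grows because of the extra seeking; on SSDs it usually does not.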

akaFTS · October 10, 2022, 9:24pm

Thank you for the suggestion but yes, I tried 4 concurrent reads instead of 8 and it got worse. 16 didn’t seem to improve it though, perhaps 8 is the magic number.

MichaelEischer · October 11, 2022, 8:08pm

You might want to give PR #3955 (“Improve archiver performance for small files”, restic/restic on GitHub) a try and switch back to the default read concurrency of 2.

Also, as mentioned in the --read-concurrency PR, that flag should only be necessary for backups from/to really fast storage.

What is “got worse” in numbers? A bit slower, like 4 hours instead of 3:30? Or more or less?

How many files does the backup data consist of? 600 GB in 100 or 1 million files are two completely different things in terms of possible bottlenecks.

Which filesystem is the data stored on? After the initial backup, restic should skip all files which were not modified since the last backup. Thus I’d expect things to take minutes not hours.

akaFTS · October 12, 2022, 4:14pm

Yes, precisely 4 hours instead of 3:30.

This is the detailed structure of the backup as reported by restic itself:

scan finished in 2439.076s: 4005609 files, 589.760 GiB

Files:        2007 new, 1884027 changed, 2119575 unmodified
Dirs:           95 new, 211970 changed, 78061 unmodified
Data Blobs:   9629 new
Tree Blobs:  169413 new
Added to the repo: 5.125 GiB

processed 4005609 files, 589.761 GiB in 3:36:06

And here is a backup from a much larger disk in way less time:

scan finished in 4118.295s: 10172846 files, 1.646 TiB

Files:       18523 new, 66073 changed, 10088250 unmodified
Dirs:          588 new, 15140 changed, 1243392 unmodified
Data Blobs:  17613 new
Tree Blobs:  14610 new
Added to the repo: 4.533 GiB

processed 10172846 files, 1.646 TiB in 1:15:32

I will try out your PR with the default read concurrency and report back here.

MichaelEischer · October 14, 2022, 7:02pm

Apparently half the files in the dataset are detected as changed. Checking the content of 2 million files takes time. The key question here is why? Which filesystem is the data stored on? Is it expected to have that many file changes?
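To put numbers on that, here is a small sketch that computes the share of files restic had to read again (new plus changed) in the two summaries posted above. The figures are copied from those summaries; the helper name is made up for this example.

```python
def reread_fraction(new, changed, unmodified):
    """Fraction of files whose content had to be read (new + changed)."""
    total = new + changed + unmodified
    return (new + changed) / total


# Figures taken from the "Files:" lines of the two backup summaries.
slow_hdd = reread_fraction(new=2007, changed=1884027, unmodified=2119575)
larger_disk = reread_fraction(new=18523, changed=66073, unmodified=10088250)

print(f"slow HDD machine: {slow_hdd:.1%} of files read again")
print(f"larger disk:      {larger_disk:.1%} of files read again")
```

The slow machine rereads roughly 47% of its files, while the larger disk rereads under 1%, which fits the difference in run time much better than the disk type alone.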

akaFTS · October 15, 2022, 5:40pm

You are correct; my slowest machines are indeed the ones with the largest amount of changed files. Since the backup routine is daily, there should definitely not be 2 million modifications. I will investigate further on what might be causing that, such as the filesystem as you mentioned.

In the meantime, I’d like to understand a little better how this scanning works. Supposing that I simply touch two million files, restic will need to open each one, compare its content to what it has in the repo, realize that nothing changed and move on. But if I add a few extra bytes of text at the end of each file, will restic copy the entire changed file to the repo or only the modified blocks? I’m asking because I have some Microsoft Outlook .pst files that weigh ~100GB and get touched every day, but restic seems to upload only the changes, since the backups average 2-3 GB daily instead of hundreds.

MichaelEischer · October 16, 2022, 10:16am

While reading a file, restic will split it into smaller chunks (about 1MB) and will only store new chunks (deduplication). Thus when appending new bytes to a file, likely only the last chunk will change. Although for small files, this essentially means storing the whole file again.
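A toy illustration of the idea, under loud assumptions: this is not restic's actual chunker, the rolling hash is a simple stand-in, and the chunk sizes are tiny (around 1 KB instead of about 1MB) so that it runs quickly. All names are invented for the example. It splits data at content-defined boundaries and stores only chunks whose hash has not been seen before.

```python
import hashlib
import random

# Fixed pseudo-random table for the toy rolling hash (an assumption of this
# sketch, unrelated to restic's real parameters).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data, mask=0x3FF, min_size=256, max_size=8192):
    """Split data where the rolling hash hits a boundary pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # the final chunk ends at end of file
    return chunks


def backup(store, data):
    """Add unseen chunks to `store`; return the number of bytes added."""
    added = 0
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).digest()
        if digest not in store:
            store.add(digest)
            added += len(chunk)
    return added


rng = random.Random(1)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
store = set()
first_run = backup(store, original)
second_run = backup(store, original + b"a few extra bytes")
print(first_run, second_run)
```

Because the boundaries depend only on the content seen so far, every chunk before the last one is identical in both runs, so the second run stores only the tail of the file.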

akaFTS · October 16, 2022, 11:42pm

This is very interesting, thanks. Calculating some sort of checksum for each 1MB chunk, comparing it to what is stored and uploading the diffs is not cheap, so I imagine there’s an initial step of checking the file’s last modified date before doing this deeper comparison? This would mean that if any program on my machine is touching several files unnecessarily, restic needs to go through this deep scan for each one, and that might be what is making the backup slow.

MichaelEischer · October 17, 2022, 8:10pm

Exactly. By default, the backup command only reads a file again if its last modified date, file size, etc. differ from the previous backup.
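As a simplified sketch of that kind of check (only size and modification time are compared here, the “etc.” is left out, and the names are invented for this example rather than taken from restic):

```python
import os
from typing import NamedTuple, Optional


class FileRecord(NamedTuple):
    """Metadata remembered from the previous backup (hypothetical structure)."""
    size: int
    mtime_ns: int


def snapshot_record(path):
    """Collect the metadata used for the cheap comparison."""
    st = os.stat(path)
    return FileRecord(st.st_size, st.st_mtime_ns)


def needs_reread(path, previous: Optional[FileRecord]):
    """True if the file is new or its metadata differs from the last backup."""
    if previous is None:
        return True  # never seen before, so its content must be read
    return snapshot_record(path) != previous
```

A plain touch changes the modification time, so a file touched every day is read in full every day even if its content is identical; deduplication then keeps the upload small, but the disk read still happens.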

This scan rate of about 40MB/s is probably limited by reading the data from disk. Checking for duplicates should manage > 100MB/s using a single CPU core, and more CPU cores improve performance.