Optimize backup process steps

mic_p · June 7, 2018, 8:05am

Hi @all,
I’m experimenting with restic on big datasets (more than 1 million small files, 250 GB) with a short backup window (1 snapshot every 4 hours), where each backup needs to save fewer than 100 to 200 files (or sometimes even 0 files).

It works well, but I’m trying to analyze how to optimize the backup process time.

From my tests, restic splits its internal process into these steps (assuming the first backup is already done):

1) Read the index file (I see memory usage grow from 10-20 MB to 350 MB, CPU at 100%); this takes some seconds.
2) Scan the target path (memory stays more or less flat, CPU at 100% or close to it); this step takes two or three minutes. I think restic is scanning the target path because it increments the total file count and GiB size. There is no ETA indication.
3) Do something I can’t identify: this step takes a lot of time. The ETA is 3:30 to 4:00 h. During this step memory stays where it is and CPU doesn’t exceed 10-15%. As said, in some cases restic doesn’t save any files (since there are none to save).

Is it possible to optimize step 3? Or is there some other trick? (Please don’t tell me to split the dataset.)

If you need some tests, I’m happy to help and make restic better!

Thanks a lot

764287 · June 7, 2018, 9:42am

The CPU usage of 10-15% indicates that restic isn’t performing at its full potential. The slowness of your backups might be caused by the hardware, the protocol or the backend. Can you share some more details?

PS: I have a dataset of ~500k (mostly) small files which takes a few minutes to back up, so I know that restic can be very fast on this kind of dataset.

mic_p · June 7, 2018, 12:18pm

Hi,
thanks for the reply.

I’m running in a virtualized environment (VMware Enterprise Plus) on a Skylake processor (the hardware is HP Gen10), with 16 GB RAM (6 to 8 GB used, the rest free) and NetApp SSD storage connected via 16 Gbps multipath FC.

I tested it on multiple virtual machines and the performance is the same.

On the same VMs, I have databases and other heavy software that use (and stress!) the storage and CPU much, much more than restic.

But the question is: why are there two steps (2 and 3 in my list)? And why is step 2 (which scans all the files) so much quicker than step 3?
What does restic do in step 3?

The backend isn’t called unless a new (or modified) file needs to be backed up, so it isn’t the bottleneck. The protocol is S3 (minio) and the data is on “localhost” (a disk mapped as F:\ on the Windows server).

Thanks a lot!
Michele

764287 · June 7, 2018, 1:59pm

I’ve seen several issues on GitHub that indicate problems with the S3 minio backend. Can you try again with a different backend?

mic_p · June 7, 2018, 2:59pm

Hi,
I don’t have problems with the backend: 1) I use a local cache and 2) when there are no files to save, there is no network traffic (as I expect, since there is no need to save data on the backend except the snapshot name and other small things).

I think there’s something else going on that the developers know about and I don’t!

Thanks

zcalusic · June 7, 2018, 3:07pm

@mic_p, I find vmstat 1 an invaluable tool for debugging very subtle issues. If you can catch your problematic step 3, vmstat 1 can show you whether you’re CPU bound or disk bound (seen as the values in the wait column rise). Of course, you might also be network bound, so running ifstat 1 in another terminal is also a good idea.

Basically, all your computer resources are contested all the time, at one point the CPU might be saturated (you’re compiling for example), at another point your disk is the slowest link (and vmstat 1 and iostat -x 1 can detect that). Or network, where there are latency and bandwidth related issues.

Try to provide more information with these tools when step 3 is running, and then we can quickly pinpoint if it’s CPU, or disk or network or even memory (swapin/swapout).

mic_p · June 7, 2018, 3:21pm

Hi,
this VM runs Windows.
Can you recommend any tools that could replace vmstat (or strace) on Windows?

Thanks
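
One way to get a first impression on any platform, Windows included, is to time a metadata-only walk of the target against a walk that reads every file. The sketch below is not part of restic and uses only the Python standard library; the function names `time_metadata_walk` and `time_full_read` are made up for this illustration. If the full read takes roughly as long as the slow step 3, that would hint at file contents being read rather than only metadata being checked.

```python
import os
import time


def time_metadata_walk(root):
    """Walk `root` and lstat every file without opening any of them.

    Returns (file_count, total_bytes, elapsed_seconds).
    """
    start = time.perf_counter()
    files = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is not accessible; skip it
            files += 1
            total_bytes += st.st_size
    return files, total_bytes, time.perf_counter() - start


def time_full_read(root, chunk_size=1024 * 1024):
    """Walk `root` and read every file completely, discarding the data.

    Returns (file_count, total_bytes, elapsed_seconds).
    """
    start = time.perf_counter()
    files = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                with open(os.path.join(dirpath, name), "rb") as fh:
                    while True:
                        chunk = fh.read(chunk_size)
                        if not chunk:
                            break
                        total_bytes += len(chunk)
            except OSError:
                continue  # unreadable file; skip it
            files += 1
    return files, total_bytes, time.perf_counter() - start
```

Calling both functions on the backup target (for example a hypothetical `D:\data`) and comparing the elapsed seconds gives a rough baseline. Operating system caching can make a second run faster, so the numbers are only indicative.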

fd0 · June 8, 2018, 5:51pm

Which version of restic did you use to test? Run restic version to find out. The reason for my question is that this has changed slightly in 0.9.0 compared to 0.8.3.

Your analysis is correct so far: Step 1 reads the index, step 2 scans the target (so restic can display an ETA) and step 3 is walking the target, reading new files and checking files which have been in the previous snapshot for modifications. In 0.9.0 you’ll get a progress report for step 3 which shows you which files are processed right now, and step 2 and 3 happen in parallel.

If you did not discover it already: 0.9.0 has the option -v -v which will show you for each file in the target what restic does (new file read all, old file modified, old file unmodified).
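
The three outcomes listed above (new, modified, unmodified) can be illustrated with a small sketch. It is an illustration under assumptions, not restic’s actual code: it assumes a file counts as unmodified when its size and modification time match what was recorded for the previous snapshot, and the names `FileMeta` and `classify` are hypothetical. Which metadata fields restic really compares is not stated in this thread.

```python
import os
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FileMeta:
    """Metadata remembered from the previous snapshot (hypothetical record)."""
    size: int
    mtime_ns: int


def classify(path: str, previous: Dict[str, FileMeta]) -> str:
    """Return 'new', 'modified' or 'unmodified' for `path`.

    Only lstat() is called, so no file content is read here. In a real
    backup tool, 'new' and 'modified' files would then be read in full.
    """
    st = os.lstat(path)
    old = previous.get(path)
    if old is None:
        return "new"
    # Assumption: size and mtime are enough to detect a change.
    if old.size != st.st_size or old.mtime_ns != st.st_mtime_ns:
        return "modified"
    return "unmodified"
```

With a check like this, a run with no changed files only costs one `lstat` per file, which is why a full re-read of unchanged files would stand out in the timings.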

mic_p · June 8, 2018, 7:46pm

Hi,
I have version 0.9

Do you think that 0.8.3 works better or differently?
If you want, I can do the same test (in a different repo) with 0.8.3 and report the results

No, I didn’t know the -v -v option, but I’ll try it.

I have another question: how does restic work out whether an old file has been modified or not? Does it read and compare the st_mtime from lstat? Or does it read all the data of the file?

Thanks a lot

mic_p · June 9, 2018, 6:10am

Hi,
I ran a very interesting test last night.
I ran restic 0.9 and 0.8 to back up the same dataset (~780k files, ~300G), using the same S3 repo, each with its own local cache (different --cache-dir).
Process results:

0 modified files, 0 new files
restic 0.9: ~ 3:30:00
restic 0.8.3: ~ 1:16

Restic 0.8 did it in 1 minute and 0.9 in 3 hours!

I did it 2 times, not concurrently, and the results are the same.

What now?

Thanks,
Michele
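
To put the two timings above into numbers, here is a short calculation. It reads "1:16" as 1 minute 16 seconds and assumes "300G" means 300 × 10^9 bytes; both readings are assumptions, and the derived rates are what-if figures rather than measurements.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a duration given as h/m/s into seconds."""
    return hours * 3600 + minutes * 60 + seconds


slow = to_seconds(hours=3, minutes=30)     # restic 0.9:   ~3:30:00
fast = to_seconds(minutes=1, seconds=16)   # restic 0.8.3: ~1:16

# How many times longer the 0.9 run took.
ratio = slow / fast

# Assumption: "300G" is 300 * 10**9 bytes and "780k" is 780,000 files.
dataset_bytes = 300 * 10**9
file_count = 780_000

# Average read rate the 0.9 run would need IF it read every byte once.
rate_mb_s = dataset_bytes / slow / 10**6

# Files handled per second in each run.
files_per_s_slow = file_count / slow
files_per_s_fast = file_count / fast

print(f"0.9 took about {ratio:.0f}x as long as 0.8.3")
print(f"hypothetical full-read rate: {rate_mb_s:.1f} MB/s")
print(f"files/s: {files_per_s_slow:.0f} (0.9) vs {files_per_s_fast:.0f} (0.8.3)")
```

The 0.9 run is roughly two orders of magnitude slower. Whether it really reads every file is not established by these numbers alone.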

fd0 · June 9, 2018, 12:34pm

Please file an issue on GitHub and fill out the issue template (the runs for 0.9.0 are sufficient), maybe this is a bug so we need to dig in. It looks like the new version re-reads everything, which may be a regression.

If you can reproduce the issue (second backup with 0.9.0 is still slow) with a smaller data set that’d be helpful.

Thanks!
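
A smaller data set for reproducing the problem could be generated with a script like the following. It is only a sketch: the function name `make_dataset`, the directory layout and the default sizes are arbitrary choices for illustration, not something restic requires.

```python
import os


def make_dataset(root, dirs=10, files_per_dir=100, file_size=4096):
    """Create dirs * files_per_dir small files filled with random bytes.

    Returns the number of files created.
    """
    created = 0
    for d in range(dirs):
        sub = os.path.join(root, f"dir{d:03d}")
        os.makedirs(sub, exist_ok=True)
        for f in range(files_per_dir):
            path = os.path.join(sub, f"file{f:05d}.bin")
            with open(path, "wb") as fh:
                # Random content so that files do not share identical data.
                fh.write(os.urandom(file_size))
            created += 1
    return created
```

Backing up the generated directory twice with the same restic version into a fresh test repository, and timing the second run, shows whether the slowdown reproduces with unchanged files.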