Performance for unchanged file detection (0.9.1)

I'm wondering how efficient, and how multi-threaded, the detection of unchanged files is. I observed that backing up a folder with 99% unchanged files takes longer than I would have guessed.

As I understand it, an unchanged file is detected (by default) by file size plus modification date, so essentially one stat operation per file is required. Restic presumably also has to consult the metadata from the previous snapshot for the comparison, but that could be done in parallel.

For a backup of ~100,000 files, where only two files had changes, restic is taking ~25 minutes. Virtually all of that time is trundling through unchanged files. So I am seeing an unchanged file processing rate of about 67 files/second.

To get a naive baseline, I tried the test below, which ought to be far less efficient than restic, since I am starting a new process for every stat call. :blush: Even so, I am getting about 286 stat calls per second.

```
# time (find . -print -exec stat \{\} \; | grep '^\./' | head -1000 > /dev/null)
find: 'stat' terminated by signal 13

real    0m3.449s
user    0m0.403s
sys     0m1.059s
```

So I wondered whether restic exploits parallelism as much as it could to process unchanged files as fast as possible. @fd0, you know the innards: is there some scope for improvement we should add to the road map here?

Arguably, detecting unchanged files is restic’s most important feature to optimise, since that is probably what it spends the vast majority of its run time doing :smile:

There are two things which cause restic to be slow for (a lot of) unchanged files:

  • Listing directories and calling lstat on each item
  • Loading, decrypting and parsing the JSON data from the last backup

The former is what you recreated with your small script; the latter is probably what makes restic take its time. I don’t think it’s worth optimizing this further, though; I’d rather pursue a different idea (eventually, anyway).

I’d like to add a database to the restic cache, which maps a file path on the local machine to the timestamp for the last change and the list of blobs. So when restic encounters a file, it can consult the database instead of decrypting/parsing the metadata from the repo. This will also solve all problems we currently have with detecting the parent snapshot.

It’s an optimization, so it’s pretty low in terms of priority. :slight_smile:

EEK! FWIW, as an operator of a biggish server, optimization is the number two thing I’m hoping for, right after correctness.

I know what you mean. I’m sure a pull request would be welcome, after some design discussion and agreement.

Keep in mind files such as VeraCrypt containers can be used in a manner which doesn’t update their mtime. Comparing file sizes as well as time stamps might be an idea.

There are much more important optimizations to do first, e.g. index handling, and reducing the memory usage for restic. We’ll get there, one step at a time. For a backup program, I think correctness/safety trumps performance. :slight_smile:

Seriously? That sounds like a very bad idea; do you have a source for that? You can always use restic backup --force in such cases. For what it’s worth, restic compares, among other things, mtime, inode and size: https://github.com/restic/restic/blob/master/internal/archiver/archiver.go#L436-L450

I guess it goes with the plausible deniability policy of VeraCrypt.

I first heard about it from the Syncthing project, since they also need to detect and scan filesystem changes efficiently. I think I’ve read of a few other cases where mtimes aren’t 100% reliable for change detection.

This is from the VeraCrypt settings window; I believe the “Preserve modification time…” option is enabled by default.

Don’t get me wrong, I still love what you’re doing. Just gotta advocate for performance if I ever see a chance!