Adding paths to backup results in all files being recognized as new

scopev24 · February 6, 2020, 8:32pm

Hi!

I’m currently evaluating restic and love it so far.

One thing I noticed:
I do a test backup like this:
./restic -r /home/nscheer/backup --verbose=2 backup /home/nscheer/data/test_1

On a fresh repository, I can see the new files being backed up. On subsequent backups, every file is considered unchanged.

Now I add another directory like this:
./restic -r /home/nscheer/backup --verbose=2 backup /home/nscheer/data/test_1 /home/nscheer/data/test_2

Now every file (even from test_1) is recognized as new, resulting in a complete re-read.
That’s not a problem for my small test set, but I’m thinking of a huge backup where I just want to add a directory with very little content. That would result in a complete re-read of all the files.

Is this the expected behavior? I can’t see an obvious reason why, when restic builds some kind of file list internally, the files already in that list should be considered “new”, even though the added directory does not share any path with the existing one.

I guess this must be an implementation detail - can anyone shed some light on this?

Greetings and thanks for your help!

Nico

MichaelEischer · February 6, 2020, 8:56pm

As its reference, restic uses the latest snapshot that was created by the same host with the same backup paths, so this is the expected behavior of the current implementation. You can use --parent <snapshotid> to explicitly specify a snapshot to use as reference.
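For the two directories from the first post, that could look like the sketch below. The helper only prints the command rather than running it, and its name (backup_with_parent) is made up for illustration; it is not part of restic.

```shell
#!/bin/sh
# Hypothetical helper (not part of restic): prints a backup command that
# pins the parent snapshot instead of relying on automatic selection.
# Usage: backup_with_parent REPO PARENT_ID PATH...
backup_with_parent() {
    repo=$1
    parent=$2
    shift 2
    # "$*" joins the remaining arguments (the backup paths) with spaces,
    # so this sketch assumes paths without whitespace.
    echo "restic -r $repo backup --parent $parent $*"
}
```

Calling backup_with_parent /home/nscheer/backup '<snapshotid>' /home/nscheer/data/test_1 /home/nscheer/data/test_2 prints the full command with the placeholder still in it; replace <snapshotid> with an ID taken from the output of restic -r /home/nscheer/backup snapshots.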

I’m not sure whether there’s a specific reason for this behavior or whether it’s just the simplest implementation that works for most use cases.

cdhowie · February 6, 2020, 9:16pm

Note that this has no effect on the outcome of the backup – it’s purely an optimization. The parent is used for comparing file metadata (size, mtime, etc.). If the metadata has not changed from the parent, restic skips scanning the file’s contents.

Without a parent snapshot, restic will scan the contents of each and every file, which will make the backup take longer – but the contents are still deduplicated.
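The idea can be illustrated with a toy shell model. This is not restic's code, and it only looks at size and mtime: a file is re-read when its current metadata differs from what was recorded for the parent, and an empty record (no parent) always forces a read. The function names are hypothetical.

```shell
#!/bin/sh
# Toy model of the shortcut described above, NOT restic's implementation.

# Print "size mtime" for a file. Tries the GNU coreutils form of stat
# first and falls back to the BSD/macOS form.
file_meta() {
    stat -c '%s %Y' "$1" 2>/dev/null || stat -f '%z %m' "$1"
}

# needs_read FILE RECORDED_META
# Succeeds (exit 0) if the file content must be read again,
# fails (exit 1) if the recorded metadata still matches.
needs_read() {
    [ "$(file_meta "$1")" != "$2" ]
}
```

Either way the data ends up deduplicated in the repository; the check only decides whether the file has to be read from disk at all.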

scopev24 · February 7, 2020, 9:21am

Hi there,

thanks!

I can see that with this strategy of selecting a snapshot in order to optimise the procedure (as in knowing which files are untouched etc.) there’s no other way than what I encountered.

At first glance I thought it should be possible to just use the snapshot with “the most paths matching” - but then again, what would the selection criteria be? E.g. there are two snapshots, each covering half of the paths I want to back up… That would be an optimisation problem in itself.

So the takeaway is that the optimisation of not having to re-read all the files depends on restic finding a past snapshot that matches on host + paths. I changed the paths, so no snapshot matched and no optimisation was possible.
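A toy model of that selection rule, as described in this thread, might look like the sketch below. It is not restic's code: the input format (ID|HOST|PATHS, one snapshot per line, oldest first) and the function name are invented for illustration.

```shell
#!/bin/sh
# Toy model of the selection rule from this thread, NOT restic's code.
# Reads snapshots from stdin, oldest first, in the made-up format
# "ID|HOST|PATHS", where PATHS is a comma-separated list.
# Prints the ID of the latest snapshot whose host and path list both
# match exactly; prints nothing if no snapshot matches.
# Usage: pick_parent HOST PATHS
pick_parent() {
    awk -F'|' -v host="$1" -v paths="$2" '
        $2 == host && $3 == paths { id = $1 }
        END { if (id != "") print id }
    '
}
```

With a history that only contains snapshots of test_1, asking for the list test_1,test_2 prints nothing, which corresponds to the case where every file is read again.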

Thanks again!

Greetings

Nico

cdhowie · February 7, 2020, 5:17pm

No automatic optimization is possible, but you can always use --parent to disable automatic parent selection and force a specific snapshot to be used as the parent.

scopev24 · February 7, 2020, 5:51pm

Nice catch!

This basically solves my scenario (a.k.a. the fear of waiting for restic to re-read everything just because I added a folder to my --files-from list)…

I just tried it and it works great:

1. Make a snapshot with a path list using --files-from.
2. Add something to the list and make another snapshot using the previous snapshot as parent.
3. Only the new paths are considered “new”.
4. Subsequent snapshots are again “optimized” automatically (as long as the list remains untouched, that is).
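The steps above can be sketched as two small shell helpers. The function names and the list-file handling are hypothetical, and the restic command is printed rather than executed.

```shell
#!/bin/sh
# Sketch of the --files-from workflow; helper names are hypothetical.

# add_path LIST PATH
# Appends PATH to the list file unless it is already in there.
# Returns 0 if the list changed, 1 if it was left untouched.
add_path() {
    if [ -f "$1" ] && grep -qxF -e "$2" "$1"; then
        return 1
    fi
    printf '%s\n' "$2" >> "$1"
}

# backup_cmd REPO LIST [PARENT_ID]
# Prints the backup command. --parent is only added when an ID is given,
# i.e. for the first backup after the list has changed.
backup_cmd() {
    if [ -n "${3:-}" ]; then
        echo "restic -r $1 backup --files-from $2 --parent $3"
    else
        echo "restic -r $1 backup --files-from $2"
    fi
}
```

When add_path reports a change, pass the ID of the previous snapshot to backup_cmd once; for later runs with an unchanged list, leave it out and let restic pick the parent on its own.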

Greetings

Nico