I’m currently in the process of evaluating restic - love it so far
One thing I noticed:
I do a test backup like this:
./restic -r /home/nscheer/backup --verbose=2 backup /home/nscheer/data/test_1
On a fresh repository, I can see the new files being backed up. On subsequent backups, every file is considered unchanged.
Now I add another directory like this:
./restic -r /home/nscheer/backup --verbose=2 backup /home/nscheer/data/test_1 /home/nscheer/data/test_2
Now every file (even those in test_1) is recognized as new, resulting in a complete re-read.
That’s not a problem for my small test set, but imagine a huge backup where I just want to add a directory with very little content: this would result in a complete re-read of all the files.
Is this the expected behavior? I can’t see an obvious reason why, when building some kind of internal file list, the files already in the list should be considered “new”, especially since the added directory does not share any path with the existing one.
I guess this must be an implementation detail - can anyone shed some light on this?
Restic uses as its reference the latest snapshot that was created by the same host with the same set of backup paths, so this is the expected behavior of the current implementation. You can use --parent <snapshotid> to explicitly specify a snapshot to use as the reference.
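For example, you could look up the ID of the earlier test_1-only snapshot and pass it explicitly when adding the new directory (1a2b3c4d below is a placeholder ID, not a real one):

```shell
# List existing snapshots to find the ID of the previous test_1 backup
./restic -r /home/nscheer/backup snapshots

# Reuse that snapshot as the parent when adding test_2,
# so unchanged files in test_1 are skipped by the metadata check
./restic -r /home/nscheer/backup --verbose=2 backup --parent 1a2b3c4d \
    /home/nscheer/data/test_1 /home/nscheer/data/test_2
```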
I’m not sure whether there’s a specific reason for this behavior or whether it’s just the simplest implementation that works for most use cases.
Note that this has no effect on the outcome of the backup – it’s purely an optimization. The parent is used for comparing file metadata (size, mtime, etc.). If the metadata has not changed from the parent, restic skips scanning the file’s contents.
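As a rough sketch of the kind of check involved (this is not restic's actual code; the recorded size and mtime below are hypothetical stand-ins for what a parent snapshot would have stored):

```shell
# Hypothetical values recorded in the parent snapshot for this file
recorded_size=5
recorded_mtime=1600000000

# Create a file whose metadata matches the recorded values
file=$(mktemp)
printf 'hello' > "$file"                       # 5 bytes
touch -d @1600000000 "$file"                   # set mtime (GNU touch)

size=$(stat -c %s "$file")                     # current size in bytes
mtime=$(stat -c %Y "$file")                    # current mtime as epoch seconds

# If metadata is unchanged, the file's contents need not be re-read
if [ "$size" = "$recorded_size" ] && [ "$mtime" = "$recorded_mtime" ]; then
    echo "unchanged: skip re-reading contents"
else
    echo "changed: re-read and chunk contents"
fi
```

With no parent snapshot there are no recorded values to compare against, so every file falls into the "re-read" branch, which matches the behavior observed above.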
Without a parent snapshot, restic will scan the contents of each and every file, which will make the backup take longer – but the contents are still deduplicated.
I can see that, given this strategy of selecting a reference snapshot to optimise the procedure (i.e. knowing which files are untouched), there’s no other way than what I encountered.
At first glance I thought it should be possible to just use the snapshot with “the most matching paths” - but then what would the selection criteria be? E.g. if there are two snapshots, each containing half of the paths I want to back up… that would be an optimisation problem in itself.
So the takeaway is that the optimisation of not having to re-read all the files depends on selecting a past snapshot by the criteria host + paths. I changed the paths, so no optimisation was possible.