Indicate which files have changed with --files-from

chrestomanci · February 3, 2019, 3:36pm

I am currently using restic to back up an email archive that lives on a btrfs filesystem with snapshots created using snapper. The snapshots make it easy to find out which files have changed since the last backup and to pass that list to restic using the --files-from argument.

Example:

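# Create a new snapshot to compare against
snapper -c mail_dirs create --description snap_for_daily_backup
# Look up the number of that snapshot (1416 in this example)
snapper -c mail_dirs list | grep snap_for_daily_backup | cut -d '|' --fields=1
# List the files that differ between the two snapshots
snapper -c mail_dirs diff --diff-cmd "/usr/bin/diff --new-file --brief" 0..1416 > /tmp/file_diff_list
# Keep only the first path from each "Files A and B differ" line
cat /tmp/file_diff_list | perl -lne 'print $_ if s:Files (.*) and (.*) differ:$1:' > /tmp/files2backup.txt
# Back up just the listed files
restic backup --verbose -o b2.connections=20 --limit-upload=3072 --files-from /tmp/files2backup.txt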

The sequence of commands above typically returns about one hundred email message files that have changed in the last few hours. As expected, the backup of just those messages is quick because restic does not have to examine half a million files to discover that most are unchanged.
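For anyone who wants to try the filtering step on its own, here is a minimal sketch of the same extraction done with sed instead of perl. The function name changed_paths and the sample paths are made up for illustration; only the "Files A and B differ" line format comes from diff --brief. Like the perl version, it will pick the wrong split point if a path itself contains " and ".

```shell
#!/bin/sh
# Hypothetical helper (not part of restic or snapper): reads `diff --brief`
# output on stdin and prints the first path of every "Files A and B differ" line.
changed_paths() {
    # -n suppresses default output; the trailing p prints only lines that matched.
    sed -n 's/^Files \(.*\) and .* differ$/\1/p'
}

# Sample input in the `diff --brief` format; the paths are invented.
# The second line does not match the pattern and is dropped.
printf '%s\n' \
    'Files /home/mail_dirs/a/cur/1 and /home/mail_dirs/.snapshots/1416/snapshot/a/cur/1 differ' \
    'Only in /home/mail_dirs/a/tmp: scratch' |
    changed_paths
```

The output of changed_paths could be redirected to a file and handed to restic with --files-from, in the same way as /tmp/files2backup.txt above.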

My problem is that I now have snapshots in my backup that list all those file paths, instead of just another neat backup of /home/mail_dirs like all the other backups.

Is there a way to run the backup so that it is stored as another snapshot of the same directory, while passing in a list of files that I know have changed in order to bypass restic’s directory walking code?

764287 · February 3, 2019, 5:00pm

The benefit of restic’s backup approach is that every backup is essentially a full backup without the space requirements of a full backup. You are now changing the backup process to work like an incremental backup because you want to save 10-120 seconds of scan time? How are you planning to restore those snapshots?

chrestomanci · February 3, 2019, 6:19pm

The scan takes a lot longer than 120 seconds. (Note that I am using a B2 backend, which is fairly slow.)

If the list of changed files is correct, then backing up just those files will be equivalent to running a backup on the top level directory and letting restic work out which files have changed. What I am asking is for restic to trust the supplied list of files as correct, in order to save the scan time. (In corner cases, it is more likely to be correct than any scan that restic could perform in reasonable time, because btrfs knows when a file has changed even if the timestamp or other metadata does not change.)

An alternative approach would be to put in a feature request for restic to make use of snapshot-specific APIs in snapper and btrfs to achieve this, but I know that the author is trying to make restic as platform neutral as possible, so he is very unlikely to be in favour of a feature that only works on one filesystem under one OS.

fd0 · February 3, 2019, 7:40pm

The time used for the scan does not depend on the backend: almost all the data needed is cached locally. Is it possible that the cache directory is not being retained between runs?