Efficient check for unchanged source?

mrodent · March 25, 2021, 3:22pm

To find out whether a new snapshot of a location is actually necessary (i.e. whether any files or directories have changed), my rsync scripts do a “dry run” first. There are other possibilities too, such as (Linux) diff, or a Python script.

Neither the rsync “dry run” nor diff is entirely satisfactory, because neither lets you say “stop the process as soon as you detect a difference”. If you’re only interested in finding out whether there is any difference at all, this would be a good option to have.
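For the rsync case, where the previous snapshot is an ordinary directory tree, the “stop at the first difference” idea could be sketched in Python roughly as follows. This is only an illustration: the function name trees_differ and its parameters are made up for this example, it relies on size/mtime signatures first (via filecmp's shallow comparison), and it does not handle special cases such as broken symlinks.

```python
import filecmp
import os


def trees_differ(src: str, snap: str) -> bool:
    """Return True as soon as the first difference between two trees is found.

    src  -- the working (source) directory
    snap -- the directory holding the last snapshot (hypothetical layout)
    """
    for root, dirs, files in os.walk(src):
        other = os.path.join(snap, os.path.relpath(root, src))
        if not os.path.isdir(other):
            return True  # directory is missing from the snapshot
        # A difference in the set of names covers additions and deletions.
        if set(dirs) | set(files) != set(os.listdir(other)):
            return True
        for name in files:
            # shallow=True trusts identical type/size/mtime and only reads
            # file contents when those signatures differ.
            if not filecmp.cmp(os.path.join(root, name),
                               os.path.join(other, name), shallow=True):
                return True
    return False
```

Because the function returns at the first mismatch, a changed tree is usually detected quickly; only a completely unchanged tree forces a full walk.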

I can’t see any of the restic commands which let you do a “dry run” (and your restic diff command obviously compares snapshots). Is there anything I’ve missed? Do you think you might consider implementing this at some point?

I.e. in git terminology, something to detect whether “everything is clean” between the working (source) files and the last commit/snapshot.

gurkan · March 25, 2021, 7:49pm

The dry-run option is only included in risky operations, like forget and prune.

But why not just add another snapshot in any case? If there is no difference, it won’t add anything to the repository (except the snapshot file, which won’t cause any issue or noticeably increase the repo size).
In order to compare two data sources, you’ll need to traverse/hash them anyway, so just running backup would already be doing this imho.

I might also be missing something, but operationally that is what I saw; maybe the devs will correct me.

mrodent · March 26, 2021, 6:52am

Thanks, makes sense on one level.

After thinking about it I realise that restic is significantly different to rsync (used with --link-dest, i.e. hard links), in the sense that if you did an rsync backup, even if things had not changed, you would be creating thousands of hard links (which do occupy a finite disk space, and do take time to create and also to delete), whereas with restic this doesn’t apply.

BUT… in terms of pruning of old snapshots there seems to me to be a potential issue: if you are doing hourly snapshots, and you decide that you want to keep the last 5 hourly snapshots where things have changed, how would you do that? It wouldn’t matter if in fact over the past 10 hours you had done one or two “no change” snapshots, but how do I in fact know when the quota of 5 “things changed” snapshots has been reached, and also how do I know which is the oldest “things changed” snapshot?

I think I would want to identify the oldest “things changed” snapshot, and then prune it, but also prune any “no change” snapshots which were older than the next most recent “things changed” snapshot (i.e. the oldest one which I was planning to keep for the moment).

If there’s no way I can distinguish between “things changed” and “no change” snapshots, pruning (i.e. in this example, automatically pruning the oldest snapshot where there are 6 of them) might in fact leave you with 0 “things changed” snapshots. This would arise where you hadn’t changed the source files over the past 5 hours, but the restic snapshot hourly job had continued to function.
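The retention rule described above can be written down as a small pure function, assuming you have somehow recorded for each snapshot whether it was a “things changed” one. Everything here is hypothetical illustration (the name select_to_forget, the (id, changed) tuples); it is not something restic provides.

```python
from typing import List, Tuple


def select_to_forget(snapshots: List[Tuple[str, bool]], keep: int = 5) -> List[str]:
    """Pick the snapshots to remove.

    snapshots -- (snapshot_id, changed) pairs, oldest first
    keep      -- how many "things changed" snapshots to retain

    Everything older than the keep-th most recent "things changed"
    snapshot is returned, whether or not it changed anything.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    changed_positions = [i for i, (_, changed) in enumerate(snapshots) if changed]
    if len(changed_positions) <= keep:
        return []  # quota not exceeded yet, nothing to remove
    cutoff = changed_positions[-keep]  # oldest snapshot we still keep
    return [snapshot_id for snapshot_id, _ in snapshots[:cutoff]]
```

“No change” snapshots newer than the cutoff are left alone in this sketch; they are harmless, and removing them is a separate decision.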

Of course, with restic I can’t just run a diff-type command, as I can with rsync, because everything’s cleverly packaged in a repository (and encrypted).

Maybe the answer to this problem is what you referenced, the “dry run” thing in prune. Or maybe the info about how much additional disk space was needed by a given snapshot is kept with it in the repo? If this was 0 you would obviously know this was a “no change” snapshot.

I’ll look into these things today, hopefully.

764287 · March 26, 2021, 9:34am

restic does show stats after each backup when not run in --quiet mode:

repository c331bc3c opened successfully, password is correct
using parent snapshot c448cedd

Files:           0 new,     7 changed,   834 unmodified
Dirs:            0 new,     9 changed,   278 unmodified
Added to the repo: 124.094 KiB

processed 841 files, 151.171 MiB in 0:00
snapshot 6c52d583 saved

If you intend to parse the output of these stats you should use --json as it’s more reliable:

restic --json backup ./ | jq 'select(.message_type=="summary") | .files_new, .files_changed, .dirs_new, .dirs_changed'
0
7
0
9

doscott · March 26, 2021, 9:35am

This post shows two methods of getting a diff on two snapshots:

With either method you should be able to parse the output to check if changes were made and then forget the snapshot if no changes were made.

This is the output of the first method:
comparing snapshot 58f478ad to e7670c62:

M    /home/dos/.bash_history
-    /home/dos/gitlab/backups/1616036714_2021_03_18_13.4.4_gitlab_backup.tar
-    /home/dos/gitlab/backups/1616123114_2021_03_19_13.4.4_gitlab_backup.tar
+    /home/dos/gitlab/backups/1616727914_2021_03_26_13.10.0_gitlab_backup.tar
-    /home/dos/gitlab/secrets/etc-gitlab-\1616641514.tgz
+    /home/dos/gitlab/secrets/etc-gitlab-\1616727914.tgz
M    /home/dos/mysql/APM_Forum.sql
M    /home/dos/mysql/mysql.sql
M    /home/dos/mysql/scottfamily.sql
M    /media/data/documents/Finance/MyMoney.kmy
M    /media/data/documents/Investing/Investing.ods
M    /root/.dbus/session-bus/c8cb0a3a1e0e4971b0874c39f9bbec3c-0
M    /root/.vnc/config.d/vncserver-x11
M    /root/.vnc/config.d/vncserver-x11.d/BootstrapCache.pkg
M    /root/.vnc/config.d/vncserver-x11.d/CloudCredentials.bed
M    /root/.vnc/config.d/vncserver-x11.d/RegionCache.bed

Files:           2 new,     3 removed,    11 changed
Dirs:            0 new,     0 removed
Others:          0 new,     0 removed
Data Blobs:     18 new,    23 removed
Tree Blobs:   6356 new,  6356 removed
  Added:   64.984 MiB
  Removed: 72.997 MiB
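
A minimal sketch of parsing that output in Python, assuming the text format stays exactly as shown above (the function name diff_has_changes is made up for this example). It only looks at the Files, Dirs and Others lines and ignores the blob counters and the Added/Removed sizes.

```python
import re

# Matches the per-category summary lines of the diff output, e.g.
# "Files:           2 new,     3 removed,    11 changed"
STATS_LINE = re.compile(r'^\s*(Files|Dirs|Others):\s*(.*)$')


def diff_has_changes(diff_output: str) -> bool:
    """Return True if any Files/Dirs/Others counter in the output is non-zero."""
    seen_stats = False
    for line in diff_output.splitlines():
        match = STATS_LINE.match(line)
        if not match:
            continue
        seen_stats = True
        counts = [int(n) for n in
                  re.findall(r'(\d+)\s+(?:new|removed|changed)', match.group(2))]
        if any(counts):
            return True
    if not seen_stats:
        # Refuse to report "no changes" when the summary is missing,
        # e.g. because the command failed and printed an error instead.
        raise ValueError('no Files/Dirs/Others summary lines found')
    return False
```

Raising on a missing summary is a deliberate choice here: treating an error message as “no changes” could lead to forgetting a snapshot that should be kept.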

mrodent · March 26, 2021, 10:55am

Thanks both. I was aware of the stats printed when you do a backup, yes.

But I think restic diff is the way to go.

Also, a crucial gotcha occurs to me using the stats printed at backup: I said

If this was 0 you would obviously know this was a “no change” snapshot.

But of course that wouldn’t be true if the deletions freed up exactly as much space as the additions took up, which would also show 0.

Whereas parsing the output from restic diff you can analyse both “Added” and “Removed”. Conceivably you might also even have to look at the other diff stats in some edge cases or other: not sure.

mrodent · March 27, 2021, 11:41am

Finally, for my purposes, I think it is actually better to analyse the output from backup. If you use the --json switch, as mentioned by 764287, and then work out how to process the result from subprocess.run, something like this:

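import json
import logging
import subprocess
import sys

logger = logging.getLogger(__name__)

# repo_location, snapshot_frequency, pwd_file and source_location are
# assumed to be defined earlier in the script.
restic_result = subprocess.run(
    ['restic', '-r', repo_location, '--verbose', '--json',
     '--tag', snapshot_frequency, '-p', pwd_file, 'backup', source_location],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=60).stdout
# Strip the "erase line" escape sequences, then split into one message per line.
restic_result = restic_result.replace(b'\x1b[2K', b'').decode('utf-8').splitlines()
for line in restic_result:
    if line:
        try:
            json_obj = json.loads(line)
        except json.JSONDecodeError:
            # Not JSON: either a fatal error or some other plain-text message.
            if line.strip() == 'Fatal: wrong password or no key found':
                logger.error(line)
                sys.exit(1)
            else:
                logger.info(f'backup command returned this line:\n|{line}|')
            continue
        if json_obj['message_type'] == 'summary':
            ...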

… then you find that in fact the info provided is more than just “Added to the repo”, so you can do this:

if (json_obj['files_new'] == 0 and json_obj['files_changed'] == 0 and
        json_obj['dirs_new'] == 0 and json_obj['dirs_changed'] == 0 and
        json_obj['data_blobs'] == 0 and json_obj['tree_blobs'] == 0 and
        json_obj['data_added'] == 0):
    logger.info('...seems to be a no-change snapshot')

You can then use the snapshot ID returned here in json_obj (its first 8 characters are enough) to remove this “no change” snapshot with restic forget, leaving an uncluttered repo in which every snapshot is a “things changed” snapshot.
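
A rough sketch of that last step, assuming the summary message carries the ID in a snapshot_id field (as I understand restic's JSON output) and that the repository and password file are passed the same way as in the backup call. The function name forget_if_unchanged and the run parameter are made up for this example; run exists only so the command can be checked without touching a real repository.

```python
import subprocess

# Counters from the summary message that must all be zero for a
# snapshot to be treated as "no change".
ZERO_KEYS = ('files_new', 'files_changed', 'dirs_new', 'dirs_changed',
             'data_blobs', 'tree_blobs', 'data_added')


def forget_if_unchanged(summary, repo_location, pwd_file, run=subprocess.run):
    """Forget the snapshot described by summary if nothing changed.

    Returns True if the forget command was issued, False otherwise.
    """
    if any(summary[key] != 0 for key in ZERO_KEYS):
        return False
    # "forget" removes the snapshot itself; a later "prune" would reclaim
    # any data that is no longer referenced by a snapshot.
    run(['restic', '-r', repo_location, '-p', pwd_file,
         'forget', summary['snapshot_id']], check=True)
    return True
```

Called as forget_if_unchanged(json_obj, repo_location, pwd_file) from inside the summary branch of the loop, this would drop the snapshot immediately after it was created.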

mrodent · March 28, 2021, 8:41am

Hmmm… yet another subtlety occurs to me here: just because all these numbers in the last post are at 0 doesn’t necessarily mean that you should discard this snapshot!

For example, if you deleted a file, did a snapshot, and then reinstated the file, and did another snapshot: past snapshots would mean that this blob (or blobs) for this file would already be present in the repo.

I’m not clear whether the value for “files_changed” would then be 0… what exactly does “files_changed” mean when you do a backup? Relative to the most recent snapshot? Regardless of tags? Or does it cleverly identify a file which has changed, but not so much as to qualify it as a “new” file (in the manner of git’s algorithms)?

On balance maybe using restic diff might be safer until I understand more about restic.