Snapshot rewrite command

For a few days I’ve had the idea of a command based loosely on git filter-branch: you would pass it a list of snapshot IDs or snapshot filters (--host, --tag, etc.) along with a script that would be run for each snapshot.

The script would need to be written in some language that could be embedded in restic (Lua and JavaScript come to mind). The idea is that this script would be run once per snapshot and could perform mutating operations on these snapshots, such as deleting/renaming/copying files/directories, perhaps even performing operations on their contents.

Any changes could result in the introduction of (possibly-deduplicated) trees and/or blobs to describe the modified snapshot, as well as the new snapshot itself. An option to the subcommand could be used to indicate whether the old snapshots should be removed or kept after the operation (something like --forget-old or --keep-old… possibly even offering both switches and making it mandatory to specify one). As with forget, the command could also accept --prune to automatically prune if any changes were actually made.
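
To make the deduplication point concrete, here is a minimal sketch in Python of a toy content-addressed tree store. It is not restic's repository format or API: `store`, `put`, `put_tree` and `remove_path` are hypothetical names, and hashing plain JSON stands in for whatever encoding restic really uses. It only illustrates that dropping one path writes new trees along that path, while untouched subtrees are reused by ID.

```python
import hashlib
import json

store = {}  # toy content-addressed object store: id -> object


def put(obj):
    """Store an object under the SHA-256 of its canonical JSON; return the id."""
    data = json.dumps(obj, sort_keys=True).encode()
    oid = hashlib.sha256(data).hexdigest()
    store[oid] = obj
    return oid


def put_tree(entries):
    """entries maps name -> (kind, id), where kind is "blob" or "tree"."""
    return put({"type": "tree", "entries": entries})


def remove_path(tree_id, parts):
    """Return the id of a tree like tree_id but without the path `parts`.

    Subtrees that are not on the path are reused by id, so only the trees
    along the modified path are written again.
    """
    entries = dict(store[tree_id]["entries"])  # copy, old tree stays intact
    name, rest = parts[0], parts[1:]
    if name not in entries:
        return tree_id  # nothing to remove, keep the old tree
    if not rest:
        del entries[name]
    else:
        kind, child = entries[name]
        if kind != "tree":
            return tree_id  # path runs through a file, nothing to do
        new_child = remove_path(child, rest)
        if new_child == child:
            return tree_id
        entries[name] = (kind, new_child)
    return put_tree(entries)


# Made-up example data: a root with two directories.
secret = put({"type": "blob", "data": "password"})
notes = put({"type": "blob", "data": "notes"})
photo = put({"type": "blob", "data": "photo"})
docs = put_tree({"secret.txt": ("blob", secret), "notes.txt": ("blob", notes)})
pics = put_tree({"cat.jpg": ("blob", photo)})
root = put_tree({"docs": ("tree", docs), "pics": ("tree", pics)})

new_root = remove_path(root, ["docs", "secret.txt"])
print(new_root != root)                                 # True: a new root tree
print(store[new_root]["entries"]["pics"][1] == pics)    # True: pics is reused
```

The old root and the old docs tree are still in the store afterwards, which is why a separate forget/prune step would be needed to actually get rid of the unwanted data.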

I’m thinking particularly of use cases where sensitive data is accidentally backed up when it shouldn’t have been, and old snapshots need to be scrubbed of this data, but we don’t want to lose the whole backup.

A less-important but still valuable use case is when a large backup has completed and you notice that there’s a directory you should have excluded, so you add the exclusion to the backup script but don’t want to interrupt and re-run the backup. A simple rewrite script could drop those directories from the snapshot.

The current way to do this is to restore each snapshot, make the changes, and re-run the backup. This is time-consuming, inefficient, and needlessly wears storage devices.

This would overlap a bit with the planned purge feature, no?

It looks like there is some overlap; the feature I’m proposing is a more generic mechanism that could be used to accomplish the same things, but also more things.

Just out of curiosity, would good old sh not be the best option for scripting language here?

  1. Because sh is very error-prone when handling file names with spaces, quotes, etc.
  2. As far as I understand, this proposal is not just about a restore+modify+backup approach. It should handle this at the index level without downloading/uploading data, so there will be no real files on disk. And good old sh has very limited support for complex data structures like arrays.

Just one suggestion. I think that one major use case for this command is to remove a file or directory that was not excluded during backup, by mistake or for some other reason. Most likely this means that the error has already been detected and the exclude rules have already been updated so that all further backups are correct. Now the user just wants to reapply these rules to previous snapshots.

So I think it would be nice to support exactly the same command-line options as the backup command. Just being able to copy/paste the existing --exclude args would be good enough.

PS. At the same time I understand that this is not enough. If at least some excludes are managed by CACHEDIR.TAG or similar files (the “oops, found another directory with garbage” case), then a more advanced approach will be needed.

Seems good enough for the git filter-branch examples linked above, not to mention all of POSIX computing since the 1970s… :wink:

Edit: There’s a good reason a language other than sh would be inappropriate to use here. Consider as analogy this forum: it’s a tool for forum communication, written in ruby, however a knowledge of ruby is not necessary to communicate. The only knowledge required is how to use a web browser. If a user suddenly needed to know ruby in order to communicate, the forum’s value as a tool for communication would diminish.

Likewise, restic is a tool for performing/manipulating backups, written in go, however the user only needs to know how to use the terminal (i.e. how to use sh) and enter commands and switches in order to get full use. If the user suddenly needed to know go or javascript or lua in order to perform or manipulate backups, restic’s value would diminish.

I know some other programmer types might take a selfish perspective, i.e. “since I know javascript/lua, and since I think javascript/lua is the best choice, then all users should just learn javascript/lua” but we are, of course, kinder people than that.

But, I think this is sidetracking @cdhowie’s thread, so, close parenthesis…

But restic runs on more systems than POSIX. This works for Git because Git declares a dependency on a POSIX shell, so when installing Git on Windows you need to drag in a bunch of Unix tools (with bash usually among them).

Restic is distributed as a standalone, statically-linked binary on non-POSIX platforms. Whatever language restic would use for this feature would have to be provided either by all platforms, or be provided by the restic binary itself. POSIX shell seems like a bad choice in this scenario.

There’s no reason that, given examples, this approach wouldn’t work for everyone. For example, is --script="unlink('/exclude/this') unlink('/and/also/this')" really so hard for a non-programmer to understand and use?

And I’m not against having some sugar options such as --exclude that would result in restic building the script itself.

Hmm I see your point for non-POSIX systems, I hadn’t thought of that. Conversely, hopefully this should make my point clearer :wink:

It does not. Someone who is capable of using a command-line tool such as restic can surely understand what “unlink” means. If they don’t then I question whether they can even understand what --exclude means.

There’s making software more accessible, and then there’s insulting the intelligence of our users.

Hmm… Okay, so I’m sure you get that what seems obvious to you is not obvious to other people. The problem I’ve seen that some other programmers have with this idea is they then conclude that this means they’re just plain smarter than other people, and if these other, stupider people can’t understand, then they just deserve their own confusion, etc. Of course, if these programmers were as smart as their egos lead them to believe, they’d recognise that it’s much more likely the hundreds of hours of working with code that makes this or that thing seem obvious to them while eluding other people.

But, tbh I doubt this will have any effect on the path of developing this feature, and all I’m doing is providing a distraction, so I’ll just say, maybe rather than setting the goal of how you can make a thing more powerful or more clever or impressive or complex, a worthy goal could be how can you make it more simple or more friendly?

I said this before, did you miss it?

The feature can be designed to be both powerful but provide simple options for non-programmers.

Not sure if I see the connection you’re making there, but the main thing is I’m sure you’ll do a great job, and I look forward to filtering my backups to remove a lot of junk :sunglasses:

FYI, restic now has a rewrite command: “Implement 'rewrite' command to exclude files from existing snapshots” by dionorgua, Pull Request #2731 in restic/restic on GitHub.

Usage: Working with repositories — restic 0.14.0 documentation

“Three years later…this feature arrives”. This is the sort of careful and considered approach which makes restic so trustworthy.

I note that the old snapshot is left as-is, and the new one is tagged with “rewrite”. If I were to rewrite 100 snapshots, but later found I had omitted to use --forget, the current behaviour would make it very difficult to remove all of the old snapshots. I can refer to the new ones by tag, so I can easily handle them all at once. I have no way of spotting the old snapshots, which is a pain since I only ran rewrite to remove spurious data.

I’m assuming that’s because adding a tag to an existing snapshot changes its ID, and it could also mean the snapshot gets left out when snapshots are selected by grouping later on.

Has there been discussion of how - if it’s even possible - to make it easy to handle all old, rewritten snapshots? I may have missed this.

There’s currently no really easy way to match old and rewritten snapshots. The rewritten snapshots have a field original (unless that field was already set before) which points to the old snapshot. That might help. It’s probably also a good idea to open a feature request on GitHub to discuss how to make it easier to manage old and rewritten snapshots afterwards. I’m not aware of the problem having been discussed in detail yet.
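
As a rough illustration of how the original field might help, here is a small Python sketch. It assumes you already have the snapshot list as parsed JSON, with each entry exposing an `id` key and, for rewritten snapshots, an `original` key; those key names, the helper `superseded_ids` and the sample IDs are assumptions made for this example and should be checked against what your restic version actually outputs.

```python
def superseded_ids(snapshots):
    """Return IDs of snapshots that another snapshot names in `original`."""
    known = {s["id"] for s in snapshots}
    originals = {s["original"] for s in snapshots if s.get("original")}
    # Only report originals that still exist in the listing.
    return sorted(originals & known)


# Made-up sample data, not real restic output.
snapshots = [
    {"id": "aaa111", "tags": []},
    {"id": "bbb222", "tags": []},
    {"id": "ccc333", "tags": ["rewrite"], "original": "aaa111"},
]

print(superseded_ids(snapshots))  # ['aaa111']
```

Since the field is left alone when it was already set, a snapshot that was rewritten more than once would still point at its first original, so intermediate versions would not show up in a list like this.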

The underlying problem is that the snapshot ID is a hash of the encrypted snapshot data which therefore will always change when the snapshot is modified.
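
A toy Python illustration of that point, under simplifying assumptions: here the ID is a SHA-256 over plain JSON, whereas restic hashes the encrypted snapshot data, and `snapshot_id` and the sample fields are made up for the example. The effect is the same in kind: any change to the content, even just a tag, yields a different ID.

```python
import hashlib
import json


def snapshot_id(snapshot):
    """Toy ID: SHA-256 over a canonical serialization of the snapshot."""
    data = json.dumps(snapshot, sort_keys=True).encode()
    return hashlib.sha256(data).hexdigest()


# Made-up snapshot metadata.
snap = {"hostname": "laptop", "paths": ["/home"], "tree": "abc123", "tags": []}
tagged = dict(snap, tags=["old"])  # same snapshot, one tag added

print(snapshot_id(snap) == snapshot_id(tagged))      # False: the ID changed
print(snapshot_id(snap) == snapshot_id(dict(snap)))  # True: same content, same ID
```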

I shall do just that, thanks. I did struggle to try to explain myself clearly but must have gotten through :slight_smile:

I hadn’t realised that was the reason for snapshot IDs changing, a handy tidbit.