Clarification about forget flags

I don’t currently use the forget flags “--keep-hourly” and “--keep-daily”, for the following reason. I want to know whether I understand correctly what these flags do.

At the moment I organise my snapshots by running a Python script as a scheduled task (every 10 minutes). It adds or removes tags for given frequencies (“10-minute”, “hour”, “day”, etc.) based on differences analysed using restic diff, allowing a maximum number of snapshots of each frequency to persist …

My script means that a tag of a given category will be added to the new snapshot only if differences have been detected relative to the most recent snapshot of that frequency category, so it’s not “every hour unconditionally”, which would be less desirable.

At the end of the script the final job is to delete any snapshots which now have no tags (as a result of no tags being added, or tag removals). This/these will be snapshots which are the “oldest in frequency category”, or could be the latest snapshot, if difference analysis shows that it is superfluous for all frequency categories.
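The tagging and clean-up steps described above could be sketched roughly as follows. This is only an in-memory illustration, not the actual script: the `Snapshot` class, the `rotate_tags` function and the `changed_categories` argument are hypothetical names, and the list of changed categories is assumed to have been worked out beforehand (for instance from restic diff output).

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    # Hypothetical in-memory stand-in for a restic snapshot.
    snapshot_id: str
    tags: set = field(default_factory=set)


def rotate_tags(snapshots, new_snapshot, changed_categories, max_per_category):
    """Tag the new snapshot for every category in which a change was detected,
    untag the oldest holders when a category exceeds its limit, and split the
    result into snapshots that still have tags and snapshots that have none."""
    snapshots = snapshots + [new_snapshot]  # oldest first, newest last
    for category in changed_categories:
        new_snapshot.tags.add(category)
        tagged = [s for s in snapshots if category in s.tags]
        excess = max(0, len(tagged) - max_per_category[category])
        for old in tagged[:excess]:
            old.tags.discard(category)
    kept = [s for s in snapshots if s.tags]
    untagged = [s for s in snapshots if not s.tags]  # candidates for deletion
    return kept, untagged


# A change was detected, so "c" is tagged and the oldest "hour" snapshot drops out.
history = [Snapshot("a", {"hour"}), Snapshot("b", {"hour"})]
kept, untagged = rotate_tags(history, Snapshot("c"), ["hour"], {"hour": 2})
print([s.snapshot_id for s in kept], [s.snapshot_id for s in untagged])
# prints ['b', 'c'] ['a']

# Nothing changed, so the new snapshot "d" gets no tag and is itself the one to delete.
kept, untagged = rotate_tags(kept, Snapshot("d"), [], {"hour": 2})
print([s.snapshot_id for s in kept], [s.snapshot_id for s in untagged])
# prints ['b', 'c'] ['d']
```

The second call shows the case mentioned above where the latest snapshot is the superfluous one: no older snapshot is displaced when nothing has changed.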

As I understand it, using the available off-the-shelf forget flags you might end up with 5 identical “hour” snapshots. But my last 5 “hour” snapshots could stretch back over days, depending on what has changed in my directory structure. A comparison (diff) must be made with the previous snapshot (of a given frequency) to know whether there is any reason to add the new snapshot. Adding a new one obviously also means discarding the oldest one: if all 5 are identical at that point this means you would lose potentially valuable information: it would impair the usefulness of having multiple snapshots, which is much of restic’s genius (and beauty).

Have I understood this correctly? Are these available forget flags indeed “dumb”? Or do they in fact incorporate a comparison mechanism?

Have you understood what correctly? I think you need to give a concrete example of snapshots and what policy arguments you supply to forget to get a useful answer. But then again, you can just use the --dry-run option and actually see for yourself what forget would do with your snapshots given various flags. Can you do that?

Other than that, is there any particular part of Removing backup snapshots — restic 0.14.0 documentation that is unclear? If yes, which part and in what way is it unclear?

I’ve indeed read that page… and from your answer it appears that the answer is indeed yes, these flags are “dumb”, i.e. they make no attempt to see whether the newly created snapshot is indeed needed. That page does not discuss this point.

I repeat, the problem with this is that this impairs the usefulness of having multiple snapshots, which is much of restic’s genius (and beauty). Saving multiple snapshots which are identical, and at the same time thereby deleting the oldest n snapshots in a given frequency category, undermines the purpose of restic and inefficiently discards potentially valuable information.

Fortunately solutions to this can be found in the boundless scripting possibilities offered by Python (for example), and the fact that restic offers diff.

Well, the duration/time related --keep* options to the forget command are about applying a policy based on time, not something else. I guess to you that means they are “dumb”. The entire point of how snapshots work is that they record the state of the data you back up at a given point in time, so that’s what they’re doing.

At the same time, please realize that even if you have multiple snapshots that refer to the same data having been backed up, that is only a matter of them taking up a few KB of data in your filesystem, so there’s really no practical issue in that regard.

Not sure what answer you are looking for :slight_smile: Then again I might not really understand what you are actually asking :slight_smile:

Yes, multiple identical snapshots (e.g. shall we say 5 “hour” snapshots) seem harmless.

But think about it for a moment: this means that, say I do a big edit at 10:30 PM and then leave my machine overnight, when I come back next morning there will be “hour” snapshots at 3 AM, 4 AM, 5 AM, 6 AM, 7 AM… but the valuable snapshot from 10 PM the previous night will potentially be lost forever.

But my script wouldn’t do that: it would see at 11 PM, midnight, 1 AM … until 7 AM that nothing had changed … so would not add any new snapshot or, crucially, forget the 10 PM one.

The big edit you made at 10:30 PM, and that was backed up at I guess 10:40 PM or whatever (assuming you back up every ten minutes), will be referenced in all of those snapshots. So even if you were to delete all of the snapshots from that night except e.g. the one from 7 AM, that 7 AM one that you kept will still contain/reference that last version of the file you changed.
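To illustrate the point, here is a toy model in Python. It is not restic’s actual repository format: `take_snapshot` is a hypothetical helper that treats a snapshot as a mapping from path to a hash of the file content, which is enough to show why a later snapshot of unchanged files describes the same state as an earlier one.

```python
import hashlib


def take_snapshot(files):
    """Toy model: a snapshot maps each path to a hash of that file's content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


files = {"report.txt": b"draft"}
before_edit = take_snapshot(files)

files["report.txt"] = b"big edit made at 10:30 PM"
snap_2240 = take_snapshot(files)  # first backup after the edit
snap_0700 = take_snapshot(files)  # next morning, nothing changed overnight

# The 7 AM snapshot describes exactly the same content as the 10:40 PM one,
# so forgetting the 10:40 PM snapshot loses no file content in this model.
print(snap_0700 == snap_2240)    # prints True
print(snap_0700 == before_edit)  # prints False
```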

But, I don’t see why you’d need to remove snapshots this fast. Why not keep all snapshots (i.e. every ten minute ones) for the last two days, hourly snapshots from the last week, one daily snapshot for say two months back, one monthly snapshot for the last half year, and even one snapshot for the last two years? Something like this is pretty normal, and I don’t really see why you’d end up in a situation like the one you describe.
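As a rough sketch, a policy like that could be tried out with the --dry-run option mentioned earlier, for example by building the command in Python. The numbers are only my approximation of the policy described above: --keep-hourly and friends count hours, days, months and years that actually contain snapshots rather than calendar spans. `build_forget_command` is a made-up helper name, and the repository and password are assumed to come from the environment (e.g. RESTIC_REPOSITORY and RESTIC_PASSWORD).

```python
import shlex


def build_forget_command(dry_run=True):
    """Build a restic forget command line approximating the policy above."""
    cmd = [
        "restic", "forget",
        "--keep-within", "2d",    # every snapshot from the last two days
        "--keep-hourly", "168",   # roughly a week of hourly snapshots
        "--keep-daily", "60",     # roughly two months of daily snapshots
        "--keep-monthly", "6",    # half a year of monthly snapshots
        "--keep-yearly", "2",     # two yearly snapshots
    ]
    if dry_run:
        cmd.append("--dry-run")   # only report what would be removed
    return cmd


print(shlex.join(build_forget_command()))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run`. With --dry-run nothing is removed, so it is safe to experiment with different numbers.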

I know you’re trying to be helpful, but I think you are perhaps missing the point here. The 10 PM snapshot in my example would definitely be lost forever. Of course I have multiple different frequencies (and there might be many more than 5 snapshots): we’re getting lost in the details here.

I’m making a simple point about the fundamental inefficiency of mechanically and unconditionally adding snapshots to the repo (and thus necessarily discarding old ones) when that serves no purpose and inefficiently discards potentially valuable information in an undesirable way.

Okay, let’s break this down so we clear it up.

Snapshots are added when you run backups. I guess you’re saying that if nothing changed you don’t want a snapshot created? If yes, okay, that’s a fair opinion, but that is not how restic was designed. It was designed to record the state of your files at the time when you run a backup, regardless of what in it changed or not. If nothing changed, having an extra snapshot for that backup run is not a practical problem. If you think it is, please explain in what way that is a practical problem?

Not sure what you mean by this - snapshots are only discarded/forgotten when and according to your own control, restic doesn’t remove snapshots by itself, as we both know. Can you please clarify what you mean here?

What valuable information is lost?


I’m not trying to dismiss what you’re doing, but it’s to me a bit unclear what you think is an actual problem here.

Not sure what you mean by this - snapshots are only discarded/forgotten when and according to your own control, restic doesn’t remove snapshots by itself, as we both know. Can you please clarify what you mean here?

So I have decided to keep the last 5 (or 25 or 2500) “hour” snapshots. If I already have 5 (or 2500), and I add a new one, the idea is that the oldest is then forgotten, right? If the last 5 are all the same then, in my system, that oldest one would not be forgotten at that point, because diff would have told the script that nothing had changed. I can’t explain it any more clearly than that.

that is not how restic was designed. It was designed to record the state of your files at the time when you run a backup, regardless of what in it changed or not.

If, in my example posted above, I want to find out the state of my source files at 3 AM, I simply backtrack to the previous snapshot chronologically, which would be 11 PM, secure in the knowledge that nothing had changed between 11 PM and 3 AM.

Honestly, I think I can’t explain it any more and am off to have my dinner. :grin:

The forget flags are time-based. They don’t look at the content of a snapshot.
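A simplified model of a time-based hourly rule, applied to the overnight example from earlier in the thread, might look like this. It is a sketch of the documented behaviour (keep the most recent snapshot in each of the last n hours that have snapshots), not restic’s implementation; `keep_hourly` and the dates are made up for illustration.

```python
from datetime import datetime


def keep_hourly(snapshot_times, n):
    """Keep the most recent snapshot in each of the last n hours that contain
    at least one snapshot. Only timestamps are used, never content."""
    kept = []
    seen_hours = set()
    for t in sorted(snapshot_times, reverse=True):  # newest first
        hour = (t.year, t.month, t.day, t.hour)
        if hour in seen_hours:
            continue  # a newer snapshot already represents this hour
        if len(seen_hours) == n:
            break     # n hours are covered, everything older is dropped
        seen_hours.add(hour)
        kept.append(t)
    return sorted(kept)


# One backup at 10:40 PM after the big edit, then one every hour until 7 AM.
times = [datetime(2022, 10, 1, 22, 40), datetime(2022, 10, 1, 23, 0)]
times += [datetime(2022, 10, 2, h, 0) for h in range(0, 8)]

kept = keep_hourly(times, 5)
print([t.strftime("%H:%M") for t in kept])
# prints ['03:00', '04:00', '05:00', '06:00', '07:00']
```

The 10:40 PM snapshot is not among the five kept, purely because of its timestamp.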

First off, we are not talking about diff here, since that is something you are doing in your script. We are only talking about how restic works with its options to the forget command. So the part where you mention diff I can’t really comment on. The forget command does not use diff, as you know.

If you have five snapshots, and then add a new one, you will end up with six snapshots. Restic does not automatically remove the oldest (first) one and then leave you with four of the older ones and the new one being the fifth snapshot. It only removes snapshots when you run forget. I guess you knew that, but just to clarify.

So let’s then assume that with your six snapshots, you run forget in such a way that it removes the oldest of these six, leaving you with the five most recent snapshots. Since we’re not talking diff here, I don’t really know what you tried to say with “that oldest one would not be forgotten at that point, because diff would have told the script that nothing had changed”.

Again, not sure what the problem is with this part.

If you want to find out the state (I guess, for the sake of discussion, that what you mean is “open the file and see what’s in it”), then you do not need to go back to the 11 PM snapshot. You just open the file in the 3 AM snapshot, since that is the point in time at which you want to see the file’s state. If you want to see the state of the file at 3 AM, why would you instead go look at what it was at 11 PM? That makes no sense to me :slight_smile: