Retain Quarterly or Semi-annual backups with Forget


#1

With forget, the transition between retaining monthly and yearly backups is a bit too abrupt for me. Has there been any thought of adding a quarterly time period? Ideally, I’d like to retain something like 13 months, 13 quarters, 99 years. Semi-annual would be a nice option to have as well (implemented as 2 quarters?).


#3

To be honest, I would prefer no more “special case” flags and just a generic “keep X backups every Y duration” option. For example, we could call it --keep-policy '13 of 3 month' to implement your “quarters” concept. Then, the built-ins are just shortcuts:

  • --keep-hourly $N is --keep-policy "$N of 1 hour"
  • --keep-daily $N is --keep-policy "$N of 1 day"
  • --keep-weekly $N is --keep-policy "$N of 1 week" (or 7 day)
  • --keep-monthly $N is --keep-policy "$N of 1 month"
  • --keep-yearly $N is --keep-policy "$N of 1 year"

@fd0, what are your thoughts on this concept? (We can nitpick the syntax later; I just came up with something quick as an example.)
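To make the idea concrete, here is a minimal Python sketch of how a generic “N of Y duration” rule could pick snapshots. It is not restic code: the policy syntax is only the example from this post, and parse_policy, bucket and keep are hypothetical names. It assumes that, like the existing --keep-* options, the newest snapshot in each period is the one retained.

```python
import re
from datetime import datetime, timezone

# Fixed-length units in seconds; month and year are handled by calendar fields.
# Assumption: weeks are counted from the Unix epoch, not from Monday.
FIXED_SECONDS = {"hour": 3600, "day": 86400, "week": 7 * 86400}


def parse_policy(text):
    """Parse the hypothetical '<count> of <step> <unit>' syntax."""
    m = re.fullmatch(r"\s*(\d+)\s+of\s+(\d+)\s+(hour|day|week|month|year)\s*", text)
    if m is None:
        raise ValueError(f"cannot parse policy: {text!r}")
    return int(m.group(1)), int(m.group(2)), m.group(3)


def bucket(ts, step, unit):
    """Map a timezone-aware timestamp to an integer period index."""
    if unit == "month":
        return (ts.year * 12 + ts.month - 1) // step
    if unit == "year":
        return ts.year // step
    return int(ts.timestamp()) // (FIXED_SECONDS[unit] * step)


def keep(timestamps, policy):
    """Return the newest timestamp in each of the last `count` non-empty periods."""
    count, step, unit = parse_policy(policy)
    newest = {}
    for ts in sorted(timestamps):
        newest[bucket(ts, step, unit)] = ts  # later timestamps overwrite earlier ones
    recent = sorted(newest)[-count:] if count > 0 else []
    return [newest[b] for b in recent]


# One snapshot on the first of every month in 2023.
stamps = [datetime(2023, month, 1, tzinfo=timezone.utc) for month in range(1, 13)]
for ts in keep(stamps, "2 of 3 month"):
    print(ts.date())
```

With those twelve monthly snapshots, "2 of 3 month" keeps the September and December ones, i.e. the newest snapshot in each of the last two quarters.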


#4

No, there isn’t. I found it very surprising that getting retention rules right was way more complex than I ever imagined. I think we have a good compromise here; adding more cases would make it even harder to understand what restic does.

If you have different needs for your retention schedule, you could run restic snapshots --json, parse the output in a script, and then call restic forget on the snapshots you’d like to forget. Would that work for you?
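A rough sketch of what such a script could look like in Python, assuming each entry in the JSON output carries an "id" and an RFC 3339 "time" field, and that the repository and password are supplied through the environment (e.g. RESTIC_REPOSITORY and RESTIC_PASSWORD). The helper names are made up for illustration, and only a “newest snapshot per quarter” rule is implemented.

```python
import json
import subprocess


def quarter_key(snapshot):
    """Return (year, quarter) taken from the 'YYYY-MM' prefix of the time field."""
    year, month = int(snapshot["time"][0:4]), int(snapshot["time"][5:7])
    return year, (month - 1) // 3 + 1


def newest_per_quarter(snapshots, quarters):
    """IDs of the newest snapshot in each of the last `quarters` non-empty quarters."""
    newest = {}
    # Assumption: all timestamps share one UTC offset, so comparing the
    # 'YYYY-MM-DDTHH:MM:SS' prefix as text orders them correctly.
    for snap in sorted(snapshots, key=lambda s: s["time"][:19]):
        newest[quarter_key(snap)] = snap["id"]
    recent = sorted(newest)[-quarters:] if quarters > 0 else []
    return {newest[key] for key in recent}


def forget_ids(snapshots, keep_ids):
    """Every snapshot ID that is not in the keep set, in the original order."""
    return [s["id"] for s in snapshots if s["id"] not in keep_ids]


def list_snapshots():
    """Run `restic snapshots --json` and return the parsed list."""
    result = subprocess.run(
        ["restic", "snapshots", "--json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def forget(ids, dry_run=True):
    """Call `restic forget` on explicit snapshot IDs; dry run unless told otherwise."""
    if not ids:
        return
    cmd = ["restic", "forget"] + (["--dry-run"] if dry_run else []) + list(ids)
    subprocess.run(cmd, check=True)
```

Typical use would be snaps = list_snapshots(), then forget(forget_ids(snaps, newest_per_quarter(snaps, 13))). Note that this forgets everything except the quarterly snapshots; to combine it with monthly or yearly rules, build the union of several keep sets before calling forget_ids, and run restic prune afterwards to actually free space.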

I agree, strongly.

I fear that this will be even harder to understand than the current solution restic offers. The options we offer seem to work for most people, so I’m hesitant to add anything else (which would need to be maintained, documented, tested and so on).


#5

Thinking further, I wonder if I am stuck in old school backup thinking, where disk space is a significant factor in choosing a backup solution.

Does it really matter if I have 4+ years of monthly backups instead of 13 months plus 13 quarters, given backups are incremental? That’s ~50 backups instead of 26, a significant number but is it a deal breaker? Probably not in my use case.

Thanks for your JSON solution, which is neat. I’ve only used forget on the head or tail of the list of snapshots, and the incremental nature confuses me sometimes: is it OK to remove snapshots in the middle of the list, e.g. given snapshots a, b, c and d, to remove b? How will c link back to a?


#6

I would think it depends very much on the nature of the data - if it’s mostly unchanging, then deduplication should take care of it. If there’s a lot of change, it might make more of an impact. It’s probably worth measuring, though.

If you think about it, the way forget works kind of implies that dropping a snapshot from the middle must be safe - otherwise, how could it throw away e.g. six daily snapshots between weekly snapshots? Or 23 hourly ones between the dailies? As I understand it, the snapshots aren’t actually linked, they’re each a full list of data chunks belonging to the snapshot and the “parent snapshot” is only used to speed up the process.


#7

Yes.

Restic implements a content-addressable filesystem similar to Git. It’s basically a big pool of objects: file contents (blobs) and directories (trees). If a file hasn’t changed between backups, its blob(s) have the same IDs and hence are not re-added to the repo since they already exist.
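As a toy illustration of the content-addressing idea (not restic’s actual storage code, which also chunks, encrypts and packs data), an object’s ID can simply be the SHA-256 hash of its content, so storing the same content twice is a no-op. BlobStore and its members are hypothetical names.

```python
import hashlib


class BlobStore:
    """In-memory content-addressed store: the ID is the hash of the content."""

    def __init__(self):
        self.objects = {}
        self.writes = 0  # how many times new data was actually stored

    def put(self, data: bytes) -> str:
        blob_id = hashlib.sha256(data).hexdigest()
        if blob_id not in self.objects:  # unchanged content is not stored again
            self.objects[blob_id] = data
            self.writes += 1
        return blob_id


store = BlobStore()
first = store.put(b"unchanged file contents")
second = store.put(b"unchanged file contents")  # e.g. the same file in the next backup
print(first == second, store.writes)
```

Both calls return the same ID and only one write happens, which is the deduplication described above.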

When removing a backup, you’re not actually removing the data from the repository, just a pointer to a particular tree (the root directory of the backup). Snapshots aren’t actually related to each other at all, except that they probably reference many of the same objects, which is how deduplication is implemented in restic.

Prune is what actually purges data, and it does so by mark-and-sweep garbage collection: it reindexes the repository (in case the indexes are bad – we don’t want to assume the indexes are valid if we’re going to be deleting things!) then crawls each snapshot’s root tree and marks each encountered object as “used.” When this is done, the objects that were not marked are not referenced by any snapshot and can be discarded.

This algorithm works no matter which snapshot is removed, whether it’s the newest, oldest, or somewhere in between.
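The mark-and-sweep idea can be sketched in a few lines of Python. This is a simplified model, not restic’s implementation: objects are a plain dict mapping an ID to the IDs it references (empty for blobs), snapshots map a name to a root tree ID, and the reindexing step is left out.

```python
def mark(objects, root, used):
    """Mark everything reachable from one snapshot's root tree."""
    stack = [root]
    while stack:
        oid = stack.pop()
        if oid in used:
            continue
        used.add(oid)
        stack.extend(objects[oid])  # trees list their children; blobs list nothing


def prune(objects, snapshots):
    """Delete every object that no remaining snapshot references."""
    used = set()
    for root in snapshots.values():
        mark(objects, root, used)
    removed = [oid for oid in objects if oid not in used]
    for oid in removed:
        del objects[oid]
    return removed


# Four snapshots a, b, c, d sharing most of their blobs.
objects = {
    "tree_a": ["f1", "f2"],
    "tree_b": ["f1", "f3"],
    "tree_c": ["f1", "f2", "f4"],
    "tree_d": ["f1", "f4"],
    "f1": [], "f2": [], "f3": [], "f4": [],
}
snapshots = {"a": "tree_a", "b": "tree_b", "c": "tree_c", "d": "tree_d"}

del snapshots["b"]  # "forget" a snapshot from the middle of the list
removed = prune(objects, snapshots)
print(sorted(removed))
```

Only tree_b and f3, the objects that nothing but b referenced, are discarded. Snapshot c still finds every object it needs, because it never pointed at b in the first place.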