Pruning smarter / Best way to use tags?

thedaveCA · October 21, 2021, 8:56pm

I’m looking for some advice about how people are using tags, or whether there is a better approach to forgetting old snapshots than what I am doing.

restic has been a fantastic set-it-and-forget-it solution for me, with the one exception that I can’t quite figure out how to automatically forget content. I have a set of scripts across various machines that do backups at their own convenience/schedule; in some cases the --files-from list is generated automatically.

My goal is to do something like this:

restic forget --tag userdata --keep-last 3 --keep-daily 7

I was using tags based on the type of content (configuration, logs, userdata, vm-images, etc.) with the intention of being able to prune by content type.

This falls down when a user adds a folder to their list (as provided by --files-from or a wildcard): the prior snapshot is then retained forever, because restic sees its set of paths as unique.

For example, I have this snapshot that I wanted deleted years ago:

dfae38ff  2019-10-21 11:04:22  yar.example.com  userdata     /mnt/backups/important/bob
                                                             /mnt/backups/important/joe

But it will be retained forever, because the subsequent snapshot added a new path, which is seen as unique:

dfae38ff  2021-10-21 11:05:55  yar.example.com  userdata     /mnt/backups/important/bob
                                                             /mnt/backups/important/joe
                                                             /mnt/backups/important/sue

In reality there is nothing special about dfae38ff that would make me want to retain it – I understand that when a path is removed, the situation could be different in some cases, although for me it is all the same.

I’m guessing the best way to handle this would be to use unique tags for each “job”, and then in forget use --group-by "host,tags"? Is there a better way to handle this? Or is there a better way to use restic as a whole?
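For anyone reading along, here is a minimal sketch of the per-job tag idea. It relies on restic's forget command grouping snapshots by host and paths unless told otherwise, which is why a changed path list leaves the old snapshot alone in its own group. The job tag `job-important` and the file-list path are hypothetical names, not something from my setup. `RESTIC` defaults to `echo restic`, so the script only prints the commands, and `--dry-run` makes forget report what it would remove without removing anything.

```shell
#!/bin/sh
# Sketch only. RESTIC defaults to "echo restic", so commands are printed,
# not executed; set RESTIC=restic to run them against a real repository.
RESTIC="${RESTIC:-echo restic}"

# Back up one job with a content-type tag plus a job-specific tag.
# $1 = policy tag (e.g. userdata), $2 = job tag (hypothetical), $3 = file list
backup_job() {
    $RESTIC backup --tag "$1" --tag "$2" --files-from "$3"
}

# Apply the retention policy per (host, tag set) instead of per (host, paths),
# so a changed path list no longer starts a new group.
# $1 = policy tag used to select the snapshots
forget_job_policy() {
    $RESTIC forget --tag "$1" --group-by host,tags \
        --keep-last 3 --keep-daily 7 --dry-run
}

backup_job userdata job-important /etc/backup/important.list
forget_job_policy userdata
```

Because the grouping key is the full tag set, every snapshot of a job has to carry exactly the same tags for this to work; a snapshot with an extra tag would land in a separate group.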

rawtaz · October 21, 2021, 9:21pm

I agree with your assessment, although nowadays I only use tags for the grouping. The reason is that my clients changed their hostname depending on which network they were on, so that messed it up for me. But now, I just set a specific tag per client, group on tags, and forget with a simple policy like yours. You might want to do the same, unless you know you have consistent hostnames (then again, what’s the point of grouping on hostname too, if each one of your jobs has its own tag anyway).
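Roughly, the tag-only variant could look like the sketch below. The `client-<name>` tag scheme and the example path are assumptions for illustration, and the keep values are simply copied from the command in the first post. As written it only prints the commands, because `RESTIC` defaults to `echo restic`.

```shell
#!/bin/sh
# Sketch only. RESTIC defaults to "echo restic" (print, do not execute).
RESTIC="${RESTIC:-echo restic}"

# Tag every snapshot with a stable per-client label.
# $1 = client name (hypothetical naming scheme), $2 = path to back up
backup_client() {
    $RESTIC backup --tag "client-$1" "$2"
}

# Group on tags only, so a hostname that changes between networks
# does not split one client's snapshots into several groups.
forget_by_client_tag() {
    $RESTIC forget --group-by tags --keep-last 3 --keep-daily 7 --dry-run
}

backup_client laptop1 /home
forget_by_client_tag
```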

thedaveCA · October 21, 2021, 9:40pm

I do have consistent hostnames. Actually now that I think about it, technically some jobs can run from different hosts (against the same data), but it isn’t happening regularly.

I like having the host field, but I wonder if I need it? Maybe I should repurpose this field for the “job” so that I can continue to use tags to assign retention policies? This would work well for jobs that technically can run from different hosts.

rawtaz · October 21, 2021, 9:42pm

You can have multiple tags in snapshots, but it should be fine to reuse the host field as well - it doesn’t have to be an actual hostname as you can set it to whatever you want basically.

ctonsing · October 22, 2021, 8:34am

Interesting discussion. My situation is similar: for various reasons I need multiple tags per snapshot, to link it to a certain backup task and to identify the specific execution of the task. I fix the hostname so that even if the machine should be re-created with a different actual hostname and the backups fired up again, it would still report the same hostname to restic. Then for the forget operation, I group by host only. This enables me to execute the retention policy across all snapshots in the given task irrespective of the paths which may change over time for the task.
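A rough sketch of that setup, assuming one fixed host label per task. The label, the task tag, the run tag, and the path are all hypothetical, and the keep values are borrowed from the first post, not from my actual policy. `RESTIC` defaults to `echo restic`, so nothing is executed unless you override it.

```shell
#!/bin/sh
# Sketch only. RESTIC defaults to "echo restic" (print, do not execute).
RESTIC="${RESTIC:-echo restic}"

# Back up with a fixed host label and two tags: one for the task,
# one for this particular execution of the task.
# $1 = fixed host label, $2 = task tag, $3 = run tag, $4 = path to back up
backup_task() {
    $RESTIC backup --host "$1" --tag "$2" --tag "$3" "$4"
}

# Select one host label and group by host only, so neither changing paths
# nor the per-run tag splits the snapshots into separate groups.
# $1 = fixed host label
forget_task() {
    $RESTIC forget --host "$1" --group-by host \
        --keep-last 3 --keep-daily 7 --dry-run
}

backup_task web01 task-www run-0001 /var/www
forget_task web01
```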