Finding big snapshots

Let’s describe the use-case I am trying to solve:
I take very frequent snapshots of my ~ and regular snapshots of /. I back up to a small disk, which only works because of restic’s awesome deduplication (thanks for an awesome tool!). My retention settings are --keep-within 2d --keep-daily 14 --keep-weekly 16 --keep-monthly 18 --keep-yearly 3

Sometimes I download larger files that I only keep for a while. For example compressed multi-gigabyte log dumps, Steam games I only play for a while, ISO images (sparse, but anyway).
I excluded my Steam games(*) since they can easily be redownloaded and, apart from a few files, don’t need a backup. When I think of it, I (ab)use an exclude for downloads that I don’t want in my backup. (I could also make a separate Downloads-nobackup folder.)

But I don’t always know that before I allocate disk space to something, or I just forget it will be backed up. So there are probably huge files in some snapshots that are irrelevant and were only there for a week or two. If I am lucky my retention will just get rid of them eventually, but if I am unlucky an unnecessarily large snapshot will end up as a “monthly” snapshot and last quite a while.

If I knew the file’s name I could probably use restic find, but I am looking for a way to find unknown large files (or directories).

I have two approaches in mind:

  1. Using restic diff I could check if there is a spike in added space somewhere, but that doesn’t tell me whether the data was removed again at a later point.
  2. Looping over every snapshot, every tree, every subtree and every file, and comparing the content blob IDs to see if some only appear in a handful of snapshots.
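The second approach could be sketched roughly like this. It is only a minimal sketch: it assumes the blob IDs and sizes per snapshot have already been collected into a plain dict (`snapshot_blobs` is a made-up name, and how to fill it from the repository is left open here). It then reports blobs that are referenced by only a few snapshots and how many bytes each group of snapshots holds exclusively.

```python
from collections import defaultdict


def short_lived_blobs(snapshot_blobs, max_snapshots=3):
    """Return {blob_id: (size, [snapshot ids])} for blobs that are
    referenced by at most `max_snapshots` snapshots.

    `snapshot_blobs` is a hypothetical input: {snapshot_id: {blob_id: size}}.
    """
    seen_in = defaultdict(set)
    sizes = {}
    for snap_id, blobs in snapshot_blobs.items():
        for blob_id, size in blobs.items():
            seen_in[blob_id].add(snap_id)
            sizes[blob_id] = size
    return {
        blob_id: (sizes[blob_id], sorted(snaps))
        for blob_id, snaps in seen_in.items()
        if len(snaps) <= max_snapshots
    }


def bytes_per_snapshot_set(rare):
    """Sum blob sizes, grouped by the exact set of snapshots holding them."""
    totals = defaultdict(int)
    for size, snaps in rare.values():
        totals[tuple(snaps)] += size
    return dict(totals)
```

A large total for a small group of snapshots would hint that forgetting exactly those snapshots (and pruning) frees about that much space.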

Do you have any other ideas how to approach this? Is this maybe something somebody already did? Is this something interesting for more people than just me? Is this something restic would like to manage itself? If this is something totally alien to you, how could I approach a solution for myself; like how should I program it?


(*) Actually I excluded the entire Steam directory instead of just steamapps, which I regretted recently when a Steam issue forced me to restore my Steam config. Recent Steam bug

It would probably be easier to back up the download and temp folders to another repository with a different retention policy. That makes the pruning easier (having said that, I agree it would be great to have more statistical visibility into the data change rate and sharing by snapshot). It does seem to be scriptable to some extent, especially when you know where to look for large files, and if you don’t care about the actual re-use of the data blocks and can go by file size: “Only snapshots 1001 and 1002 contain 1GB file x.download”. (There won’t be a guarantee that those snapshots are the only consumers of that 1GB worth of chunks, so deleting them may free less.)
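Going purely by file size could look something like the sketch below. It assumes the listing of each snapshot comes from `restic ls --json <snapshot>` as one JSON object per line, and that file entries carry `type`, `path` and `size` fields; please check that against the output of your restic version. The function names and the `ls_by_snapshot` dict are made up for illustration.

```python
import json
from collections import defaultdict


def large_files(ls_lines, min_size):
    """Yield (path, size) for regular files of at least `min_size` bytes.

    `ls_lines` is an iterable of JSON lines (assumed `restic ls --json` output).
    Lines without type "file" (snapshot header, directories) are skipped.
    """
    for line in ls_lines:
        line = line.strip()
        if not line:
            continue
        node = json.loads(line)
        if node.get("type") == "file" and node.get("size", 0) >= min_size:
            yield node["path"], node["size"]


def snapshots_per_large_file(ls_by_snapshot, min_size):
    """Map (path, size) -> list of snapshot IDs containing that large file."""
    found = defaultdict(list)
    for snap_id, lines in ls_by_snapshot.items():
        for path, size in large_files(lines, min_size):
            found[(path, size)].append(snap_id)
    return dict(found)
```

Entries that show up in only one or two snapshots are the candidates. As said, this ignores deduplication, so it is a hint and not a measurement.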


I run a restic diff on the last two restic backups. The backup runs then the diff runs. I review the diff output to see that there are the files that I’ve changed recently. Because it is a Windows machine, the diff also shows how much junk Microsoft shoves onto my personal machine just because Microsoft treats my machine like they owned it. You would need to script the output of the diff to look for files with a size > x. Use the json output.
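Scripting that could look roughly like this. It is a sketch under assumptions: it expects the lines of `restic diff --json <old> <new>` and assumes they contain `change` messages (with `path` and `modifier`) and one `statistics` message with an `added.bytes` field, which may differ or be missing in your restic version. As far as I know the change messages carry no per-file size, so this only checks the total added bytes and lists the added paths. The function names are made up.

```python
import json


def summarize_diff(diff_lines):
    """Return (added_paths, added_bytes) from assumed `restic diff --json` lines.

    `added_bytes` is None if no statistics message was found.
    """
    added_paths = []
    added_bytes = None
    for line in diff_lines:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        kind = msg.get("message_type")
        if kind == "change" and msg.get("modifier") == "+":
            added_paths.append(msg["path"])
        elif kind == "statistics":
            added_bytes = msg.get("added", {}).get("bytes")
    return added_paths, added_bytes


def too_much_added(diff_lines, limit_bytes):
    """Return (alarm, added_paths); alarm is True if added bytes exceed the limit."""
    paths, added = summarize_diff(diff_lines)
    return (added is not None and added > limit_bytes), paths
```

To get the size of a single added file you would still have to look it up in the newer snapshot, for example via its file listing.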


I like the idea of making sure nothing too big got added. Not sure why, but --json didn’t work for me on diff. The plain output seems parseable enough, though. I was hoping to get size information in the diff JSON.

Filtering out some lines in the diff should be fairly easy with some PowerShell commands. I think I have only ever modified two PowerShell scripts, but I am sure there is something like grep.

Hm, I like the idea and the Downloads folder is not exactly useful for longer retention.

It wouldn’t even need to be another repository since I could use the path or tags to apply different retention policies. The forgets would be different but the prune could still run once.
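Something like the following is what I have in mind, as a rough sketch. The tags `home` and `downloads` and the 7-day policy for downloads are made up for illustration; the long policy is the one from my first post. It only builds the command lines (one forget per policy, then a single prune) and does not run anything.

```python
def forget_command(keep, tag=None, path=None):
    """Build a `restic forget` argument list for one retention policy.

    `keep` maps option names like "keep-daily" to their values.
    """
    cmd = ["restic", "forget"]
    if tag:
        cmd += ["--tag", tag]
    if path:
        cmd += ["--path", path]
    for option, value in keep.items():
        cmd += [f"--{option}", str(value)]
    return cmd


# Hypothetical policies: long retention for "home", short for "downloads".
POLICIES = [
    {"tag": "home", "keep": {"keep-within": "2d", "keep-daily": 14,
                             "keep-weekly": 16, "keep-monthly": 18,
                             "keep-yearly": 3}},
    {"tag": "downloads", "keep": {"keep-daily": 7}},
]


def maintenance_commands(policies):
    """Return one forget command per policy followed by a single prune."""
    commands = [forget_command(**policy) for policy in policies]
    commands.append(["restic", "prune"])
    return commands
```

Each list could then be passed to `subprocess.run`. Both policies need their own filter, since a forget without a tag or path filter would also apply to the downloads snapshots.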

Thanks for the idea!