What *technically* happens when running forget/prune?

I’m pretty certain I have a good grasp on how restic works its magic, but I just want to make really sure for one point in particular.

Let’s say you have 10 backups of one certain directory (only containing files), 1 for every day, so 10 days worth of backups. Now let’s say that on the 4th day (before the backup runs) I add a new file to the directory and keep it at least until the 10th backup. Let’s also say (for the sake of clarity) that the contents of the files don’t change at all. Then I decide to delete the 4th snapshot. Now what I think happens:

  • Restic checks which files were only in that backup and deletes those.
  • This means that the blobs added before the 4th backup have now “physically moved” into the 5th snapshot and are referenced by backups 6 thru 10 (because of deduplication, this data is not actually present in subsequent backups). I do understand that even snapshot 5 simply refers to an index for this data, but it would still be the “origin” snapshot of the file.
  • When I restore the 5th snapshot, this new file should also/still be there.

Is this (largely) correct or doesn’t restic work like that at all? I’d like to see a technical explanation of what exactly is going on when restic deletes snapshots.

You are mostly correct except the “physically moved”, I would say :wink:

In more detail, it goes like this:

  • snapshots only reference a tree, which in turn only references subtrees and/or file contents
  • in your example, snapshots 1 to 3 all reference the same tree, and snapshots 4 to 10 all reference another tree to which only a reference to the contents of the additional file has been added
  • removing any snapshot does not delete any tree or file contents
  • when running prune, directly or indirectly via forget --prune, restic searches for all trees and file contents that are no longer referenced by snapshots (or by trees referenced by snapshots, or by sub-trees referenced by those trees, and so on…). Those trees and file contents are then removed.
  • this implies that there is no “origin” snapshot. All snapshots are, in a sense, independent.
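
To make the points above concrete, here is a minimal Python sketch of the idea. It is a toy model and not restic's actual code or on-disk format: the dictionaries, the IDs (`tree-A`, `blob-new`, …) and the function names are all made up for illustration. Snapshots point at a tree, trees point at blobs, `forget` only drops the snapshot entry, and `prune` removes whatever is no longer reachable from any remaining snapshot.

```python
def build_example():
    """Toy repository for the example: 10 daily snapshots, new file from day 4 on."""
    blobs = {"blob-old": b"old file", "blob-new": b"new file"}
    # A tree lists the IDs it references (here only file contents, no subtrees).
    trees = {"tree-A": ["blob-old"], "tree-B": ["blob-old", "blob-new"]}
    # Snapshots 1-3 share tree-A, snapshots 4-10 share tree-B.
    snapshots = {day: ("tree-A" if day <= 3 else "tree-B") for day in range(1, 11)}
    return snapshots, trees, blobs


def forget(snapshots, day):
    """Forgetting only removes the snapshot entry; no tree or blob is touched."""
    del snapshots[day]


def reachable(snapshots, trees):
    """Collect every tree and blob ID reachable from any remaining snapshot."""
    seen = set()
    stack = list(snapshots.values())
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # Trees reference further IDs; blobs have no entry in `trees`.
        stack.extend(trees.get(node, []))
    return seen


def prune(snapshots, trees, blobs):
    """Remove all trees and blobs that no remaining snapshot references."""
    live = reachable(snapshots, trees)
    for tree_id in [t for t in trees if t not in live]:
        del trees[tree_id]
    for blob_id in [b for b in blobs if b not in live]:
        del blobs[blob_id]
```

In this model, forgetting snapshot 4 and pruning removes nothing, because snapshots 5 to 10 still reference `tree-B` directly. Only once snapshots 4 to 10 are all forgotten does a prune drop `tree-B` and `blob-new`.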

So the internals are slightly different, but you are right that the new file is still available from snapshots 5 to 10.

Right, so even if a file is added during the Nth snapshot, all other backups still reference the file’s contents independently, instead of going “through” snapshot N. The latter is how Time Machine works; it stores hard links to files already present in the previous backup, then when you remove the backup that contains the original files they will be moved into the next backup (if they were still present of course). Also, every backup refers to only the direct previous one, so file X in backup 5 will hard link to X in backup 4, which in turn links to X in backup 3 when it was first added. Of course, it doesn’t do delta-based backups so it will copy the entire file again if something changed.

You can see Time Machine acts quite similarly to restic in some aspects, but it’s slow as hell because of some unusual design decisions. Probably to make it easier for “regular” people to use/browse backups, but that causes it to take ages if you have a reasonable amount of data. It looks like restic doesn’t have the same pitfalls so that’s good, thanks for the info. =]

My experience with TimeMachine is that its speed is quite ok with local storage / a backup target in the LAN. Once the latency to the backup target reaches a few dozen milliseconds, you can watch TimeMachine making essentially no progress at all.

Just to clarify, that’s not how hard links work. They actually work much like blobs in restic. Creating a hard link is just creating another filename in the filesystem through which the same inode can be accessed. This is why, for example, changing permissions on one of the filenames changes permissions on the other – permissions are stored on the inode.

After creating a hard link, it’s impossible to distinguish just by looking at each file which is the “original.” Both filenames are peers with respect to the inode. Neither “owns” the inode any more than the other.

So deleting the filenames in a prior backup doesn’t “move the contents into the next backup.” Simply, it removes one name through which the same inode can be accessed. The inode remains where it is (until the last filename referring to it is deleted, and only then is the inode freed).
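
If you want to see this for yourself, here is a small Python sketch using the standard `os.link` call. The file names and the `hard_link_demo` function are made up for illustration, and it assumes the temporary directory lives on a filesystem that supports hard links. It creates a file, adds a second name for it, deletes the first name, and shows that the data is still reachable through the second one.

```python
import os
import tempfile


def hard_link_demo():
    """Create two names for one inode, delete the first, read via the second."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "backup1_file")   # hypothetical "older backup" name
        second = os.path.join(d, "backup2_file")  # hypothetical "newer backup" name
        with open(first, "w") as f:
            f.write("data")

        # Add a second filename for the same inode; nothing is copied.
        os.link(first, second)
        same_inode = os.stat(first).st_ino == os.stat(second).st_ino
        links_before = os.stat(first).st_nlink

        # Removing one name only decrements the link count.
        os.remove(first)
        links_after = os.stat(second).st_nlink
        with open(second) as f:
            content = f.read()

        return same_inode, links_before, links_after, content
```

Both names report the same inode number, the link count drops when the first name is removed, and the contents never move anywhere.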


And again, I learned something. Thanks :heart_eyes: