Does pruning often shorten the pruning time?

From Removing backup snapshots — restic 0.18.1 documentation

Pruning snapshots can be a time-consuming process, depending on the number of snapshots and data to process. During a prune operation, the repository is locked and backups cannot be completed. Please plan your pruning so that there’s time to complete it and it doesn’t interfere with regular backup runs.

I run restic with forget --prune on a regular basis and was wondering whether running it often makes the individual runs shorter. In other words:

  • is a forget --prune dependent on how much data there is to reorganize? → the less “to be deleted data” there is, the faster it is because this data is known without having to go through the whole backup
  • or is it always more or less the same because all of the backup needs to be analyzed anyway?

Let’s assume a small backup delta, i.e. the data does not change much (say, less than 1% of the whole backup is new data).

As I understand it, the prune process is basically mark-and-sweep garbage collection:

  • Mark: The tree of each snapshot still in the repository is crawled recursively, and every blob encountered is added to the set of blobs to keep.
  • Sweep: Blobs not in this set are unused, and therefore eligible to be removed (via repacking and/or entire pack deletion for packs containing only unused blobs).
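
To make the two steps concrete, here is a minimal sketch in Python. It is a toy model, not restic's code: the `trees` and `packs` dictionaries, the IDs in them, and the `mark`/`sweep` functions are all made up for illustration. The point it shows is that the mark step walks everything you keep, while the sweep step only classifies packs.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical toy repository (not restic's data model).
# trees: tree ID -> IDs it references (subtrees or data blobs).
trees: Dict[str, List[str]] = {
    "root1": ["dirA", "blob1"],
    "root2": ["dirA", "blob2"],   # shares dirA with root1
    "old_tree": ["blob4"],        # root of a snapshot that was forgotten
    "dirA": ["blob3"],
}

# packs: pack ID -> blob IDs stored in that pack.
packs: Dict[str, List[str]] = {
    "packA": ["root1", "root2", "dirA"],
    "packB": ["blob1", "blob2", "old_tree"],
    "packC": ["blob3"],
    "packD": ["blob4"],
}


def mark(snapshot_roots: Iterable[str]) -> Set[str]:
    """Mark: collect every blob reachable from the remaining snapshots."""
    used: Set[str] = set()
    stack = list(snapshot_roots)
    while stack:
        blob_id = stack.pop()
        if blob_id in used:
            continue  # already visited, e.g. a tree shared by two snapshots
        used.add(blob_id)
        stack.extend(trees.get(blob_id, []))  # data blobs have no children
    return used


def sweep(used: Set[str]) -> Dict[str, str]:
    """Sweep: classify each pack as keep, repack, or delete."""
    plan: Dict[str, str] = {}
    for pack_id, blobs in packs.items():
        kept = [b for b in blobs if b in used]
        if len(kept) == len(blobs):
            plan[pack_id] = "keep"      # only used blobs
        elif not kept:
            plan[pack_id] = "delete"    # only unused blobs
        else:
            plan[pack_id] = "repack"    # mixed: rewrite the used blobs
    return plan


used = mark(["root1", "root2"])
print(sweep(used))
```

With `root1` and `root2` as the remaining snapshots, `packB` ends up as a repack candidate (it still holds `blob1` and `blob2`), and `packD` can be deleted outright.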

The technical implementation is a bit more complex than this, but that’s the gist. What you’ll notice is that the number of blobs you keep is more relevant than the number of blobs you delete in terms of the CPU and RAM consumption of the “mark” step, and this is probably going to be the most time-consuming step for most repositories. (Repacking can take some time, especially on slower storage backends, but depends on a lot of different factors, some of which are somewhat unpredictable, like how the unused blobs are distributed in packs.)

I’d say the tl;dr answer is: kind of, but not as much as you would think.

Some comments:

  • For the detection of used blobs, some shortcuts can be taken, e.g. identical trees that appear in more than one snapshot are still visited only once.
  • The tree traversal is usually quite fast as the tree blobs are locally cached. But yes, the more trees you have, the longer that step takes.
  • For the removal of unused data, the repacking is the crucial step (which typically also dominates the whole prune). Here, allowing some unused data to be kept (--max-unused) not only makes the prune faster, but also lets follow-up prunes make better decisions about what to remove.
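
The effect of tolerating some unused data can be illustrated with a small Python sketch. This is a toy model under my own assumptions, not restic's actual selection logic: the `choose_repacks` function, the byte counts, and the "cheapest pack first" ordering are all hypothetical. It only shows why a tolerance lets a prune skip packs that are expensive to rewrite for little gain.

```python
from typing import Dict, List, Tuple

# Hypothetical pack statistics: pack ID -> (used bytes, unused bytes).
PackStats = Dict[str, Tuple[int, int]]


def choose_repacks(packs: PackStats, max_unused_fraction: float) -> List[str]:
    """Pick packs to repack until unused data is within the tolerance.

    Toy heuristic (an assumption, not restic's implementation): repack the
    packs with the fewest used bytes per unused byte first, because they
    free the most space for the least rewriting.
    """
    total = sum(used + unused for used, unused in packs.values())
    unused_total = sum(unused for _, unused in packs.values())

    candidates = sorted(
        (p for p, (_, unused) in packs.items() if unused > 0),
        key=lambda p: packs[p][0] / packs[p][1],
    )

    chosen: List[str] = []
    for pack_id in candidates:
        if unused_total <= max_unused_fraction * total:
            break  # remaining unused data is tolerated
        _, unused = packs[pack_id]
        chosen.append(pack_id)
        unused_total -= unused  # freed by the repack
        total -= unused         # repository shrinks by the same amount
    return chosen


example: PackStats = {
    "p1": (90, 10),   # mostly used: expensive to repack, little gain
    "p2": (10, 90),   # mostly unused: cheap to repack, large gain
    "p3": (50, 50),
    "p4": (100, 0),   # fully used: never a candidate
}

print(choose_repacks(example, 0.0))   # no tolerance
print(choose_repacks(example, 0.05))  # tolerate 5% unused data
```

In this example, a zero tolerance forces all three partly unused packs to be rewritten, while a 5% tolerance leaves `p1` alone. If more of `p1` becomes unused later, a follow-up prune can reclaim it at lower cost.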

So, I’d say running prune more often does shorten each individual prune - but the total time will still be larger than with fewer but bigger prunes.
