Group files by age - Like a generational garbage collector


#1

I have had an idea about how to improve restic, specifically how to reduce the churn of unused files when repacking after a prune. (See also pull request #1994 - Prune repack threshold).

My thinking is that repacking packs in the repository to remove unused blobs is similar to how a garbage collector works in a language like Java, and it suffers from a similar problem: a lot of data dies young and needs to be collected, creating a lot of churn.

The fix in virtual-machine garbage collectors is to treat young and old objects separately (generational collection). Restic could do the same for the files that it backs up and for those that it moves into new packs when repacking as part of a prune & forget.

My suggestion is that, when doing a backup, restic should sort the files by creation time and put the oldest files in separate packs from the youngest. (If that is too expensive, just separate them into four or so bins by age.) A file that is more than a year old will probably be around for a long time, while one that is only a few hours old could well be deleted soon, so it should share a pack with other young files.

My thinking is that if we had this feature, then a prune and forget would do much less I/O, because most packs would contain only old files that don’t need repacking. The threshold feature would also work better: a low threshold (e.g. 20%) would cause most young packs to be rebuilt while not affecting old packs that had only one or two files removed.


#2

I’m not a developer, but from my understanding this won’t work, because files are split into parts/chunks and then packed. Those chunks can be part of any file in any snapshot.


#3

This is a good idea: a backup repository behaves much like a chunked memory allocator, so similar solutions can apply.

My question is: does this matter in the steady state? After the initial backup, everything being saved is a new change by definition, and once the oldest snapshot removed by prune is newer than the initial backup, what are you packing? I guess this suggestion is for prune when it removes N packs and replaces them with M new ones, and the hope is that if the M new packs are sorted correctly, N will be smaller the next time.

To @764287’s point, I think we could use the timestamps of the N packs being removed as a proxy for the age of the data. I am not sure restic gets that timestamp from all of the backends, but is there a header in the pack file with a timestamp? When the M new packs are being created, processing the N old packs in time order might be more efficient.

To evaluate ideas like this, we need a reproducible, realistic set of backup data. The duplicacy benchmarks used a series of snapshots of the Linux kernel. You could try different repacking methods, back up the kernel for every week of the last 10 years, and see whether prune traffic is reduced. It would also be a nice way to select a good default repack threshold.


#4

Thanks for suggesting ways to improve restic! While this is a nice idea in theory, I think it’s not as good in practice. Several things come to mind:

  • Sorting the files by modification date before processing them during backup is hard to do, since restic walks the paths given on the command line in depth-first order. Ordering files only within a single directory seems pointless.
  • Restic splits the files into one or more chunks, and afterwards the chunks are completely independent from the files; restic no longer cares where a particular blob came from. Tracking this information would be expensive (in terms of memory).
  • A blob may be referenced by many different files, which may be completely different in terms of age.
  • The first run of prune will probably mix the blobs together anyway, unless we take special care (which would be memory-intensive again).
  • For the very first backup this strategy seems like a nice idea, but what about subsequent backups? Most blobs will already be in the repo, and all newly added files will have a recent modification date, so we can just carry on as usual.

This does sound very negative (I’m sorry for that), but I don’t think it’s a good idea to improve the packing algorithm at this point, before we do some major cleanups and restructurings :slight_smile:

FYI, I have a restic repo containing the extracted Linux kernel sources, one snapshot per revision starting at 2.2 or so; it’s about 50GiB in size.