Hey, good question! When writing the design document, I thought it was a good idea to add this, but in practice it turned out that superseded index files do not occur often. In regular operation, you should not have an index file that is superseded by another file.
What am I doing? Mostly experimenting and getting to know the repository format better. I’ve spent the last few days writing some tools that directly access the repository (thank you for the excellent description of the repository format!), and that’s been a lot of fun.
I became interested in this because I noticed unexpected behavior when running restic stats --mode files-by-content against the entire repository (without specifying a snapshot):
It’s really slow on my machine (over an hour to run against a 30 GB repo on local storage).
It uses very little memory, and surprisingly little disk IO.
But it maxes out a single CPU core on my machine for the entire duration (and leaves all other cores unused).
I started wondering “why is the client working so hard to calculate these stats? Is it decrypting only what is needed for the statistic? Are there some shortcuts available to generate the stats faster? Could it be improved with multithreading? Could I increase speed if I allow it to use more memory?”
So I decided to write a little tool to test those questions.
It’s a work-in-progress. Currently, I can open the repo, get the keys, read and parse the index files (and create an in-memory map of the repo, combining all indices, honoring all indices and mapping blobs to packs), read and parse the snapshots, locate and read blobs.
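For illustration, the index-combining step could look roughly like the Python sketch below. It is not the actual tool. It assumes the index files have already been decrypted to JSON text, and that each index has the layout I understand from the design document: an optional "supersedes" list of index IDs, plus "packs" entries that each list "blobs" with "id", "type", "offset" and "length". The function name build_blob_map and the shape of its input are made up for this example.

```python
import json


def build_blob_map(indexes):
    """Merge decrypted index files into one blob -> pack location map.

    `indexes` is a hypothetical input shape: a dict mapping each index
    file ID to its decrypted JSON text. Decryption is out of scope here.
    """
    parsed = {idx_id: json.loads(text) for idx_id, text in indexes.items()}

    # Collect every index ID that some other index declares as superseded.
    superseded = set()
    for doc in parsed.values():
        superseded.update(doc.get("supersedes", []))

    blob_map = {}
    for idx_id, doc in parsed.items():
        if idx_id in superseded:
            continue  # a newer index replaces this one, so skip it
        for pack in doc.get("packs", []):
            for blob in pack.get("blobs", []):
                key = (blob["type"], blob["id"])
                location = (pack["id"], blob["offset"], blob["length"])
                # A blob may be listed in several packs; keep the first seen.
                blob_map.setdefault(key, location)
    return blob_map
```

Looking up a blob is then a single dictionary access on (type, id), which returns the pack ID together with the offset and length to read from that pack.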
Next up: parse trees. I think that should be sufficient to begin experimenting with calculating statistics.