Thanks for the response, and happy new year!
What am I doing? Mostly experimenting and getting to know the repository format better. I’ve spent the last few days writing some tools that directly access the repository (thank you for the excellent description of the repository format!), and that’s been a lot of fun.
I became interested in this because I noticed unexpected behavior when running restic stats --mode files-by-content against the entire repository (without specifying a snapshot):
- It’s really slow on my machine (over an hour to run against a 30 GB repo on local storage).
- It uses very little memory, and surprisingly little disk IO.
- But it maxes out a single CPU core on my machine for the entire duration (and leaves all other cores unused).
I started wondering “why is the client working so hard to calculate these stats? Is it decrypting only what is needed for the statistic? Are there some shortcuts available to generate the stats faster? Could it be improved with multithreading? Could I increase speed if I allow it to use more memory?”
So I decided to write a little tool to test those questions.
It’s a work in progress. So far, I can open the repo, get the keys, read and parse the index files (building an in-memory map of the repo that combines all indices and maps blobs to packs), read and parse the snapshots, and locate and read blobs.
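To make the index-merging step concrete, here is a minimal Python sketch of the idea, not the actual tool. It assumes the index files have already been decrypted into JSON, and that each one has a `packs` list whose entries carry an `id` and a `blobs` list (each blob with `id`, `type`, `offset`, `length`), as I understand the repository format description. The sample data and the name `build_blob_map` are made up for illustration.

```python
import json

# Two tiny, made-up index documents (already decrypted), for illustration only.
INDEX_A = json.loads("""
{"packs": [{"id": "pack-1", "blobs": [
    {"id": "blob-a", "type": "data", "offset": 0,   "length": 100},
    {"id": "blob-b", "type": "tree", "offset": 100, "length": 50}]}]}
""")
INDEX_B = json.loads("""
{"packs": [{"id": "pack-2", "blobs": [
    {"id": "blob-a", "type": "data", "offset": 0,   "length": 100},
    {"id": "blob-c", "type": "data", "offset": 100, "length": 10}]}]}
""")


def build_blob_map(index_docs):
    """Merge index documents into {(blob_type, blob_id): (pack_id, offset, length)}."""
    blob_map = {}
    for doc in index_docs:
        for pack in doc.get("packs", []):
            for blob in pack.get("blobs", []):
                key = (blob["type"], blob["id"])
                # A blob may be listed in more than one pack; this sketch
                # simply keeps the first location it sees.
                blob_map.setdefault(key, (pack["id"], blob["offset"], blob["length"]))
    return blob_map


blob_map = build_blob_map([INDEX_A, INDEX_B])
for key, location in sorted(blob_map.items()):
    print(key, "->", location)
```

Keying on the (type, id) pair and keeping only the first pack per blob is a simplification chosen for the sketch; a real tool might want to remember every pack that contains a blob.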
Next up: parse trees. I think that should be sufficient to begin experimenting with calculating statistics.