Remedy for too many snapshots

Hi :wave:

This is a bit of a clarification question to understand why restic gets slow/memory-heavy with too many snapshots.

A bit of intro: I currently have ~4000 hosts which are in a nice harmony of getting backups via automated timers, with forget/prune cycles on pre-defined retention periods. Whenever the time for an exclusive-lock operation arrives for a repository, it is externally “locked” and new backups are redirected to a “failover repo” which has the same key/config; the data that piles up in the meantime is later merged back into the main repository.

Sometimes I miss some maintenance, say when a restic process is stuck and I can’t lock the repository for a long time, so the forget/prune operation gets skipped for a few days. Then the snapshots pile up on the repository as expected (as an example, I can currently see 27000+ snapshots on one repository), which ends with the OOM killer kicking in on the hosts and stabbing the restic process since it hogs too much memory.

As far as I can see, this is because restic tries to read all the snapshots in the repository to check which one is best suited as a parent for the current backup operation.

Now my 3 questions are:

  1. Is my assumption correct? Does restic read all the snapshots, or is something else going on that makes restic use too much memory? (I only see OOM-killer action after these failed forget/prune attempts.)
  2. Can I lower the impact of this operation? E.g. by registering the last successful snapshot ID on the client and forcing restic to use that snapshot as a parent (because, logically, there isn’t a better candidate).
  3. Can something be done on the restic side to prevent this? E.g. an index of snapshots or something similar to help with slow/high-latency storage backends.

Thanks for even reading this far.

Also mandatory “restic is damn fire” mention here, you people rock! :grin:

The backup command has to read every snapshot to determine a suitable parent snapshot, unless you specify the parent snapshot explicitly. But either way that shouldn’t require a lot of memory, as the snapshots are processed one by one. Looking at 27k snapshots will take quite a bit of time, though.
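For intuition, here is a rough sketch of what that parent search amounts to (hypothetical, simplified types and selection criteria, not restic’s actual code; I’m assuming the parent is simply the newest snapshot with matching hostname and paths):

```go
package main

import (
	"fmt"
	"time"
)

// Snapshot is a hypothetical, simplified stand-in for a snapshot's metadata.
type Snapshot struct {
	ID       string
	Time     time.Time
	Hostname string
	Paths    []string
}

// findParent scans all snapshots and returns the newest one whose hostname
// and paths match the current backup. With tens of thousands of snapshots
// this means one small metadata read per snapshot, which is slow on
// high-latency storage, but each snapshot can be discarded right after the
// comparison, so it should not need much memory.
func findParent(all []Snapshot, hostname string, paths []string) *Snapshot {
	var best *Snapshot
	for i := range all {
		s := &all[i]
		if s.Hostname != hostname || !samePaths(s.Paths, paths) {
			continue
		}
		if best == nil || s.Time.After(best.Time) {
			best = s
		}
	}
	return best
}

func samePaths(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	now := time.Now()
	snaps := []Snapshot{
		{ID: "aaa111", Time: now.Add(-48 * time.Hour), Hostname: "host1", Paths: []string{"/srv"}},
		{ID: "bbb222", Time: now.Add(-24 * time.Hour), Hostname: "host1", Paths: []string{"/srv"}},
	}
	if p := findParent(snaps, "host1", []string{"/srv"}); p != nil {
		fmt.Println("parent:", p.ID) // parent: bbb222
	}
}
```

Supplying an explicit parent (the --parent option mentioned further down) simply skips this search; the cost of the search is mostly time, not memory.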

The major contributor to the memory usage is the index size, as the index always has to be loaded in full.
And the index size tends to grow with the number of snapshots.
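As a purely illustrative back-of-the-envelope calculation (both numbers below are made up, not restic’s actual per-entry cost), the scaling is roughly linear in the number of indexed blobs:

```go
package main

import "fmt"

func main() {
	// Hypothetical figures: N indexed blobs, E bytes of in-memory state per
	// index entry. The in-memory index then needs on the order of N*E bytes,
	// and N keeps growing as more snapshots (and thus more blobs) accumulate.
	const blobs = 50_000_000  // made-up blob count for a large repository
	const bytesPerEntry = 100 // made-up per-entry in-memory cost
	fmt.Printf("~%.1f GiB just for the index\n",
		float64(blobs)*float64(bytesPerEntry)/(1<<30))
}
```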

The only reliable solution is to use multiple smaller repositories instead of a single large one.

That would only speed up the process of determining the parent snapshot. But it wouldn’t reduce the memory usage.

Interesting, thanks.

I was thinking index files are only evaluated after a parent snapshot that refers to them has been found; I didn’t think every backup run reads (all?) the indexes.

Are indexes read only to catch already-uploaded files from interrupted backups (or any concurrent activity)? If so, is it possible to skip this index dance altogether, e.g. by supplying both the --no-scan and --parent options?

There is no special treatment for interrupted backups in restic so far. That a “continuation” is possible relies only on the fact that packs are already saved and the index is regularly updated. So the already-saved and already-indexed blobs can be (re-)used and don’t need to be stored again. Beyond that, a continued interrupted backup is nothing but a new backup.

The index is used to perform the deduplication. If a blob (chunk) is already saved in the repository, it shouldn’t be saved a second time. Therefore, after chunking a new file, restic needs to know whether each resulting chunk already exists in the repository.
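In other words, roughly like this (an illustrative sketch, not restic’s real API; the map stands in for the in-memory index):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// knownBlobs plays the role of the in-memory index in this sketch: during
// backup it only has to answer "is this blob already in the repository?".
var knownBlobs = map[[32]byte]bool{}

// storeChunk uploads a chunk unless a blob with the same hash is already
// known; that lookup is the deduplication described above.
func storeChunk(chunk []byte) (uploaded bool) {
	id := sha256.Sum256(chunk)
	if knownBlobs[id] {
		return false // already stored, nothing to upload
	}
	// ... upload the chunk to the repository here ...
	knownBlobs[id] = true
	return true
}

func main() {
	fmt.Println(storeChunk([]byte("hello"))) // true: new blob, uploaded
	fmt.Println(storeChunk([]byte("hello"))) // false: deduplicated
}
```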

The parent snapshot only enables the following optimization: if a file hasn’t changed between the parent snapshot’s backup run and the current one (where change detection relies only on file system metadata like mtime, etc.), then the file doesn’t need to be read/chunked/saved again; instead, all chunk information can be copied from the parent snapshot (and all contents are already in the repository as part of the parent).
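A sketch of that shortcut (again hypothetical, simplified types; real change detection looks at more metadata than just size and mtime):

```go
package main

import (
	"fmt"
	"time"
)

// Node is a hypothetical, simplified entry from the parent snapshot's tree:
// file metadata plus the IDs of the chunks that make up the file's content.
type Node struct {
	Size    int64
	ModTime time.Time
	Chunks  []string
}

// chunksFor returns the chunk IDs to record for a file in the new snapshot.
// If size and mtime still match the parent's entry, the chunk list is copied
// verbatim and the file is never read; otherwise the file has to be read,
// chunked and stored again (stubbed out here).
func chunksFor(parent map[string]Node, path string, size int64, mtime time.Time) []string {
	if n, ok := parent[path]; ok && n.Size == size && n.ModTime.Equal(mtime) {
		return n.Chunks // unchanged: reuse the parent's chunk references
	}
	return rechunk(path) // changed or new: the expensive path
}

func rechunk(path string) []string {
	// placeholder for reading, chunking and storing the file
	return []string{"new-chunk-id"}
}

func main() {
	mtime := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	parent := map[string]Node{
		"/srv/data.bin": {Size: 1024, ModTime: mtime, Chunks: []string{"c1", "c2"}},
	}
	fmt.Println(chunksFor(parent, "/srv/data.bin", 1024, mtime)) // [c1 c2], reused
	fmt.Println(chunksFor(parent, "/srv/data.bin", 2048, mtime)) // [new-chunk-id]
}
```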

A remark about “the index always has to be loaded in full”:

This is true if by “full” you mean that all index entries need to be loaded.
It is not fully correct, however, if you also mean all of the information per entry. The index stores quite a bit per blob: the type of the blob, the ID of the blob, the ID of the pack, the location of the blob within the pack, and compression information.
For backups you need the full information for the tree blobs contained in the parent snapshot. For data blobs, however, the ID of the blob is sufficient, as you never need to access the blob data, only to check whether it is already present.
As data blobs are usually the majority of blobs, not storing the superfluous information reduces memory usage quite a bit. (BTW: That’s how I implemented the index for the backup command in the Tool-Which-Must-Not-Be-Named :wink: )
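To make that concrete, here is a rough sketch of such a backup-oriented index (illustrative Go types and fields based on the information listed above, not the actual implementation of restic or the other tool): full location information is kept only for tree blobs, while data blobs go into a plain membership set.

```go
package main

import "fmt"

// ID stands in for a blob's SHA-256 hash.
type ID [32]byte

// IndexEntry holds the full per-blob information listed above: which pack
// the blob lives in, where inside that pack, and how it is compressed.
type IndexEntry struct {
	PackID             ID
	Offset, Length     uint32
	UncompressedLength uint32 // 0 if the blob is stored uncompressed
}

// BackupIndex keeps full entries only for tree blobs (whose contents must be
// read to walk the parent snapshot) and a bare membership set for data blobs
// (which only ever need an "is it already there?" answer during backup).
type BackupIndex struct {
	Trees map[ID]IndexEntry
	Data  map[ID]struct{}
}

// HasData answers the only question the backup path asks about data blobs.
func (idx *BackupIndex) HasData(id ID) bool {
	_, ok := idx.Data[id]
	return ok
}

func main() {
	idx := &BackupIndex{Trees: map[ID]IndexEntry{}, Data: map[ID]struct{}{}}

	var blob ID
	copy(blob[:], "example-data-blob-id")
	idx.Data[blob] = struct{}{}

	// Since data blobs usually dominate, dropping their pack/offset/length
	// information shrinks the in-memory index considerably.
	fmt.Println(idx.HasData(blob)) // true
}
```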