Restic prune runtime: cannot allocate memory

First off, v12 is fantastic, and --max-repack-size has proved to be exactly what was needed for a pair of instances.

The one I’m currently trying to remediate is the instance where I had attempted to resolve the problem of the restic cache filling up root, and I’ve only made it worse. Its sister instance is pruning as expected with --max-repack-size limited to 5g; trying to do the same on this one, however, results in a “cannot allocate memory” error after getting to 6/21 snapshots.

I’m not sure how to proceed from here. I know this thread from 2018 had similar issues, but I only see “out of memory” when trying to run a check, which also faults around snapshot 6 or 7.
I’d like to avoid restarting from nothing if possible.

OS: Oracle Linux 6.10
Free space on root: 20g (cache removed)
Free memory: 8g (average)

The command run and the full stack trace are posted separately, so if this proves useless it won’t crowd the topic.

Seems you are running out of memory. prune still needs to keep the complete index in memory plus the list of used blobs. Your prune run aborted while collecting this list of used blobs.

Did you try setting the environment variable GOGC to make the Go garbage collector run more aggressively, e.g. GOGC=20 restic ....?

However, if playing around with the garbage collector options does not help, your only possibilities are to prune from a machine with more RAM or forget more snapshots before pruning…
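
For the latter, a forget run with a tighter keep policy before the prune would do the job; the policy below is just a placeholder, use whatever fits your retention needs:

restic -r $REPO forget --keep-last 10
GOGC=20 restic -r $REPO prune --max-repack-size 5g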

Ah, there could be another issue you should rule out before you start forgetting snapshots: in edge cases, the index can be too large (containing non-existent entries) and therefore need too much memory. To rule this out, you could try running a rebuild-index before pruning.

I have not tried altering the value of GOGC. To be honest, I could not find much detail about it outside of the other ticket linked, and I had already attempted so much troubleshooting on this particular instance that I was concerned about making the situation worse.

I will try a rebuild-index, set GOGC to 20, and see what results.

The rebuild ran successfully, and setting the Go garbage collection to 20 did the trick for prune!

I’m pretty green in this stuff still; if there’s a good doc on GOGC I’ll just take a link and explore myself, as I know this could be a lot to ask…
What is the default value for GOGC, and what would be the maximum?
Is that stored in a file somewhere, or is it only set in the environment?
20 is mentioned as the aggressive collection value and a general starting point for troubleshooting. I’m guessing I can let my scripting resume without it? Or is there another value I need to look up and update before letting prune run as it did about a month back?

# restic rebuild-index
repository c4eba27b opened successfully, password is correct
loading indexes...
getting pack files to read...
rebuilding index
[0:11] 100.00%  48859 / 48859 packs processed
deleting obsolete index files
[0:00] 100.00%  30 / 30 files deleted
done

Hurray!

# GOGC=20 /usr/local/bin/restic -r $REPO prune --max-repack-size 5g
repository c4eba27b opened successfully, password is correct
loading indexes...
loading all snapshots...
finding data that is still in use for 21 snapshots
[3:02] 19.05%  4 / 21 snapshots
[3:40] 23.81%  5 / 21 snapshots
[3:43] 28.57%  6 / 21 snapshots
[4:22] 33.33%  7 / 21 snapshots
[5:40] 38.10%  8 / 21 snapshots    #passed the problem hurdle!!!
[6:34] 42.86%  9 / 21 snapshots
[7:09] 47.62%  10 / 21 snapshots
[8:07] 52.38%  11 / 21 snapshots
[8:23] 57.14%  12 / 21 snapshots
[8:47] 61.90%  13 / 21 snapshots
[9:44] 66.67%  14 / 21 snapshots
[9:56] 71.43%  15 / 21 snapshots
[10:00] 76.19%  16 / 21 snapshots
[10:09] 80.95%  17 / 21 snapshots
[10:47] 85.71%  18 / 21 snapshots
[10:56] 90.48%  19 / 21 snapshots
[11:07] 95.24%  20 / 21 snapshots
[11:29] 100.00%  21 / 21 snapshots
[11:29] 100.00%  21 / 21 snapshots
searching used packs...
collecting packs for deletion and repacking
[0:19] 100.00%  48859 / 48859 packs processed

to repack:        41511 blobs / 1.676 GiB
this removes      11459 blobs / 1.562 GiB
to delete:       135923 blobs / 110.460 GiB
total prune:     147382 blobs / 112.023 GiB
remaining:      2472846 blobs / 190.128 GiB
unused size after prune: 9.504 GiB (5.00% of remaining size)

repacking packs
[1:09] 100.00%  97 / 97 packs repacked
rebuilding index
[0:13] 100.00%  33992 / 33992 packs processed
deleting obsolete index files
[0:00] 100.00%  53 / 53 files deleted
removing 14896 old packs
[0:39] 100.00%  14896 / 14896 files deleted
done

From the official Go documentation:

The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely.
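
So the default is 100 and there is no hard maximum; a higher value only means the collector runs less often (and restic uses more memory before collecting). It is not stored in any file, it only exists as an environment variable of the process you start. If you want your scripted prune to keep using it, just set it in the script, e.g. (a sketch using the same command as above; 20 is only a starting point):

export GOGC=20    # default is 100; lower values collect garbage more aggressively
/usr/local/bin/restic -r $REPO prune --max-repack-size 5g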

I also opened an issue on GitHub; maybe in the future restic can handle this GC tuning without the need to tweak parameters by hand.

@Lenski Even though your problem was solved using a lower GOGC value, be warned: This still means that restic’s memory requirements run close to your available memory. You could run into trouble if you have more data saved in your repository with the same memory available…
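
If you want to see how close you actually get, one way (assuming GNU time is installed) is to wrap the prune in /usr/bin/time -v and check the “Maximum resident set size” it reports at the end:

export GOGC=20
/usr/bin/time -v /usr/local/bin/restic -r $REPO prune --max-repack-size 5g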

What backend are you using? 11 minutes to find data in 21 snapshots and 300GiB is a lot. For me it takes 20 seconds to scan 1TiB with 150 snapshots on local HDD.

11 minutes is quite a lot, but this should not depend too much on the backend, at least if you are using a cache which contains all the needed data. The main effort here is traversing all trees for all snapshots, so this mainly depends on the directory structure of the backed-up data as well as the access speed of the cached data and CPU power (mainly for the encryption)…
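
To check whether a cache is actually being used, where it lives and how big it is, something like this works (on Linux the default location is ~/.cache/restic unless you point restic elsewhere):

restic cache              # lists local cache directories and their sizes
du -sh ~/.cache/restic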

I’ll need to do a bit more testing to see if my script is able to execute correctly on the sister instance; I suspect there is still something off, but that’d be another topic.

Alex, I’m not sure I’m following your previous statement; memory usage by restic is still a grey area for me. How could I check this or try to correct it?

You could run into trouble if you have more data saved in your repository with the same memory available…

The backend is an S3 bucket in a separate network from the instance; latency plus the quantity of files (since it’s database binaries and tables that are being backed up) are contributing.
As for the cache, I have been considering that a separate issue. Currently I cannot retain the full cache, as it’s causing the root directory on both the primary and sister instances to fill up. If I understand the use case correctly, --max-cache-size should help keep the cache to the maximum specified at the time of running prune?
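
For reference, the cache location itself can also be moved off the root filesystem, either with the global --cache-dir option or the RESTIC_CACHE_DIR environment variable, and stale caches can be removed with restic cache --cleanup. A sketch, assuming a larger mount such as /data exists (the path is only an example):

export RESTIC_CACHE_DIR=/data/restic-cache    # example path, anywhere with free space works
/usr/local/bin/restic -r $REPO prune --max-repack-size 5g
restic cache --cleanup                        # remove old/unused cache directories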