Restic prune “runtime: out of memory”

Hello there,

I am new to restic and I'm having trouble with my restic backups. Everything was running fine until recently, when the pods suddenly started getting OOM killed. Initially, increasing the memory size helped, but the next time we triggered a backup it failed again with the same error. We then removed both the memory request and the limit, but it still ends with an error while trying to prune.

restic version: 0.16.0
rclone version: 1.64.2

The place where I get the error in my script is the following step of the backup:

restic forget --keep-within "${RESTIC_KEEP_WITHIN}"

restic prune --max-unused 0

loading indexes...
loading all snapshots...
finding data that is still in use for 2664 snapshots
runtime: out of memory: cannot allocate 79691776-byte block (4000055296 in use)
fatal error: out of memory

goroutine 153 [running]:
runtime.throw({0x8d7c43b, 0xd})
	/usr/local/go/src/runtime/panic.go:1077 +0x4d fp=0x9f75a6c sp=0x9f75a58 pc=0x808467d
runtime.(*mcache).allocLarge(0xf7f94c40, 0x4b70000, 0x1)
	/usr/local/go/src/runtime/mcache.go:236 +0x1c7 fp=0x9f75a98 sp=0x9f75a6c pc=0x805fef7
runtime.mallocgc(0x4b70000, 0x8c5ea00, 0x1)
	/usr/local/go/src/runtime/malloc.go:1123 +0x5d7 fp=0x9f75aec sp=0x9f75a98 pc=0x8057997
runtime.newarray(0x8c5ea00, 0x44000)
	/usr/local/go/src/runtime/malloc.go:1346 +0x44 fp=0x9f75b00 sp=0x9f75aec pc=0x8057ee4
runtime.makeBucketArray(0x8d00040, 0x12, 0x0)
	/usr/local/go/src/runtime/map.go:364 +0x15b fp=0x9f75b18 sp=0x9f75b00 pc=0x8058c1b
runtime.hashGrow(0x8d00040, 0xa092400)
	/usr/local/go/src/runtime/map.go:1068 +0x78 fp=0x9f75b3c sp=0x9f75b18 pc=0x805a408
runtime.mapassign(0x8d00040, 0xa092400, 0x9f75ba4)
	/usr/local/go/src/runtime/map.go:659 +0xd6 fp=0x9f75b8c sp=0x9f75b3c pc=0x8059346
github.com/restic/restic/internal/restic.CountedBlobSet.Insert(0xa092400, {{0x93, 0x4d, 0xe5, 0x41, 0x76, 0x11, 0xfb, 0x27, 0xee, ...}, ...})
	/restic/internal/restic/counted_blob_set.go:28 +0x33 fp=0x9f75ba0 sp=0x9f75b8c pc=0x840e643
github.com/restic/restic/internal/restic.FindUsedBlobs.func1({0x93, 0x4d, 0xe5, 0x41, 0x76, 0x11, 0xfb, 0x27, 0xee, 0xc8, ...})
	/restic/internal/restic/find.go:35 +0xb6 fp=0x9f75c08 sp=0x9f75ba0 pc=0x840fad6
github.com/restic/restic/internal/restic.filterTrees({0x8f340c0, 0x1b5ca8d0}, {0xef0ad340, 0x9ca2d80}, {0x1c284000, 0xa68, 0xe00}, 0x1b5b5f40, 0x1c584000, 0x1b5b5f80, ...)
	/restic/internal/restic/tree_stream.go:72 +0x2a8 fp=0x9f75f44 sp=0x9f75c08 pc=0x8421028
github.com/restic/restic/internal/restic.StreamTrees.func3()
	/restic/internal/restic/tree_stream.go:192 +0x15a fp=0x9f75fc0 sp=0x9f75f44 pc=0x842208a
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/home/build/go/pkg/mod/golang.org/x/sync@v0.4.0/errgroup/errgroup.go:75 +0x60 fp=0x9f75ff0 sp=0x9f75fc0 pc=0x83dbe50
runtime.goexit()
	/usr/local/go/src/runtime/asm_386.s:1363 +0x1 fp=0x9f75ff4 sp=0x9f75ff0 pc=0x80b8061
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1
	/home/build/go/pkg/mod/golang.org/x/sync@v0.4.0/errgroup/errgroup.go:72 +0x99

Can someone help me resolve this issue? In addition, we have already set GOGC=20, but with no luck.

Usually I run this as a shell script inside a cronjob.

Now, to test the script, I tried to run the steps manually with the commands below:

restic forget --keep-within 7d --prune -v
restic prune --max-repack-size 0


loading indexes...
[1:58] 100.00% 183 / 183 index files loaded
loading all snapshots...
finding data that is still in use for 278 snapshots
runtime: out of memory: cannot allocate 633339904-byte block (3320643584 in use)
fatal error: out of memory

Or can I skip restic prune --max-repack-size 0 and only use restic forget --keep-within "${RESTIC_KEEP_WITHIN}" --prune?

Your repo contains a lot of snapshots, which needs some RAM to process. If you do not have enough RAM, try to limit the number of snapshots you keep.

But to prune this repo you have to run restic with enough RAM.

The problem here is that I am running this as a cronjob, which runs as a pod, and we have removed both the request and the limit - which means it could occupy the whole node - but it still fails.

And whenever I try it manually with
restic forget --keep-within "${RESTIC_KEEP_WITHIN}" --prune -v
it just gives output like

ID        Time                 Host                                Tags        Reasons        Paths
---------------------------------------------------------------------------------------------------
f2d99461  2023-10-06 00:14:42  backup-28275840-dzglw              last snapshot  /gcs
---------------------------------------------------------------------------------------------------
1 snapshots

but it does not delete any snapshots, so when I then try

restic prune --max-repack-size 0

loading indexes...
[2:15] 100.00%  434 / 434 index files loaded
loading all snapshots...
finding data that is still in use for 278 snapshots
command terminated with exit code 137

Connect to this repo from another machine with enough RAM to prune it, and then limit the number of snapshots you keep. Do not forget to prune frequently - if you accumulate tons of old data to prune, it needs much more RAM.
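
Roughly, that could look like this - the repository URL and password file are placeholders for whatever your setup uses (the rclone backend is just an assumption based on the versions you listed):

```
# On a machine with enough RAM; adjust repository and credentials to your setup
export RESTIC_REPOSITORY="rclone:myremote:restic-repo"   # hypothetical remote name
export RESTIC_PASSWORD_FILE="/path/to/password-file"

# Run the heavy operation here instead of in the pods
restic prune
```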

Sure, I can do that from my local machine, which has 32 GB of RAM.

Is there any other way we can run this prune that will be effective? In integration I only have around 278 snapshots, but in prod I have somewhere around 2884 snapshots, so it will be tedious if this has to be done in prod. And I run the cronjob every three hours, which has to do the backup and then prune the old data. So I think that even though the job completed, prune did not work as expected?

What I would do is run prune once a day, e.g. at night - definitely not with every backup. Or do not prune from thin clients at all - use a more capable machine to do the hard job. This way your pods only run backups.
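
As a rough sketch in cron terms (schedules and script names are only placeholders):

```
# On the pods / thin clients: backups only, every three hours
0 */3 * * *  /usr/local/bin/restic-backup.sh        # only runs "restic backup ..."

# On one machine with enough RAM: cleanup once per night
30 2 * * *   /usr/local/bin/restic-forget-prune.sh  # runs "restic forget ... --prune" / "restic prune"
```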

For sure there is more than one way to approach it - it is all about finding the right balance, I think.

For a very light system I would not use restic at all - rsync the data to a backup server and run restic from there, for example.
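
A minimal sketch of that pattern, with made-up host and path names:

```
# On the light client: just mirror the data to the backup server
rsync -a --delete /data/ backupserver:/srv/staging/client1/

# On the backup server: restic runs locally, so the client spends no RAM on it
restic -r /srv/restic-repo backup /srv/staging/client1
```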

restic will always need RAM - some future PRs will optimize things here and there, but the general problem will always exist: a lot of snapshots and a lot of files mean a lot of RAM is needed. Every system has limits.

Hmm, but I find this strange - this has been running in my infra for more than a year without any issues, and only now do I have this memory issue.

# -----------------------------------------------
# --  Removing old backups
# -----------------------------------------------

echo "starting pruning old backups"

restic forget --keep-within "${RESTIC_KEEP_WITHIN}" --prune -v

# speed up the prune process to keep 15% repo size and avoid additional API calls for repacking every time
restic prune --max-repack-size 0

echo "finished pruning old backups"

So, in this code, if I have to remove restic prune --max-repack-size 0 from the cronjob and run it separately on another VM, it would have a cost impact, right?

The repo was growing and restic has been using more and more memory… so not a big surprise.

I do not know the details of your forget policy, but you have accumulated 2664 snapshots…

This would be one approach, yes.

Let's see, maybe others will have more clever ideas.

OK, one last thing:

restic forget --keep-last $RESTIC_KEEP_LAST --prune

Won't this forget and prune as per the documentation? If this itself does the prune, then I think I don't need to use restic prune --max-repack-size 0, which would probably resolve my issue?

Yes you are right.

I doubt it. Why do you think it will use less memory when doing the same job? But the best way is to try.

Yes, the backup now completes after removing

restic prune --max-unused 15%

It worked. But I am quite unsure whether the prune is really done through
restic forget --keep-last $RESTIC_KEEP_LAST --prune

You can see it in its output…

I use it this way all the time. And I can see in my logs that it runs as intended.

No - usually when we run restic prune we get something like:

loading indexes...
[0:36] 100.00%  435 / 435 index files loaded
loading all snapshots...
finding data that is still in use for 279 snapshots
[14:26] 100.00%  279 / 279 snapshots
searching used packs...
collecting packs for deletion and repacking
[0:01] 100.00%  2998 / 2998 packs processed

to repack:          6308 blobs / 2.192 MiB
this removes:          0 blobs / 0 B
to delete:             0 blobs / 0 B
total prune:           0 blobs / 0 B
remaining:      27390259 blobs / 42.143 GiB
unused size after prune: 0 B (0.00% of remaining size)

repacking packs
[0:05] 100.00%  11 / 11 packs repacked
rebuilding index
[1:49] 100.00%  2988 / 2988 packs processed
deleting obsolete index files
[0:04] 100.00%  435 / 435 files deleted
removing 11 old packs
[0:00] 100.00%  11 / 11 files deleted
done

but now, after removing that line restic prune --max-unused 15%, I don't get those lines above - it just lists the snapshots.

Because this prune will only run when forget actually removes some snapshots.

As per docs:

The latter can be automated with the --prune option of forget, which runs prune automatically if any snapshots were actually removed.

OK, so do you suggest still running restic prune --max-unused 15% on the VM / a separate node, or is that not required now?

Sorry, I just wanted to confirm this prune behaviour.

I do not know your system and you provide scant details, so I think there is a bit of a misunderstanding. I do not suggest anything, as I have no data to base it on. But the general pattern is always the same.

What I am saying is:

  1. Running prune is a memory-heavy operation - the more snapshots and files you have, the more RAM will be needed. There is no magic here. How many MB/GB? You have to monitor your system and get the numbers yourself.

  2. One way to lower memory pressure is to limit your repo size - if you get OOM with 3000 snapshots, then maybe it is time to consider keeping fewer? Or add as much RAM as is needed.

  3. If you have to keep a lot of data in your repo, then another angle is to let your low-resource clients only run backups and shift prune to a more powerful device.

How you balance these - you have to decide. Run some tests. Track memory usage.
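
One simple way to get such numbers is GNU time, if it is available where you run restic - the "Maximum resident set size" line in its output is the peak RAM the process needed:

```
# Note: this is /usr/bin/time (GNU time), not the shell built-in "time"
/usr/bin/time -v restic prune
```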


Just a side note: the number of snapshots is mostly irrelevant for memory usage! The difference between a few and a million snapshots would be just a couple of hundred MiB with respect to memory usage.

The most important factor is the number of blobs - and with large repositories, you usually have many blobs. So, typically, large repositories are more problematic. (To be perfectly correct: repositories with lots of very small files being backed up will also be problematic, but this is an edge case.)

Of course, if you have more snapshots you tend to have more data and therefore more blobs. But if you run backup each minute without a change in the backup source, you’ll get lots of snapshots which all reference the same data - this is absolutely no problem for prune.

TL;DR: small repositories with lots of snapshots are usually unproblematic, large repositories with few snapshots already require much memory.


restic forget [...] --prune only runs prune if it deleted some snapshots. To always run restic prune, just call that command explicitly. The --max-unused and --max-repack-size options don’t have any effect on the OOM crash shown in the initial message.
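
In script form the difference looks roughly like this:

```
# Only runs prune if forget actually removed snapshots:
restic forget --keep-within "${RESTIC_KEEP_WITHIN}" --prune

# Always runs prune, regardless of whether forget removed anything:
restic forget --keep-within "${RESTIC_KEEP_WITHIN}"
restic prune
```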

As a rough estimate, you can expect about 1GB required memory per 6 million unique files and per 6TB data stored in the repository. Each unique version of a file counts as a file; if a file stays the same all the time, then it only counts once. There are a few exceptions that require additional memory: folders that directly contain millions of files (subfolders are not relevant here!) or multi-TB files.
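
If you want the numbers behind that estimate for your repository, restic stats can provide them (the exact output fields may vary slightly between versions):

```
# Size of the deduplicated data actually stored in the repository
restic stats --mode raw-data

# Number of unique file contents referenced by all snapshots
restic stats --mode files-by-contents
```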

As you only seem to keep snapshots for seven days, the best way forward is to split the repository into multiple smaller ones.
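
A sketch of what splitting could look like, with hypothetical repository names and subpaths - each smaller repository then needs far less memory to prune:

```
# Back up different data sets into separate repositories
restic -r rclone:myremote:restic-app1 backup /gcs/app1
restic -r rclone:myremote:restic-app2 backup /gcs/app2

# Forget and prune each repository on its own
restic -r rclone:myremote:restic-app1 forget --keep-within 7d --prune
restic -r rclone:myremote:restic-app2 forget --keep-within 7d --prune
```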