Memory usage (OOM etc) on a big repository

I have been running a daily restic backup of a big repository (around 1.2TB) for about a year. It first ran on a VM and was recently moved to Kubernetes.

In both cases, RAM usage is very high while loading the index.
It can take up the whole 30GB of RAM of a VM, and the process gets OOM-killed (return code 137) inside a container limited to 12GB of RAM.

In one of the threads I read that restic’s memory consumption should be roughly proportional to the index size, which in this case is around 6-7GB (still much less than what restic actually consumes). I tried using GOGC=10, but it didn’t really help.
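For reference, this is roughly how I set it (the repository path and data path are just examples, not my actual setup):

```shell
# GOGC=10 makes the Go runtime collect garbage much more aggressively,
# trading CPU time for lower peak memory usage. Set it just for the
# restic invocation:
GOGC=10 restic backup --no-scan /path/to/data
```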

Is there anything I can do to limit the memory usage?

You have neither specified which restic version you’re using nor which commands are run when the problem occurs. Without that information it’s not possible to give any proper advice.

To get an idea of the expected memory requirements for the repository, please run restic stats --mode debug.

A 6-7GB index is excessively large for a 1.2TB repository. That roughly means that there are 150 million blobs in the repository (assuming repo format version 2). Has the repository ever been pruned?

The rule of thumb is that 1GB of RAM is needed for every 7 million blobs. So no matter which options you specify, at that repository size the memory usage cannot go much below 22GB. (That is with the latest restic version; older versions can use much more memory.)
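As a quick back-of-the-envelope check, using the ~150 million blob estimate from above (integer division rounds down, so the real figure lands between 21 and 22GB):

```shell
# ~150 million blobs at ~7 million blobs per GB of RAM:
blobs_millions=150
per_gb=7
echo "$(( blobs_millions / per_gb )) GB"   # rough lower bound on RAM needed
```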

Hello!
Thank you for your answer.

Regarding the restic version:

  • The repository was initialized with 15.X,
  • After the release of 16.X, the daily backups were upgraded to use it,
  • Recently I upgraded to 17.3.
    The described issues were visible in that version as well.

The command concerned is restic backup --no-scan

I couldn’t do the prune, as it was saturating my storage. Instead, I used restic forget to decrease the number of historic snapshots from 390 to 90. This decreased the size of the index to around 1.3GB. The memory usage for backup decreased to ~3.5GB.
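The forget invocation looked roughly like this (the retention policy flag is illustrative, not my exact setup):

```shell
# Keep only the 90 most recent snapshots; older ones are forgotten.
# Note: forget only removes the snapshot references; a later prune is
# still needed to actually reclaim the space they occupied.
restic forget --keep-last 90
```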

Regarding restic stats, please see the result *after* the cleanup below. The biggest difference is visible in the last section (tree): before the cleanup it was around 150 million blobs (110GB).

I think at this point I would have just 2 questions:

  • Does the repo after the cleanup look healthy to you, or would you still investigate?
  • Is there any chance restic could support distributed operation in the future (so that it could run on several smaller nodes instead of a single one)?
Collecting size statistics

File Type: key
Count: 1
Total Size: 463 B
Size            Count
---------------------
100 - 999 Byte  1
---------------------

File Type: lock
Count: 1
Total Size: 171 B
Size            Count
---------------------
100 - 999 Byte  1
---------------------

File Type: index
Count: 279
Total Size: 1.339 GiB
Size                      Count
-------------------------------
      10000 - 99999 Byte  1
    100000 - 999999 Byte  2
  1000000 - 9999999 Byte  252
10000000 - 99999999 Byte  24
-------------------------------

File Type: data
Count: 45578
Total Size: 802.925 GiB
Size                      Count
-------------------------------
      10000 - 99999 Byte  1
    100000 - 999999 Byte  42
  1000000 - 9999999 Byte  104
10000000 - 99999999 Byte  45431
-------------------------------

Blob Type: data
Count: 3849448
Total Size: 781.835 GiB
Size                    Count
-------------------------------
          10 - 99 Byte  59
        100 - 999 Byte  19405
      1000 - 9999 Byte  466929
    10000 - 99999 Byte  1808271
  100000 - 999999 Byte  1436006
1000000 - 9999999 Byte  118778
-------------------------------


Blob Type: tree
Count: 27374818
Total Size: 19.897 GiB
Size                  Count
------------------------------
        10 - 99 Byte  1
      100 - 999 Byte  22308880
    1000 - 9999 Byte  4968299
  10000 - 99999 Byte  97549
100000 - 999999 Byte  89
------------------------------


That looks much more reasonable, although in my opinion the number of tree blobs is still rather high compared to the number of data blobs. Are there by any chance filesystem snapshots involved in the backup creation process?