Huge data backup

Hey folks

I have 2 x 528TB NAS devices that I’m filling with video. We need to keep the data versioned, just in case. Originally I was using rsync, but that came with its own headaches, such as file renames and moves not being tracked.

Everything is on the local network over 10Gb, running TrueNAS Scale (Debian) with ZFS as the file system. We’ve disabled atime, ZFS compression, and dedupe, as we’re leaving compression and dedupe to Restic.

I’m currently at 100TB stored, and the final job is running now: a 158TB directory structure that will finish in approximately 93 hours. We’re rocking around 1TB an hour for the most part.

Now, I’m this far in before checking with the community (and devs) whether this will hold up well to daily backups. I’m going to throw out a list of questions, and any tips and tricks I should definitely be applying at this scale of backup would be much appreciated:

  1. Can anyone see any issues with a local network (different sites, 10Gb backbone) Restic backup on this amount of data?
  2. I’ve broken the backup into 8 jobs, one for each top-level directory. Should I just run this from the top rather than having it as separate jobs? Example: /mnt/data/videos/A-D, /mnt/data/videos/E-H, or go with just /mnt/data/videos/ and be damned?
  3. How often should I prune/forget and run checks on this data? Deletions should be few and far between, but renames and additions will happen often. Right now I’m planning to run it monthly, on the first day of the month, as part of the script.

I think that’s about it for the advice request, but I’m really open to people throwing their thoughts in on how I should be maintaining this job.

Just to note: should the absolute worst happen - I lose the second NAS and the main one burns to the ground - we could spend months pulling all of this back from LTO tape, although I really, really, REALLY wouldn’t want to.

Cheers.

Any reason not to use ZFS snapshot replication? It will be light years faster than restic, you can maintain versioning, and in case of a primary NAS failure you could simply switch to your backup NAS as a failover. I do not see what benefits restic brings in your situation.

Hi @kapitainsky - the repo will also have files sent to it from one other server ad hoc, so I can benefit from the dedupe. I should probably have included that in the original post.

I’ve found ZFS to be fickle; you need to understand a lot about it to make 100% sure you don’t screw it up. People who work with it daily seem to love it. I don’t work with it daily, and I’ve been burned by a bad configuration from a vendor (all disks in a single vdev; the pool strained to the point of being unusable once dedupe and compression were enabled). I’ll be sticking with Restic unless someone says it is bad for large data backups.

Fair enough :) ZFS configuration is indeed not something you can master after one evening of reading some wiki.

So back to restic.

As it is all on local disks/network, I would definitely increase the pack size to the maximum allowed, 128MiB.
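
As a rough illustration of what that looks like on the command line - the repository path and source directory here are just borrowed from elsewhere in this thread, and the value is given in MiB:

restic -r /mnt/ResticMasters backup /mnt/data/videos/A-D --pack-size 128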

I do not think it makes much difference to have separate jobs if all goes to one repo anyway. Extra complexity without clear benefit IMO.

I would run a check after every prune. How often to run it? There is no good answer really. As you mention that changes are rather slow, I think monthly can be enough.

Are you running it against the live filesystem, or do you run restic backup against snapshots?

It’s running on the file system.

Can the pack size be set after the initial backup, with the repo slowly adjusting over time?

Yes you can set it later.

Generally, whether a certain speed or scale is workable depends so much on your CPU speed, storage read/write speeds, and network speeds at both the read and write ends - multiplied by how many bytes and files you’re changing per backup - that it’s difficult for anyone to tell you whether there will be performance problems. If you’re appending to large files, you’ll need high storage read and CPU crypto speeds. If you’re adding a lot of new data, you’ll need good performance across the board. A small amount of changes to new or smaller files will make for a quick backup.

Your best bet is to get as close as you can to your intended set-up, then add data to the source files in as close a manner as your peak workload would, multiply that by five or even ten, and benchmark the backups. Ask yourself how the workload may change in six months, one year, two years, five years.

Having eight separate jobs can have advantages. If you are backing up to eight different repositories, then you can check and/or prune them in parts, which will be faster. You will potentially lose some deduplication, but I’m guessing that with video it won’t hurt much. If you suffer damage to a repository - which is unlikely but possible - then having only one eighth of the data to analyse, copy, and repair may be advantageous.

As for checks and pruning - how long is a piece of string? :smiley: I’d aim for a quick check (restic check) once a day. Then I tend either to perform a full check (restic check --read-data) and, if all is OK, a forget/prune once a week or month, or, for larger repos, to run a subset check (restic check --read-data-subset x/y) once a week and, after the last subset has checked OK, a forget/prune. All backup, check and forget/prune tasks are monitored for failures.
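
To make that concrete, here is a minimal sketch of such a rotation - the repository path, subset count and retention values below are placeholders, not recommendations:

restic -r /mnt/ResticMasters check                          # quick structural check, e.g. daily
restic -r /mnt/ResticMasters check --read-data-subset 3/8   # rotate the subset number each week
restic -r /mnt/ResticMasters forget --keep-daily 7 --keep-weekly 5 --keep-monthly 12 --prune   # once all subsets have checked OK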

Renames only add metadata changes to a repo, so even frequent renames aren’t necessarily a reason to increase pruning frequency - unless moving data means it goes from “repo a” to “repo b”, if you’re using multiple repos.

You might want to consider the tuning parameters as mentioned elsewhere, particularly the read concurrency one if reading from SSD.

Make sure to pass the pack size to every restic command; the size is currently not stored in the repository.
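
In practice that means repeating the flag on backup and prune runs alike, and - if reading from fast storage - possibly raising the read concurrency mentioned above. A sketch rather than tuned values, with paths taken from this thread and the concurrency picked arbitrarily:

restic -r /mnt/ResticMasters backup /mnt/data/videos/A-D --pack-size 128 --read-concurrency 4
restic -r /mnt/ResticMasters prune --pack-size 128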

Regarding your questions:

  1. I’m not aware of many ~100TB restic backups. It will probably work, although you might encounter a few scaling issues. As a rule of thumb (likely an overestimation), you should plan on 1GB of RAM for every 7 million unique files (across all snapshots) and 1GB of RAM for every 7TB of backed-up data; see the rough worked example after this list.
  2. Do separate jobs mean separate repositories or a single large one? I’d prefer multiple smaller repositories, as that brings each repository’s size closer to the usual territory.
  3. If data is only rarely removed, then it’s sufficient to run prune every few weeks (or at even longer intervals). I’d try to keep the number of snapshots at around a few hundred rather than thousands.
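
To put the rule of thumb from point 1 into rough numbers for the dataset in this thread (around 250-300TB of video in well under a million files, going by the figures posted here): the data term dominates at roughly 300 / 7 ≈ 43GB of RAM, while the file-count term adds well under 1GB, so plan on the order of 40-50GB for restic alone - before the ZFS ARC and anything else on the box takes its share. These are only ballpark figures derived from the stated heuristic, not measurements.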

Performance-wise, the already-mentioned --pack-size 128 is likely pretty important. Besides that, it might be worthwhile to tune the number of backend connections (see Tuning Backup Parameters — restic 0.16.3 documentation). Restore performance in particular could benefit from that.
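
The exact option name depends on the backend, so take this purely as an illustration of the -o syntax from that documentation page; the host, path, and connection count below are made up, and a repository reached over SFTP is assumed:

restic -r sftp:backup@nas2:/restic-repo backup /mnt/data/videos --pack-size 128 -o sftp.connections=8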

I finally hit a wall. I was merrily going along, things were looking great. Then restic got killed - by the kernel OOM killer, as the dmesg output below shows.

dmesg | grep restic

[2730979.532676] restic invoked oom-killer: gfp_mask=0x1101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
[2730979.544459] CPU: 0 PID: 803868 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2730980.599237] [ 803650]     0 803650  7838154  7525188 60526592        0             0 restic
[2731072.248432] [ 803650]     0 803650  7838154  7525188 60526592        0             0 restic
[2731093.814805] [ 803650]     0 803650  7838154  7525188 60526592        0             0 restic
[2731112.187822] restic invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[2731112.196958] CPU: 0 PID: 803826 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2731113.321536] [ 803650]     0 803650  7838154  7525188 60526592        0             0 restic
[2731117.221546] restic invoked oom-killer: gfp_mask=0x1101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
[2731117.233264] CPU: 47 PID: 803856 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2731118.289811] [ 803650]     0 803650  7838154  7525188 60526592        0             0 restic
[2731158.531994] restic invoked oom-killer: gfp_mask=0x1101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
[2731158.543393] CPU: 31 PID: 803851 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2731159.595040] [ 803650]     0 803650  7838154  7525836 60530688        0             0 restic
[2731215.304523] restic invoked oom-killer: gfp_mask=0x1101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
[2731215.315961] CPU: 12 PID: 803876 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2731216.386553] [ 803650]     0 803650  7838154  7528973 60555264        0             0 restic
[2731220.116242] restic invoked oom-killer: gfp_mask=0x1101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
[2731220.164900] CPU: 27 PID: 803858 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2731221.229296] [ 803650]     0 803650  7838154  7528973 60555264        0             0 restic
[2731221.311797] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-667.scope,task=restic,pid=803650,uid=0
[2731221.329443] Out of memory: Killed process 803650 (restic) total-vm:31352616kB, anon-rss:30115892kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:59136kB oom_score_adj:0
[2756882.714706] restic invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[2756882.725948] CPU: 21 PID: 3718362 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2756884.016171] [3718049]     0 3718049  7854234  7519722 60469248        0             0 restic
[2756885.453923] [3718049]     0 3718049  7854234  7543461 60657664        0             0 restic
[2756887.205885] [3718049]     0 3718049  7854234  7549928 60706816        0             0 restic
[2756888.614266] [3718049]     0 3718049  7870895  7564575 60825600        0             0 restic
[2756890.089872] [3718049]     0 3718049  7887556  7577605 60932096        0             0 restic
[2756892.269927] [3718049]     0 3718049  7904217  7588256 61014016        0             0 restic
[2756894.272389] [3718049]     0 3718049  7920878  7600559 61116416        0             0 restic
[2756895.547734] [3718049]     0 3718049  7920878  7603749 61140992        0             0 restic
[2756896.912822] [3718049]     0 3718049  7920878  7611071 61198336        0             0 restic
[2756896.939794] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=k3s.service,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-667.scope,task=restic,pid=3718049,uid=0
[2756896.957583] Out of memory: Killed process 3718049 (restic) total-vm:31683512kB, anon-rss:30444284kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:59764kB oom_score_adj:0
[2757173.163504] [3749517]     0 3749517  7886260  7580297 60940288        0             0 restic
[2757174.862268] [3749517]     0 3749517  7902921  7592510 61038592        0             0 restic
[2757176.225137] [3749517]     0 3749517  7902921  7599796 61095936        0             0 restic
[2757176.338460] [3749517]     0 3749517  7902921  7601520 61108224        0             0 restic
[2757178.928114] [3749517]     0 3749517  7919582  7606817 61153280        0             0 restic
[2757180.345755] [3749517]     0 3749517  7919582  7608723 61165568        0             0 restic
[2757180.400107] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=system-getty.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1028.scope,task=restic,pid=3749517,uid=0
[2757180.419027] Out of memory: Killed process 3749517 (restic) total-vm:31678328kB, anon-rss:30434892kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:59732kB oom_score_adj:0
[2757733.669394] restic invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[2757733.680692] CPU: 11 PID: 3820057 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2757734.967949] [3819781]     0 3819781  7789999  7481678 60186624        0             0 restic
[2757735.184484] restic invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[2757735.196752] CPU: 16 PID: 3820048 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2757736.437693] [3819781]     0 3819781  7856771  7528763 60588032        0             0 restic
[2757736.707249] restic invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[2757736.719348] CPU: 34 PID: 3820056 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2757737.820041] [3819781]     0 3819781  7856771  7531509 60608512        0             0 restic
[2757738.800012] restic invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[2757738.902911] CPU: 26 PID: 3819792 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2757738.904043] [3819781]     0 3819781  7856771  7545753 60715008        0             0 restic
[2757744.199128] [3819781]     0 3819781  7856771  7548498 60715008        0             0 restic
[2757746.613211] restic invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[2757746.624658] CPU: 15 PID: 3820054 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2757747.762992] [3819781]     0 3819781  7925031  7609508 61181952        0             0 restic
[2757749.083438] [3819781]     0 3819781  7925031  7612428 61206528        0             0 restic
[2757749.393704] restic invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[2757749.404863] CPU: 47 PID: 3820033 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2757750.562008] [3819781]     0 3819781  7925031  7612546 61206528        0             0 restic
[2757750.588775] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1028.scope,task=restic,pid=3819781,uid=0
[2757750.606519] Out of memory: Killed process 3819781 (restic) total-vm:31700124kB, anon-rss:30450184kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:59772kB oom_score_adj:0
[2757919.872139] restic invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[2757919.882430] CPU: 14 PID: 3857288 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2757921.154758] [3856964]     0 3856964  7890781  7565766 60887040        0             0 restic
[2757921.459804] [3856964]     0 3856964  7890845  7567872 60903424        0             0 restic
[2757923.641906] [3856964]     0 3856964  7890845  7574413 60960768        0             0 restic
[2757923.760797] restic invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[2757924.253283] CPU: 3 PID: 3857290 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2757925.483832] [3856964]     0 3856964  7890845  7582426 61009920        0             0 restic
[2757927.556632] [3856964]     0 3856964  7890845  7583551 61009920        0             0 restic
[2757928.304548] restic invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[2757928.315114] CPU: 13 PID: 3857290 Comm: restic Tainted: P           OE     5.15.131+truenas #1
[2757929.508293] [3856964]     0 3856964  7890845  7584233 61009920        0             0 restic
[2757931.127266] [3856964]     0 3856964  7890845  7584257 61009920        0             0 restic
[2757933.767222] [3856964]     0 3856964  7942028  7609158 61222912        0             0 restic
[2757935.030199] [3856964]     0 3856964  7942028  7612361 61247488        0             0 restic
[2757935.039206] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=k3s.service,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1028.scope,task=restic,pid=3856964,uid=0
[2757935.057050] Out of memory: Killed process 3856964 (restic) total-vm:31768112kB, anon-rss:30449444kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:59812kB oom_score_adj:0
[2757937.375263] oom_reaper: reaped process 3856964 (restic), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I’m now playing with “export GOGC=x” to see if it’ll help.
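
(For anyone following along: GOGC is the Go runtime’s garbage-collection target percentage, default 100; lower values make the collector run more aggressively, trading CPU time for a smaller heap.) Something along these lines, with the value picked more or less at random:

export GOGC=20
restic -r /mnt/ResticMasters backup /mnt/data/videos/A-D --pack-size 128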

Time to order MORE RAM!

How much RAM do you have at the moment? It might be interesting for others trying similarly sized backups.


48GB, but restic and the ZFS ARC are arguing over it. The repo hit 199TB before the OOM killer murdered the process. GOGC=1 has allowed me to run a check at least.

More RAM being ordered today. I’ll keep updating this thread so people can follow working with TrueNAS Scale/Debian and requirements for large data sets.


Based on @MichaelEischer’s post earlier, it makes sense.

A 200TB repo might be hitting a 40GB RAM requirement.

A 500TB repo will need about 80GB of RAM.

You should be OK with 128GB of RAM - and I do not think it is really a restic limitation. Yes, maybe things can be optimised here and there, but when working with a 500TB dataset nobody should expect to be fine with, for example, an 8GB RAM computer :)

It would be nice, when you are done, if you could share how long check or prune takes, etc. - and maybe how much RAM is used in real life.


Once I’ve got this first backup done and it’s working nicely, I’ll share whatever stats you good folks would like :slight_smile:

Sorry for the misleading quote :sweat_smile:. But please, please don’t store that much data in a single restic repository. This scale is completely untested, and if the index or something else in the repository gets damaged, it will likely take weeks before you are able to repair the repository to the point where you can restore data from it. (At 1GB/s - gigabytes, not bits! - it will take 6 days to read all the data once, and the actual throughput will probably be lower.)
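
For reference, the arithmetic behind that figure, assuming the repository ends up somewhere around 500TB: 500 TB ÷ 1 GB/s = 500,000 seconds ≈ 5.8 days, and that is before any overhead from decryption, verification, or less-than-sequential read patterns.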


If just the index gets damaged, IMO it shouldn’t take that long to repair it - to do so, only the pack headers need to be read.

With 128MiB per pack file, I would expect a pack file header of around 5-6 KiB. With a 100TiB repository this will be approx. 1 million pack files, so roughly 5-6 GiB of header data to read.

At 1GB/s this would take just a couple of seconds (which of course is not realistic, as we have to read many small parts, but still, a couple of minutes or a few hours should be enough). And this is the case where the index is completely rebuilt.


Just please don’t ask me to break it on purpose to test :rofl:

The backbone is 10Gb, so 1GB a second is fine between units - they’re even on the same switch stack.


True if you can guarantee that it is only the index. If you have to run restic check --read-data then you’re out of luck.


We’re done. It’s all backed up. Now to script it nicely and keep a good check on it. Did anyone want me to give over any info?

restic -r /mnt/ResticMasters stats    
enter password for repository: 
repository 4102d52f opened (version 2, compression level auto)
[7:57] 100.00%  2413 / 2413 index files loaded
scanning...
Stats in restore-size mode:
     Snapshots processed:  11
        Total File Count:  662065
              Total Size:  299.932 TiB

Yeah, that took a wee while to load the index files, but I’ve still got export GOGC=1 set until the RAM gets put in.

Next up is a complete rerun to see how long it takes to look at all the files and back up again. Once the RAM is in, I’ll do a prune and cross some fingers.


We now have 96GB of RAM in. I’ve removed the GOGC variable.

Running Restic now takes just over 4 minutes to read the index files over a 10Gb NFS file share. TrueNAS Scale shows 45GB of RAM used for Services, of which Restic accounts for all but 3GB.

To answer the question “Can Restic back up large datasets and still function fine?”: yes. However, you need the RAM to let it run and the connection speeds to make it even worthwhile.


Thank you for sharing. It was really interesting to read.

Please keep us posted when you prune or restore for the first time.
