Restic for 100TB

Fluppy · August 28, 2023, 1:25pm

I’ve recently migrated a production server from Synology to TrueNAS and am looking for a good way to have versioned backups offsite. I’ve been looking mainly into Restic and Borg and saw a few posts from a couple of years ago talking about Restic’s performance with large datasets. I’ll be running the backup task daily during off hours, probably in a jail directly on the TrueNAS box, which has fairly beefy specs (EPYC 7713P, 1 TB RAM, all SSD).

Can anyone comment on this?

rawtaz · August 28, 2023, 3:29pm

I’m guessing you need to be prepared for restic to use quite a bit of memory to keep track of 100 TB of backed-up data.

kapitainsky · August 28, 2023, 3:48pm

This is how I see it. Native ZFS deduplication requires something like 1 GiB of RAM for every TiB of storage space. Restic will be much less demanding, but its resource usage will still grow with repo size. How much? There is no simple formula.
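
To put the ZFS rule of thumb quoted above into numbers, here is a minimal sketch. It assumes the "100TB" in the thread title means decimal terabytes and simply applies the 1 GiB per TiB figure; the function name and constants are illustrative, not part of any ZFS tool.

```python
TIB = 2**40  # bytes in one tebibyte

def zfs_dedup_ram_gib(pool_bytes: float, gib_per_tib: float = 1.0) -> float:
    """RAM estimate in GiB from the '1 GiB of RAM per TiB of storage' rule of thumb."""
    return pool_bytes / TIB * gib_per_tib

if __name__ == "__main__":
    pool = 100e12  # assumption: 100 TB, decimal
    print(f"{pool / TIB:.1f} TiB -> about {zfs_dedup_ram_gib(pool):.0f} GiB of RAM")
```

Under these assumptions the estimate stays well below the 1 TB of RAM mentioned in the first post.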

If you have different types of data for which you do not expect deduplication to help much, split them into separate repos. The bigger your repo grows, the further you move into uncharted waters. People with big repos usually do not share much of their experience. This is the reality: it is an open source project, so there is no SLA and there are no marketing promises.

nicnab · August 28, 2023, 4:10pm

Maybe it’s worth contacting this guy from CERN and see if he’s still using restic. CERN usually has a bunch of TBs flying around.

ZeB · August 30, 2023, 9:50am

A backup of that size is somewhat a land of unknown adventure, because it is difficult to really test all its edge cases. You may want to consider/test Kopia, because there are user reports of backups of similar size.

MichaelEischer · September 3, 2023, 8:38pm

As a rule of thumb, I’d be careful with any backups that are larger than 10 million files / 10 TB. Given enough RAM (1 TB RAM is totally sufficient, even for a much larger backup), restic should be able to complete such a backup. The big question is then mainly whether the performance will be good enough (restic 0.16.0 should be a lot faster for that size than earlier versions). 100 million files / 100 TB (an order of magnitude more) is likely the maximum for a “still somewhat usable” repository size. Anything larger is likely just too much.

The most important question, however, is what restore duration is acceptable. Is it a day, a week or even longer? This is somewhat aggravated by the fact that backup/restore and repository maintenance (forget/prune) currently cannot run at the same time (although it’s always possible to just cancel a running prune operation). Restoring a 100 TB repository will require at least one day when continuously saturating a 10 Gbit Ethernet interface. That is, if the combination of host system + restic + backend is able to sustain that throughput.
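
The restore-time estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes decimal units (1 TB = 10^12 bytes, 10 Gbit/s = 10^10 bit/s) and ignores protocol overhead; the `efficiency` parameter is a hypothetical knob for modelling a link that is not fully saturated.

```python
def restore_hours(size_tb: float, link_gbit_s: float, efficiency: float = 1.0) -> float:
    """Hours needed to transfer size_tb terabytes over a link of link_gbit_s Gbit/s."""
    size_bits = size_tb * 1e12 * 8                    # TB -> bytes -> bits
    usable_bits_per_s = link_gbit_s * 1e9 * efficiency
    return size_bits / usable_bits_per_s / 3600       # seconds -> hours

if __name__ == "__main__":
    # Assumed efficiencies: full line rate, half, and a quarter of it.
    for eff in (1.0, 0.5, 0.25):
        print(f"{eff:.0%} of 10 Gbit/s: {restore_hours(100, 10, eff):.1f} h")
```

Even at full line rate the transfer takes most of a day, and the duration scales inversely with the throughput the whole chain can actually sustain.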

Eli6 · September 4, 2023, 9:29am

Another question is, what other backup program could do it? They all face the same issue.

You probably have to go to file system backup.

thiscantbeserious · September 11, 2023, 7:55am

Maybe my experience with around 6 TiB of data with a lot of small files (6.9 million files) on spinning rust can help (much of it is deduplicated, because with MergerFS I’m backing up both the individual drives and the share as a whole).

Edit: Forgot to mention using restic 0.16.0

I have set compression to auto (the default) and pack-size to 64 (MiB), and I’m using rest-server with --append-only as a really lightweight target on an old Synology NAS placed off-site (dual-core Atom, 1 GB RAM).

I previously tried to use Minio S3 with separate IAM rules for admin (basically for prune) and for the backup task (write only, no delete, with Object Locking and Versioning enabled). If the Atom chip weren’t i686, which meant I had to compile Minio myself (and that led to crashes), I would likely still go that route because it’s a lot more sophisticated.

I’m running one restic backup job per HDD in parallel - that’s something you can’t do with a solution like Borg. It only really applies if you use a solution like MergerFS or Unraid … so it’s not likely something you are looking at. However, you could separate the jobs, especially if you operate on a folder level - even on something like FreeNAS/TrueNAS. That would give you the advantage of faster snapshots (= milestones), because an initial backup in the TiB range as a single job would be massively painful …
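
A minimal sketch of the "one job per disk or folder" idea described above, assuming restic is on the PATH and the repository password is supplied through the environment (for example RESTIC_PASSWORD_FILE). The repository URL, the source paths and the helper names are hypothetical; only the `backup` command and its `--pack-size` and `--tag` options are taken from restic itself.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def backup_command(repo: str, source: str, pack_size_mib: int = 64) -> list[str]:
    """Build the restic command line for backing up one source directory."""
    return [
        "restic", "-r", repo, "backup",
        "--pack-size", str(pack_size_mib),
        "--tag", Path(source).name,  # tag the snapshot with the directory name
        source,
    ]

def run_parallel(repo: str, sources: list[str], max_jobs: int = 4) -> dict[str, int]:
    """Run one restic backup per source concurrently; map each source to its exit code."""
    def run(source: str) -> int:
        return subprocess.run(backup_command(repo, source)).returncode

    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return dict(zip(sources, pool.map(run, sources)))

if __name__ == "__main__":
    # Hypothetical rest-server URL and mount points.
    print(run_parallel("rest:http://nas.example:8000/", ["/mnt/disk1", "/mnt/disk2"]))
```

Each job produces its own snapshot, so a finished disk or folder is a usable milestone even if another job is interrupted.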

Stopping in the middle of a backup job isn’t one of restic’s strong points, so I would really move towards splitting the jobs for the amount of data you’re looking at.

rest-server is really performant and simple - and does the job without crashing so far.

I really recommend it too … especially if your target device isn’t that performant.

Some observations:

1. You need to experiment to find the right pack size and compression level for anything in the TiB range - otherwise you’ll regret it later on.
2. Step 1 needs to be done carefully before you start the final backup.
3. I would do all of that locally. In case you’re planning to back up off-site, only move off-site once you’re satisfied - that also enables you to do just incremental backups from then on.

# restic stats

repository 27aa4cc4 opened (version 2, compression level auto)
scanning...
Stats in restore-size mode:
     Snapshots processed:  4
        Total File Count:  6938069
              Total Size:  5.995 TiB

# restic stats --mode debug

repository 27aa4cc4 opened (version 2, compression level auto)
Collecting size statistics

File Type: key
Count: 1
Total Size: 445 B
Size            Count
---------------------
100 - 999 Byte  1
---------------------

File Type: lock
Count: 2
Total Size: 308 B
Size            Count
---------------------
100 - 999 Byte  2
---------------------

File Type: index
Count: 238
Total Size: 197.403 MiB
Size                    Count
-----------------------------
      1000 - 9999 Byte  1
    10000 - 99999 Byte  2
  100000 - 999999 Byte  197
1000000 - 9999999 Byte  38
-----------------------------

File Type: data
Count: 43529
Total Size: 2.697 TiB
Size                      Count
-------------------------------
    100000 - 999999 Byte  1
  1000000 - 9999999 Byte  1
10000000 - 99999999 Byte  43527
-------------------------------

Blob Type: data
Count: 3640242
Total Size: 2.696 TiB
Size                    Count
-------------------------------
          10 - 99 Byte  13852
        100 - 999 Byte  470915
      1000 - 9999 Byte  517770
    10000 - 99999 Byte  412740
  100000 - 999999 Byte  1044805
1000000 - 9999999 Byte  1180160
-------------------------------


Blob Type: tree
Count: 526361
Total Size: 511.236 MiB
Size                      Count
--------------------------------
            10 - 99 Byte  3
          100 - 999 Byte  479193
        1000 - 9999 Byte  43985
      10000 - 99999 Byte  2809
    100000 - 999999 Byte  345
  1000000 - 9999999 Byte  24
10000000 - 16777216 Byte  2
--------------------------------

So far I see no slowdowns, and my machines are a LOT less powerful than what you’re operating on - on both ends … then again, I only have a fraction of that amount of data here - and I’m not using an SSD cache of any sort, or RAID-Z/RAID, to speed up my spinning rust here!
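
A few ratios can be derived directly from the `restic stats` output above. The sketch below only re-uses the numbers printed there; the variable names are illustrative. Note that the restore-size figure covers all 4 snapshots, so the ratio mixes deduplication across snapshots with compression.

```python
TIB = 2**40
MIB = 2**20
KIB = 2**10

# Numbers copied from the stats output above.
restore_size = 5.995 * TIB      # "Total Size" in restore-size mode
pack_bytes = 2.697 * TIB        # File Type: data, Total Size
pack_count = 43_529             # File Type: data, Count
data_blob_bytes = 2.696 * TIB   # Blob Type: data, Total Size
data_blob_count = 3_640_242     # Blob Type: data, Count
index_bytes = 197.403 * MIB     # File Type: index, Total Size

def summarize() -> dict[str, float]:
    """Derive per-pack, per-blob and per-TiB averages from the raw totals."""
    return {
        "avg_pack_mib": pack_bytes / pack_count / MIB,
        "avg_data_blob_kib": data_blob_bytes / data_blob_count / KIB,
        "restore_to_stored_ratio": restore_size / pack_bytes,
        "index_mib_per_stored_tib": (index_bytes / MIB) / (pack_bytes / TIB),
    }

if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

The average pack file size comes out close to the configured pack size of 64 MiB, and the restore size is a bit more than twice the stored data.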

thiscantbeserious · September 14, 2023, 10:50am

I briefly tried Kopia myself. While it seems really nice and compiled successfully on i686, setting up a repository server turns out to be a pain in the ass, because they enforce TLS → HTTPS with a valid (!) certificate, so even something like a DDNS certificate from Synology is out of the question here …

That is in addition to basic auth AND user@host + pw authentication + repository password …

From benchmarks I’ve seen, Kopia handles lots and lots of small files quite a bit better - so I’ll likely try again later … but for now restic + rustic + rclone is sufficient for me …

… just in case: in-place restores are a must unless your backup repository’s backbone is really strong and you’re not hosting at home.

thiscantbeserious · September 21, 2023, 8:18am

Little bit off-topic but a follow-up fair warning for anyone looking into Kopia as a “reliable” backup solution:

The repository server is nuts. I finally got it to run (every client is supposed to hold a SHA-256 signature of the TLS key as a security measure, which makes it really impractical with, for example, Let’s Encrypt/key rotation: when the key rotates on the server → every client has to adjust its signature) - then it stopped connecting after I restarted it once and created a brand-new repository.

There are lots of layers and security measures … but it seems like they’re hitting a wall when it comes to real-world usability. No clue how people even managed to push 100s of TB into a solution like this … most likely using the UI and uploading movie collections or, from a lab/cloud perspective, data that isn’t too important - feels really, really icky.

TLDR: Not once did I have an issue like this with Borg or restic - I don’t think I’d ever trust a solution like this with my important data …

fede · September 21, 2023, 11:07pm

Thanks a lot for the info.