Anyone use restic with large data sets?

I’m considering using restic to back up ~500TB of scientific data. I’ve had trouble finding anyone with experience using it at that scale. Any stories?

(The underlying storage would be an Oracle ZS4-4 appliance. Yes, I know about zfs send. The client machines don’t have zfs, so that’s not an option.)

Some threads I found on the forum about large datasets. Please keep in mind that restic’s performance depends on a few factors like hardware, backend and dataset (number of files and snapshots) etc.

Note that “amount of data” is less important to restic performance than “number of files.” We’ve seen cases where a much smaller repository demands many GB of RAM to operate on because of the size of its indexes, since the backups contain hundreds of thousands of tiny files.

If your files are each at least a few MBs then you will have a much easier time using restic.
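To get a feel for which side of that line a data set falls on, something like the following could be run before the first backup. It is only a rough sketch: the function name `survey_tree` and the 1 MiB cutoff for "small" are my own assumptions, not anything restic defines, and it only counts files and bytes without touching restic at all.

```python
import os


def survey_tree(root, small_threshold=1024 * 1024):
    """Count files under root and how many are below small_threshold bytes.

    small_threshold (1 MiB by default) is an arbitrary cutoff chosen for
    illustration; it is not a value taken from restic.
    """
    total_files = 0
    total_bytes = 0
    small_files = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are not followed
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total_files += 1
            total_bytes += size
            if size < small_threshold:
                small_files += 1
    mean_size = total_bytes / total_files if total_files else 0.0
    return {
        "files": total_files,
        "bytes": total_bytes,
        "small_files": small_files,
        "mean_size": mean_size,
    }
```

Calling `survey_tree("/path/to/data")` returns the file count, the total size, the number of files under the cutoff and the mean file size. A large file count combined with a small mean size would suggest the memory-hungry case described above.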

I’m currently testing restic 0.9.4 with an 11TB data set of ~5M files. The backend is sftp. The backup part works fine. After the initial hassle of getting 11TB to the remote end, subsequent sweeps only take about 1h. I noticed, though, that it can consume up to 5-6GB of RAM, but I can afford that on a server with 128GB of RAM.

The restore part looked problematic, but it seems that there will be a solution soon (see the thread “Restic 0.9.4 is still slow on restore (sftp backend)”).

The real pain is the ‘prune’ function. It took 6 days to complete it. Since I have ssh access to the remote storage server I’m trying to run ‘prune’ locally on it now in the hope that it will work faster with the direct access to the repository. It has been running for 28h so far and has not finished yet.

Hello everyone,

Posting my experience, as requested: after many months and a lot of trouble, I’ve finally managed to successfully back up our local NAS server to our business GSuite Google Drive account using restic.

The source data resides in 3 ZFS pools totaling ~62 million files and ~26 TiB, and each pool is first replicated from the main server to a dedicated backup server (via zfs send | recv) and then backed up from this backup server to the cloud using restic backup for each individual pool (so we have a separate snapshot for each pool) all in a single restic repository.

It took almost nine months until we had all the data in the cloud, initially due to not having a dedicated backup server with enough RAM to run restic, then due to insufficient internet bandwidth, and in the last couple of weeks due to restic backup locking up intermittently. This happened a few times and unfortunately, even with a lot of effort and time spent and a great deal of help from @fd0 (BTW, thanks for everything, @fd0!), we so far haven’t managed to even diagnose (much less fix) the root cause of those lock-ups; the solution was to kill the restic process and start over every time (fortunately without having to upload all of the data again, thanks to restic’s deduplication).

Now we run restic backup daily to update these backups, and it takes almost a full day (anywhere from 19 to 22h) for the three pools – we are trying to optimize this but with no success so far. All we can say is that it’s not CPU, memory, local disk bandwidth or internet bandwidth that’s holding restic back – the issue seems to be internal to restic itself. We are continuing to try to resolve this, but in the short term we worked around it by moving the backup for the largest pool to Friday nights only – so it has the whole weekend to complete, and during the week the other two pools have more than enough time to finish.
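For anyone wanting to copy the Friday-night workaround, the selection logic could be sketched roughly like this. The pool names, mount paths and repository string are all hypothetical placeholders, and I am assuming the two smaller pools keep running every day while the largest one is added only on Fridays; the actual job setup is not described here. The only restic invocation used is the plain `restic -r <repo> backup <path>` form.

```python
import datetime

# Hypothetical pool names and mount points, for illustration only.
POOLS = {
    "pool_small_a": "/pools/small_a",
    "pool_small_b": "/pools/small_b",
    "pool_large": "/pools/large",
}
LARGEST_POOL = "pool_large"
FRIDAY = 4  # datetime.date.weekday(): Monday is 0, Friday is 4


def pools_for(day):
    """Return the names of the pools to back up on the given date."""
    names = [name for name in POOLS if name != LARGEST_POOL]
    if day.weekday() == FRIDAY:
        names.append(LARGEST_POOL)  # the long run gets the weekend to finish
    return names


def backup_commands(day, repo):
    """Build one restic command line (as an argument list) per selected pool."""
    return [["restic", "-r", repo, "backup", POOLS[name]] for name in pools_for(day)]
```

A wrapper could then pass each list from `backup_commands(datetime.date.today(), repo)` to `subprocess.run`. Whether the smaller pools should be skipped while the weekend run is still in progress is a separate decision this sketch does not make.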

The jury is still out on recovery – I managed to finish my first recovery today, after some more trouble, covering a relatively small subset of the whole backup (fewer than 185K files using up 72GiB), but I found that two files had their content corrupted (in one of them, 13 bytes have apparently randomly changed values, and in the other 8 bytes were zeroed out). Perhaps this was not restic’s fault, as I did not make a snapshot of the restored files right after restic finished, and the restore was to a network shared directory where conceivably something else could have messed with it – so I will try to repeat that in the near future, in a more controlled setting this time, so as to be able to pinpoint exactly where the corruption was introduced.

EDIT: the corruption seems to be real: I repeated the restore to a local protected directory and the exact same corruption is still showing; we’re trying to track down its cause, but so far it doesn’t look like hardware (the machine has ECC RAM and the source comes from ZFS). Very worrying… :-/
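In case it helps others check their own restores, a comparison of this kind can be sketched as below. This is not the exact procedure used above, just a minimal illustration: the function names are made up, and it assumes `source_root` points at an unchanged copy of the data as it was at backup time (for example a ZFS snapshot) and `restored_root` at the matching directory inside the restore target.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b, chunk_size=1024 * 1024):
    """Return the offset of the first differing byte, or None if identical."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # one file is a prefix of the other
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None
            offset += len(chunk_a)


def compare_trees(source_root, restored_root):
    """List (relative_path, problem) for files missing or different after restore."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(restored_root, rel)
            if not os.path.isfile(dst):
                problems.append((rel, "missing"))
            elif sha256_of(src) != sha256_of(dst):
                where = first_difference(src, dst)
                problems.append((rel, "differs at byte %d" % where))
    return sorted(problems)
```

An empty list from `compare_trees` means every source file has a byte-identical restored copy. It says nothing about extra files in the restore target or about metadata such as permissions and timestamps.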

Overall, I’m very happy with restic – it has a great user and developer community, and even with the troubles I went through, it has enabled me to back up a volume of data to the cloud that would otherwise have been impossible.

Cheers,
– Durval.

@durval Thank you for sharing your experience.

I would really love to see the difference in your case when using a backend that, at least in my experience, is much faster, such as S3.

Maybe you want to get a trial account at (no ad - just for example) wasabi.com and play around with it.

You are welcome! :slight_smile:

S3? Sure, just send a couple thousand dollars my way and I will be happy to do that for you :wink:
(seriously, we did some estimates on using S3 and it was crazy expensive: for what 12 months of it would cost, we could both buy a complete duplicate of our NAS server to place in a remote location in a nearby city and pay for electricity and internet fiber connectivity there – twice over. So S3 is definitely out due to cost)

When we were starting all that last year, we did some experimenting with rclone directly to Backblaze B2 (i.e., no restic – we had not heard of restic yet at that time) and it was not much better than Google Drive, which is why we are sticking with the latter.

Just checked Wasabi and pricing is much more reasonable, would cost ~$100/month for our restic cloud storage needs, which would bring it into the realm of “feasible”. But, it seems their trial account limits usage to 1TB, which isn’t nearly enough for a real test :-/

Cheers,
– Durval.