Performance with very large (~400TB) repo?

dmd · January 26, 2019, 2:49pm

I’m thinking of using restic to back up some servers which deal with fairly large datasets.

The total amount to be backed up would be about half a petabyte, with us adding a few hundred gigabytes more each day.

The underlying storage would be an Oracle ZS4-4 appliance, serving via SFTP. Alternatively, if SFTP turns out to be a major bottleneck, I could run the REST server if that would be faster, but only if I can somehow get a Solaris 11 restic binary. Or I could mount over NFSv4, but I’d greatly prefer not to.

Anyone have any thoughts on this? What kind of performance should I expect at these volumes?

764287 · January 27, 2019, 2:20pm

I think restic can generally handle repositories of this size, but they require a lot of memory, and commands like check and prune can be really slow (as can be seen here).

While most of my repositories use SFTP as the backend and it works pretty well, it’s probably the worst backend performance-wise. With such huge data sets you should try out other, more efficient protocols. Maybe restic with rclone and OpenStack Swift?

dmd · January 27, 2019, 4:18pm

The main issue for me is that, unless I do some serious surgery, the machine with very fast access to the storage appliance (which itself doesn’t actually let you get a shell - it just serves files via SFTP, NFS, FTP, and a few other ways) is a Solaris 11 box. I could probably get a minio binary on there for the S3 protocol, but anything complex with lots of requirements is really hard. (I tried getting Borg on there and it was just a nonstarter - the latest Python I was able to successfully get, for instance, was 3.3.)

I’ll start doing some speed tests on Monday.

matt · January 28, 2019, 11:10pm

The rest-server compiles for Solaris; it uses HTTP(S), which is very fast compared to SFTP.

dmd · January 29, 2019, 1:21am

I got rest-server cross-compiled for Solaris today and started loading data, and indeed the ingestion is extremely fast. I’ll report back when I’ve got the whole initial backup loaded (which may take a week or so).
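For a rough sense of scale, the arithmetic behind "a week or so" can be sketched as below. The concrete figures are assumptions chosen to match the numbers mentioned in this thread (about 500 TB for the initial load, 7 days, and 300 GB as a stand-in for "a few hundred gigabytes" per day); `requiredMBps` is a hypothetical helper, not part of restic or rest-server.

```go
package main

import "fmt"

// requiredMBps returns the sustained throughput, in decimal MB/s, needed to
// move totalBytes within the given number of days.
func requiredMBps(totalBytes, days float64) float64 {
	const secondsPerDay = 86400.0
	return totalBytes / (days * secondsPerDay) / 1e6
}

func main() {
	const (
		GB = 1e9
		TB = 1e12
	)

	// Assumption: an initial load of about 500 TB, finished in 7 days.
	fmt.Printf("initial load:    %.1f MB/s\n", requiredMBps(500*TB, 7))

	// Assumption: 300 GB of new data per day.
	fmt.Printf("daily increment: %.2f MB/s\n", requiredMBps(300*GB, 1))
}
```

Under these assumptions the initial load needs roughly 827 MB/s sustained end to end, while the daily increment needs only about 3.5 MB/s, so the backend choice matters far more for the first backup than for the steady state.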

dmd · January 30, 2019, 10:35pm

With the repo size at 120 TB, I’m already running up against memory limits. Is this normal? restic backup crashes while it does the initial scan.

This is from a host with 32 GB of RAM.

matt · January 30, 2019, 11:44pm

Probably, yes. Memory utilization can be optimized a lot.

@fd0 does scanning actually perform any useful function other than a progress bar? If scanning is technically optional, I could work on a way to make an option to disable it…

fd0 · January 31, 2019, 7:18am

It’s just for the progress bar and runtime estimation; it’s not strictly necessary. I’d vote against having an option for disabling it: in my experience, scanning does not cause the backup operation to take longer. On the contrary, scanning sometimes pulls the directory structures into the memory cache, so the backup can even be faster when a scan was done first…

You can disable it by commenting out these lines:

https://github.com/restic/restic/blob/1107eef2150c6f886bf599d5bc4d29cc699bbf31/cmd/restic/cmd_backup.go#L491-L498

    sc := archiver.NewScanner(targetFS)
    sc.SelectByName = selectByNameFilter
    sc.Select = selectFilter
    sc.Error = p.ScannerError
    sc.Result = p.ReportTotal

    p.V("start scan on %v", targets)
    t.Go(func() error { return sc.Scan(t.Context(gopts.ctx), targets) })

I highly doubt that this will have any effect.

In my experience, memory usage goes up with the number of blobs (and therefore the number of files) in the repo: a few large files are not a problem at all, but lots of small files (like on a mailserver) will cause memory issues.
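The relationship between blob count and memory can be sketched with a back-of-the-envelope estimate. Everything numeric here is an assumption for illustration only: the per-entry cost of 100 bytes and both average blob sizes are placeholders, not measured restic values, and `estimateIndexBytes` is a hypothetical helper. Only the 120 TB repository size comes from this thread.

```go
package main

import "fmt"

// estimateIndexBytes gives a rough in-memory index size, assuming every blob
// in the repository needs one fixed-size index entry.
func estimateIndexBytes(repoBytes, avgBlobBytes, bytesPerEntry float64) float64 {
	blobs := repoBytes / avgBlobBytes
	return blobs * bytesPerEntry
}

func main() {
	const (
		repoBytes     = 120e12 // 120 TB, the repo size reported above
		bytesPerEntry = 100.0  // placeholder per-blob cost, not a measured value
	)

	// Placeholder average blob sizes: large files produce big blobs,
	// many small files produce many small blobs.
	cases := []struct {
		name         string
		avgBlobBytes float64
	}{
		{"few large files", 1e6},
		{"many small files", 16e3},
	}

	for _, c := range cases {
		est := estimateIndexBytes(repoBytes, c.avgBlobBytes, bytesPerEntry)
		fmt.Printf("%-17s ~%.0f GB of index\n", c.name, est/1e9)
	}
}
```

Whatever the real per-entry cost is, the estimate grows in inverse proportion to the average blob size, which is why a repository full of small files can exhaust memory long before one of the same total size made of large files.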

cdhowie · January 31, 2019, 7:32am

If that’s the case, shouldn’t -q have the side-effect of disabling the scanner since there’s no progress display? I recall discussing this before and you said that -q doesn’t do that anymore.

mlissner · February 2, 2019, 12:45am

We have a significantly smaller dataset (just a few TB), and…it’s bad. I’m a huge fan of restic, but it’s just not suited for this kind of thing. Prune, for example, is an essential function, but it requires just an insane amount of memory if you have a lot of files like we do. We haven’t been able to do a prune since we started using restic. So far we just let the backup grow and grow (at a small expense). There’s hope that this will one day get fixed, so we’re OK waiting until then.

I think there’s little harm in trying restic, but I’d be almost shocked if you ended up using it. It makes me sad to say that so bluntly, but I think this just isn’t a use case that works yet.

I actually don’t know what you’ll be able to use to back up this much data on a regular basis. We haven’t found anything good — just a scan of our data takes too long, really — so I think we’re moving to ZFS so that we can have snapshots and filesystem-level backups. We’ll see if that helps.

dmd · February 2, 2019, 12:58am

Hmm. My underlying storage is ZFS. What would you recommend, then? Just rsync + zfs snapshots?

dmd · March 4, 2019, 1:58am

We ended up going with Bacula, which so far seems to handle our scale (it has come to about 600 TB in total so far). The interface is awful and it’s way more complex than it needs to be, but it doesn’t seem to bog down at all at these sizes, whereas restic basically stopped working entirely for us above about 100 TB.

I’d love to see restic work at petabyte or near-petabyte scale, as it’s so much simpler conceptually.

whereisaaron · March 4, 2019, 3:29pm

Oh yeah, we used to use Bacula back when we still used tapes. As I remember a lot of the complication was around removable media management. Does it work for cloud storage backends?

dmd · March 5, 2019, 1:34pm

Yes, though I haven’t tried it – it would be cost-prohibitive for us. Our use case is purely disaster-recovery, so “tapes in a different city” seems to be ideal and cheap.

fd0 · March 16, 2019, 10:06am

I’m glad that you found a solution which works for you! Scaling is hard, and we’ll try to make restic work with repos this size eventually.