Slow backups with B2 - what's the bottleneck? - 102TB / 2.8M files

Hello,

I have been experimenting with Restic and like all the functionality but the performance for some reason is poor, and I would like to understand if there is anything I can do to speed it up.

I did my initial backup over NFS to a set of drives (“Fireball” in Backblaze speak), and then the backup files were transferred to B2. I thought that once the initial backup was done locally, the subsequent updates over the network would be fast. Over local NFS, the whole operation took about 2 weeks.

I have quite a big set of files, about 2,800,000 (2.8M) files and about 102 TB, and I am backing it up over a 1Gbps connection to Backblaze.

I have now done this two times, and the backup feels slow:

  • My first backup was on a Linux box (which had the 102TB of files locally), but it had only 8GB of memory (and an Intel Quad Q8400 @ 2.66GHz), and the restic process died just as it was about to finish, or rather it was killed by Linux because it ran out of memory. (I had to set GOGC=20 to get it to run at all; otherwise the process was killed during the index load, before the backup even started.) The backup had about 5TB of new files when I did it. The whole process took about 22 days, and most of the time it was not transferring files but just going through the files locally. CPU usage wasn’t very high either, so where is the time spent?
  • I am now running the backup on a Synology 1821+ device (which has 32GB of memory). It’s been running for 24h now and the current ETA for completion is over 800h, which is something like 30 days. The CPU usage is not very high (approx. 20%), nor is the network output saturated.

Am I correctly assuming the process is mostly waiting for local IO as all the files must be read (and hashed)?

I am hoping that once the second backup is done on the same host, it will be a speedier process thanks to caches, but I am wondering where the time is spent now. Are there ways to speed things up?

Replying to myself here. My backups are slow because reading over the network is slow due to congestion. So I guess subsequent restic backups done on the same host should be a lot faster because of caches. I just calculated that merely reading the 102TB over 1Gbps takes almost 11 days, and I can’t just assume the full link capacity here. The slow network also explains the low CPU utilization.
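As a back-of-the-envelope check of that figure, here is a minimal sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bit/s), and the `efficiency` parameter is a hypothetical knob for how much of the link is actually usable; neither assumption comes from the thread.

```python
def transfer_days(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to read size_tb terabytes over a link_gbps link.

    efficiency is the (assumed) usable fraction of the nominal link rate.
    """
    bits = size_tb * 1e12 * 8                      # payload in bits
    usable_bits_per_s = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_s / 86400        # 86400 seconds per day


# Full line rate versus a link that only delivers 85% of its nominal speed.
print(f"{transfer_days(102, 1.0):.1f} days at full line rate")
print(f"{transfer_days(102, 1.0, 0.85):.1f} days at 85% of line rate")
```

At the full nominal rate this comes out a bit under 10 days, so a figure of almost 11 days already implies somewhat less than the full link capacity; with real congestion the number grows accordingly.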

I’d really recommend splitting the backup over several repositories. In particular, prune will probably be rather slow for such a large repository.

Thanks Michael for comments.
Is there an easy way of splitting without retransmitting?

I am loving Restic btw, elegant design and good documentation. Thanks for your work on it.

Unfortunately no.

100TB!! Jesus Christ:)

It’s almost like Backblaze itself. What do you hoard?

I would divide it into 50–100 repos. One repo makes no sense at all, even if it succeeds.

100TB is actually quite little these days… We save lots of videos…

If there are recommendations for the size of a repository, it would be good to mention them in the restic documentation. I have now wasted a huge amount of time, money and resources trying to work on the one backup, which I may need to redo from scratch.

I am not sure how splitting my repo into some 50 repos would make the job any simpler. There is no natural split I could do, except maybe by some artificial time intervals.

Am I just using the wrong tool?

Outside ZFS, I don’t think it would work smoothly with any backup application. These backup applications operate similarly.

100 TB is crazy. The amount of RAM, cache and CPU required is substantial. It would be interesting to see what others say.

Breaking it into several parts most likely works. Even if one repository works, I wouldn’t do it this way.

Just my 2cts:

  • If you back up only/mainly videos, I think the chunking used in restic may not be optimal. I don’t know anything about video format details, but I can imagine that fixed-size chunks would do a similarly good job, and I reckon larger chunk sizes would also work quite well here. Currently the average chunk size for your use case will most likely be around 1.5 MiB, which means you’ll arrive at 66M data blobs…
  • Moreover, pack files currently average about 5 MiB in size. That means for your 100TB you’ll get roughly 20M pack files, which IMHO is far too many. There are open PRs which would tackle this problem, but they are not yet merged into restic.
  • Taking into account the large number of blobs and pack files, you’ll get 1) quite high memory requirements and 2) high costs for processing your backups, especially for pruning, as @MichaelEischer correctly pointed out. Both issues would profit from larger chunks and larger pack files for your use case. Also, not all potential optimizations for efficiently processing such large repositories have been implemented (yet).
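The blob and pack counts above follow from dividing the repository size by the average object size. A minimal sketch of that arithmetic, assuming 100 TB means 10^14 bytes and ignoring deduplication and compression; the 64 MiB pack size is a purely hypothetical value chosen to show how the count scales, not a restic setting:

```python
MIB = 2**20


def object_count(total_bytes: float, avg_object_bytes: float) -> float:
    """Approximate number of objects of a given average size in total_bytes."""
    return total_bytes / avg_object_bytes


total = 100e12  # 100 TB as a round figure (assumed decimal)

blobs = object_count(total, 1.5 * MIB)        # average chunk size from the post
packs = object_count(total, 5 * MIB)          # average pack size from the post
packs_large = object_count(total, 64 * MIB)   # hypothetical larger pack size

print(f"data blobs:               ~{blobs / 1e6:.0f}M")
print(f"pack files (5 MiB):       ~{packs / 1e6:.0f}M")
print(f"pack files (64 MiB, hyp): ~{packs_large / 1e6:.1f}M")
```

The results land in the same ballpark as the 66M blobs and 20M pack files quoted above; the exact values depend on whether MiB or MB and TB or TiB are used.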

Here is a rough estimation of your memory needs:
66M blobs in theory take ~3.8 GiB and 20M pack files in theory take ~640 MiB, which sums up to around 4.5 GB of memory. This is the theory, but in practice you’ll have to multiply this by a factor of 4 or 5, so I guess you’ll need around 24GB of memory.
(Side note: my Rust implementation, rustic, appears to be much less memory hungry; here I would estimate around 6 GB of memory for backing up.)
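The memory estimate above can be reproduced with a few lines. This is only a sketch of the arithmetic: the 3.8 GiB and 640 MiB inputs and the factor of 4 to 5 are the figures quoted in the post, and the function name is made up for illustration.

```python
def estimate_memory_gib(blob_index_gib: float,
                        pack_index_mib: float,
                        overhead_factor: float) -> float:
    """Theoretical index size (blobs + packs) times a practical overhead factor."""
    theory_gib = blob_index_gib + pack_index_mib / 1024  # convert MiB to GiB
    return theory_gib * overhead_factor


# Figures quoted in the post: ~3.8 GiB for 66M blobs, ~640 MiB for 20M packs.
for factor in (4, 5):
    needed = estimate_memory_gib(3.8, 640, factor)
    print(f"overhead factor {factor}: ~{needed:.1f} GiB")
```

With these inputs the range comes out at roughly 18 to 22 GiB, so the figure of around 24GB is best read as a rounded-up upper bound rather than an exact result.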

So, to conclude: IMO the restic format is suitable for your use case, but the defaults for chunking and pack file size are unfortunately far from optimal for your use case. So you either use a self-patched version or live with the shortcomings.

Thanks. Very nice info. Actually your calculation was not very far off. My process uses 8GB of memory (which again isn’t very much so not a big problem these days, except maybe for CPU caching) but I tuned the GOGC variable to reduce the memory usage, so more aggressive garbage collection is used.

Something like this would be very good information to include in the restic documentation btw. I don’t think there is an easy way to change the pack size or chunk size now without a complete re-upload.

I.e. I propose adding a chapter to the documentation with a title like “Designing your repository” that discusses the various tradeoffs and optimizations. Or else some kind of analysis tool that could be run before the initial backup to optimize the settings.

So far my incremental backup has been running for 16 days and has checked about 74% (mostly not updated, as there is only 10TB of new content). I can report on subsequent runs, as this is the first run on the new host. I hope new runs will be faster thanks to the local cache.
