I’m setting up a new Restic installation on a ~250GB set of data.
High level: Should I use one single repository, or split into 2 or more?
Of the two data sets, the first is 94GB across 861,761 files; the second is 130GB across 40,555 files. There will be no duplicated content between them.
I’m backing up directly to B2. While a single repository is easier to work with, I worry about maintenance operations (prune) and about recovery from scratch in a full disaster scenario (where one set is very important, while the other is completely unimportant and can wait weeks).
Would it make sense to use a single repository or multiple repositories?
Advantages of using a single repository:
Configuration is simpler; there is only one repository to run operations against, so you don’t have to run prune twice, for example.
A single restic snapshots invocation will tell you about all of your backups, simplifying reporting.
Even if there does not appear to be duplicated content, there very well could be, particularly if the backups are full-system.
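To illustrate the reporting point, here is a rough sketch (the repository URLs and bucket names are placeholders, not your actual setup): a single repository needs one invocation, while split repositories need one per repository.

```
restic -r b2:mybucket:backups snapshots

# With two repositories, the same report takes a loop:
for repo in b2:mybucket:important b2:mybucket:bulk; do
    restic -r "$repo" snapshots
done
```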
Advantages of using multiple repositories:
Lower peak memory consumption for backup/prune operations.
If the two data sets are not accessible by the same users/roles, having separate repositories will allow you to restrict access to one from the other. This is not possible with one repository – if you can create backups, you can read everything in the repository.
Simplified configuration, reporting, scripting and usage are my primary motivations for wanting everything in one repository.
In general, sure. In this case, it won’t happen, as one set of data is already deduplicated, compressed and encrypted; although there technically is duplication, restic can’t see it.
I’m not super worried about memory consumption for this one, but rather, B2 bandwidth/requests and time that typical operations take to complete. This one is tough to simulate as I don’t yet have a few dozen snapshots over time.
If this is the case, then I would honestly be surprised if restic even offers you any benefits over simpler backup tools for this particular set. Deduplication of compressed and encrypted data is going to be totally ineffective. If it’s encrypted, you don’t need the encryption offered by restic.
The encryption and deduplication offered by restic will be largely negated, and the encryption of restic will be a waste of CPU.
If you plan to prune both on the same schedule, then any difference will likely be negligible.
Even if you don’t, I can’t imagine you’d prune more than once per day, and this isn’t too expensive of an operation on a repository of that size.
If you can prune once per week or less frequently, all the better.
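To make that concrete, a weekly prune can be driven from cron. The schedule, repository URL and retention flags below are illustrative, not a recommendation:

```
# crontab entry: forget and prune every Sunday at 03:00
0 3 * * 0  restic -r b2:mybucket:backups forget --keep-daily 7 --keep-weekly 4 --prune
```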
Agreed. However, Restic does a great job of offering snapshots (and an effective way to manage/purge older data), and since it is a good fit for all of our other needs it makes sense to stick with the same tool across the board even if I don’t need all of what Restic offers.
In other words, the human cost of maintaining and understanding two systems outweighs the CPU cost of encrypting twice, and Restic is a perfect fit for the other set of data.
Probably the same schedule. I may run forget with different parameters for each set, but that is easy enough using tags.
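A sketch of what that could look like (the tag names, paths and retention values are made up for illustration):

```
# Tag each data set at backup time
restic -r b2:mybucket:backups backup --tag critical /srv/critical
restic -r b2:mybucket:backups backup --tag bulk /srv/bulk

# Then apply a different retention policy per tag within the same repository
restic -r b2:mybucket:backups forget --tag critical --keep-daily 14 --keep-weekly 8
restic -r b2:mybucket:backups forget --tag bulk --keep-daily 3 --keep-weekly 2
```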
With my other repositories, I have multiple machines using a shared repository stored locally; each machine applies its own forget routine, with one central server responsible for both pruning and creating an off-site mirror. The repositories are split along security/permission boundaries. But I can’t really determine what a forget or prune would cost me if I were paying for transactions, as my current environment stores all data locally and only periodically mirrors the results off-site via rclone.
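For reference, the per-machine split I’m describing looks roughly like this (repository path and mirror destination are placeholders; forget does not prune unless explicitly asked):

```
# On each machine: expire only that machine's snapshots, without pruning
restic -r /mnt/backup/repo forget --host "$(hostname)" --keep-daily 7

# On the central server: prune once for everyone, then mirror off-site
restic -r /mnt/backup/repo prune
rclone sync /mnt/backup/repo remote:backup-mirror
```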
My new environment is a bit different: this is a colocated server without enough local disk space to store a backup, so instead I am backing up to B2 directly, and Restic’s efficiency in terms of transaction costs and downloads is therefore an unknown to me.
I was planning on running a prune no more than once a week, or less often if the B2 transaction and/or download costs exceed the storage cost of the data that ends up being pruned.
Interesting, I hadn’t realized the cache would get that large. I do have the available space right now, although that varies considerably over time. I’ll note in our plans that the cache size will increase, and either adjust the number of snapshots retained or ensure the disk space remains available.
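For anyone watching this, restic can report on and trim its local cache (the default location is ~/.cache/restic):

```
# List known cache directories and their repositories
restic cache

# Remove caches for repositories that are no longer used
restic cache --cleanup

# Check total cache disk usage directly
du -sh ~/.cache/restic
```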