Backup to AWS S3 Glacier

Our company wants to back up 500 TB to AWS S3 Glacier, and I was wondering if anyone has already done such a thing with restic.
One interesting question is whether it needs to read data back later (and how much) to create incremental backups, or whether it keeps sufficient local data about chunk hashes.

The other interesting question is how a restore operation would work with such a slow archive (for disaster recovery only).

Welcome @bmwiedemann

First, restic currently does not officially support cold storage - it is designed to have instant access to the storage backend.
Moreover, there has been quite some discussion about whether AWS Glacier is really cheap storage for backups, as the retrieval costs are probably too high to profit from the low storage costs. But for disaster recovery this could still be an option.

I’m trying to make some proposals which would allow restic to work with cold storage, see Discussion: How to support cold storages? · Issue #3202 · restic/restic · GitHub

Now, about your questions:

Basically, restic needs regular access to its metadata to run and to speed up backups, checks, etc. Metadata is everything except the file contents, which restic stores in so-called “data pack files”. See the link above for which restic functionality works with access to the metadata only.
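For orientation, this is the standard layout of a restic repository: only the pack files under data/ hold file contents, and even there the tree packs are metadata.

```
<repo>/
├── config       # repository settings (metadata)
├── keys/        # encrypted key files (metadata)
├── snapshots/   # snapshot descriptions (metadata)
├── index/       # index files (metadata)
├── locks/       # lock files (metadata)
└── data/        # pack files: tree packs (metadata) + data packs (file contents)
```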

So the goal is to split your repo such that the metadata is always available (“hot”) and the data pack files are saved in some “cold” storage, in your case S3 Glacier.

With AWS, you can specify the storage class by path in the repo (e.g. save everything under /data/ in Glacier and use S3 Standard for all other paths in your repo). The drawback is that under /data/, all data pack files are saved as well as all tree pack files (which are metadata and frequently used).
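As a sketch, such a per-path storage class can be set up with an S3 lifecycle rule that transitions everything under the data/ prefix to Glacier. The bucket name here is a placeholder; adjust the prefix if your repo lives under a subpath (e.g. "myrepo/data/").

```bash
# lifecycle.json: transition all objects under data/ to Glacier immediately
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "restic-data-to-glacier",
      "Filter": { "Prefix": "data/" },
      "Status": "Enabled",
      "Transitions": [ { "Days": 0, "StorageClass": "GLACIER" } ]
    }
  ]
}
EOF

# apply the rule to the bucket (bucket name is a placeholder)
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-restic-bucket \
  --lifecycle-configuration file://lifecycle.json
```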
So, you have basically two options:

Either you use the experimental PR Add possibility to specify extra repo for "hot" data by aawsome · Pull Request #3235 · restic/restic · GitHub, which allows you to separate the hot from the cold part. BTW, this would also work if you have two completely different storages for the hot and the cold parts. I would, however, not (yet) recommend this for a production repository.

Or you just save everything under /data/ in S3 Glacier and rely on the local cache restic creates and uses to access all tree pack files.
This also means that if you lose your local cache, recreating it will get ugly: the way to do it would be to identify your tree pack files, warm them up (see the warm-up sketch below), and run your backups to re-create the cache.

Again, see Discussion: How to support cold storages? · Issue #3202 · restic/restic · GitHub for the commands which need access to the data pack files. To perform one of these operations, you first need to manually warm up (i.e. temporarily restore to S3 Standard) the needed data pack files. There are some PRs which implement --dry-run for some operations, which would tell you which packs these are. For restore, this is `restore --dry-run` option to enable cold storage based repositories · Issue #2796 · restic/restic · GitHub.
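As a rough sketch, warming up a set of pack files could look like this with the AWS CLI. The bucket name, pack list, and restore window are placeholders; note that restore-object creates a temporary readable copy rather than changing the storage class.

```bash
# packs.txt contains one pack ID per line, e.g. taken from a --dry-run of the
# operation you plan to run (see the issues/PRs linked above).
# Restic's default layout stores pack files under data/<first two hex chars>/<pack ID>.
BUCKET=my-restic-bucket
while read -r pack; do
  aws s3api restore-object \
    --bucket "$BUCKET" \
    --key "data/${pack:0:2}/${pack}" \
    --restore-request 'Days=7,GlacierJobParameters={Tier=Bulk}'
done < packs.txt
```

A Bulk restore typically takes hours; `aws s3api head-object` on a key shows a `Restore` field containing `ongoing-request="false"` once the temporary copy is readable.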

Another possibility for a disaster-recovery-only repository is to just move everything from Glacier to S3 Standard. After that, all restic operations are fully functional.
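As a sketch, “moving” out of Glacier is a two-step process: first request a (temporary) restore of the objects, then copy each object over itself with the new storage class. Bucket name and restore window are placeholders again.

```bash
BUCKET=my-restic-bucket

# 1. request a temporary restore of every pack file under data/
aws s3api list-objects-v2 --bucket "$BUCKET" --prefix data/ \
  --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r key; do
  aws s3api restore-object --bucket "$BUCKET" --key "$key" \
    --restore-request 'Days=7,GlacierJobParameters={Tier=Bulk}'
done

# 2. once the restores have completed, rewrite the objects in place as S3 Standard
aws s3 cp "s3://$BUCKET/data/" "s3://$BUCKET/data/" \
  --recursive --storage-class STANDARD --force-glacier-transfer
```

At 500 TB, issuing the restore requests via S3 Batch Operations would be more practical than a per-object loop.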


@bmwiedemann Did you find a solution with S3?

So far, we back up to local storage and use `aws s3 sync /local/path s3://bucket/remote/path` to copy that to S3.
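If you want the synced copy to land directly in Glacier instead of relying on a lifecycle rule, `aws s3 sync` also accepts a storage class (paths are placeholders). Note that this puts the metadata in Glacier too, so it only works if you keep a local cache around.

```bash
aws s3 sync /local/path s3://bucket/remote/path --storage-class GLACIER
```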

We also kept the caches around, for a better chance of finding the right data without pulling everything down (because that is expensive with Glacier).
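If it helps, the cache can be pinned to a persistent location so it does not get lost to cleanup; `--cache-dir` is a standard restic option, and the repository and paths here are placeholders.

```bash
# keep the restic cache in a persistent, backed-up location
restic -r /local/path \
  --cache-dir /srv/restic-cache \
  backup /data/to/back/up
```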

We might go the easy route: keep local-ish backups as well and use AWS only as a last resort. That will cost quite some extra money, though (even if we used my preferred Ceph for it).