Our company wants to back up 500 TB to AWS S3 Glacier, and I was wondering if anyone has already done something like this with restic.
One interesting question is whether restic needs to read data back later (and how much) to create incremental backups, or whether it keeps enough local data about chunk hashes.
The other interesting question is how a restore operation (for disaster recovery only) would work with such a slow archive.
First, restic currently does not officially support cold storage; it is designed to have instant access to the storage backend.
Moreover, there is quite some discussion about whether AWS Glacier is really a cheap storage option for backups, as the retrieval costs are probably too high to benefit from the low storage costs. But for disaster recovery only, it could still be an option.
I’m trying to put together some proposals that would allow restic to work with cold storage, see
Now, about your questions:
Basically, restic needs regular access to its metadata to run and speed up backups, checks, etc. Metadata is everything except the file contents, which restic saves in so-called “data pack files”. See the link above for which restic functionality works with access to metadata only.
So the goal is to split your repo such that the metadata is always available (“hot”) and the data pack files are saved in some “cold” storage, in your case S3 Glacier.
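For context, the top level of a restic repository looks like this (the path is just a placeholder): only data/ holds the pack files, and both the data packs (file contents) and the tree packs (directory metadata) end up in there, while config, index/, keys/, locks/ and snapshots/ are small and read frequently.

```
$ ls /srv/restic-repo
config  data  index  keys  locks  snapshots
```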
With AWS, you can specify the storage class by path within the repo (e.g. save everything under /data/ in Glacier and use S3 Standard for all other paths in your repo). The catch is that /data/ contains not only all the data pack files but also all the tree pack files (which are metadata and are accessed frequently).
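As a rough sketch of how that could be set up (bucket name and rule ID are just placeholders, and the prefix needs adjusting if your repo doesn’t live at the bucket root): a lifecycle rule that transitions everything under data/ to Glacier and leaves the rest in S3 Standard.

```sh
# lifecycle.json: move everything under data/ to Glacier as soon as possible;
# all other prefixes (config, index/, keys/, snapshots/, locks/) stay in S3 Standard
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "restic-data-to-glacier",
      "Filter": { "Prefix": "data/" },
      "Status": "Enabled",
      "Transitions": [ { "Days": 0, "StorageClass": "GLACIER" } ]
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
    --bucket my-backup-bucket \
    --lifecycle-configuration file://lifecycle.json
```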
So, you have basically two options:
Either you separate the tree pack files from the data pack files, so that the tree packs stay in S3 Standard and only the data packs go to Glacier.
Or you just save everything under /data/ in S3 Glacier and rely on the local cache restic creates and uses to access all tree pack files.
This also means that if you lose your local cache, recreating it will get ugly: you would have to identify your tree pack files, warm them up (temporarily restore them from Glacier), and then run your backups to re-create the cache.
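The warm-up itself would be a temporary Glacier restore per object, roughly like this (bucket name is a placeholder, tree-packs.txt is assumed to already contain the keys of the packs you need, and the tier/days values are just examples; since restic stores data and tree packs mixed under data/, producing that list is the hard part):

```sh
# Issue a temporary restore (warm-up) request for each pack file listed in
# tree-packs.txt, one object key per line, e.g. data/ab/abcdef0123...
while read -r key; do
  aws s3api restore-object \
      --bucket my-backup-bucket \
      --key "$key" \
      --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}}'
done < tree-packs.txt
```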
Another possibility for a disaster-recovery-only repository is to just move everything from Glacier back to S3 Standard. After that, all restic operations are fully functional.
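Very roughly, and with a placeholder bucket name, that move would be a two-step affair: first a temporary restore of everything under data/, then an in-place copy to the STANDARD storage class once the restores have completed. Both steps cost money at 500 TB, so treat this as a sketch only:

```sh
# 1) Request a temporary restore for every object under data/ (Bulk is the cheapest tier)
aws s3api list-objects-v2 --bucket my-backup-bucket --prefix data/ \
    --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r key; do
  aws s3api restore-object --bucket my-backup-bucket --key "$key" \
      --restore-request '{"Days": 14, "GlacierJobParameters": {"Tier": "Bulk"}}'
done

# 2) After the restores have finished (Bulk can take many hours), copy the objects
#    onto themselves with the STANDARD storage class to make the move permanent
aws s3 cp s3://my-backup-bucket/data/ s3://my-backup-bucket/data/ \
    --recursive --storage-class STANDARD --force-glacier-transfer
```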
So far, we have backed up to local storage and used aws s3 sync /local/path s3://bucket/remote/path to copy the repository to S3.
We also kept the caches around, for a better chance of finding the right data without pulling everything down (because that is expensive with Glacier).
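If we stick with that approach, the hot/cold split described above could probably be approximated with two sync passes (paths and bucket are placeholders; this still leaves the tree packs in Glacier, so the local caches remain important):

```sh
# Everything except data/ (config, index, keys, snapshots) stays hot in S3 Standard
aws s3 sync /srv/restic-repo s3://my-backup-bucket/restic \
    --exclude 'data/*' --storage-class STANDARD

# The bulk of the repo (pack files under data/) goes straight to Glacier
aws s3 sync /srv/restic-repo/data s3://my-backup-bucket/restic/data \
    --storage-class GLACIER
```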
We might go the easy route, keep local-ish backups as well, and only use AWS as a last resort. That will cost quite a bit of extra money, though (even if we used my preferred Ceph for that).