Restic and S3 Glacier Deep Archive

Hi all,

This is less a question than some thought sharing.

A few days ago, AWS announced a new S3 tier called “Glacier Deep Archive”. As far as I can tell, it takes the existing Glacier product’s properties even further, most notably in terms of time-to-restore (comparison) and pricing. At $0.001 per GB-month, storage is an order of magnitude cheaper than what competitors such as B2 offer (restore prices are a different story, though). The downside is that with restore times of up to several hours, data isn’t available interactively.

Even if only the /data subdir is moved to Glacier, restic wouldn’t be usable for much more than just adding additional backups. For everything else (such as prune) data would need to be moved back to a standard S3 storage class first, which only makes sense after Glacier’s minimum storage duration of 180 days and costs the same as 2.5 months’ worth of storage. I see two possible scenarios based on restic’s current feature set:

  1. Append-only backups. Given the low storage prices you’d just keep old data, even more so if data is rarely deleted.
  2. A restore every >180 days, prune, then move back. Ideally, you’d only restore those packs which were written more than 180 days ago (a rough sketch follows below) and then prune snapshots older than 180 days. Restic shouldn’t need access to any files younger than that, though I haven’t tested this.
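To make scenario 2 a bit more concrete, here is a rough, untested sketch of how the “restore only packs written more than 180 days ago” step could look with the AWS CLI. The bucket name, repository prefix and cut-off handling are placeholders, not anything restic provides:

```bash
# Sketch only: list packs under the repository's data/ prefix that are older
# than 180 days and request a temporary restore for each of them.
BUCKET=my-restic-bucket                                  # placeholder
CUTOFF=$(date -u -d '180 days ago' +%Y-%m-%dT%H:%M:%SZ)  # GNU date

# (the LastModified string comparison in --query works with current AWS CLI versions)
aws s3api list-objects-v2 \
  --bucket "$BUCKET" --prefix "repo/data/" \
  --query "Contents[?LastModified<='${CUTOFF}'].Key" --output text |
tr '\t' '\n' |
while read -r key; do
  # Bulk is the cheapest retrieval tier; Deep Archive bulk restores can take
  # up to ~48 hours. The restored copies stay available for 14 days here.
  aws s3api restore-object --bucket "$BUCKET" --key "$key" \
    --restore-request '{"Days":14,"GlacierJobParameters":{"Tier":"Bulk"}}'
done
```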

Possible scenarios based on extensions of restic:

  3. Restic could implement some sort of “lightweight” prune where it never validates or re-packs any existing data and only performs delete operations on packs where 100% of the contained blobs aren’t needed any more. Although this would result in some overhead (partially required packs), it would still free up some space compared to 1).
  4. An optimization based on the combination of 3) and 2): first, restic would “lightweight-prune” superfluous packs, then restore only the packs which are left and needed, perform a “proper” prune (incl. re-packing) and then move the data back to the Glacier storage tier.

I think 1) or 2) might be valid approaches to consider for secondary backups where you expect to never need them. 3) and 4) probably wouldn’t warrant the effort – most likely there are better alternatives for such use cases. I’d be interested to know your thoughts on how this new offer could be leveraged in combination with restic!

We had the same conversation when the Glacier storage tier (as opposed to Glacier Deep Archive) was announced, and it honestly isn’t a good fit for restic for several reasons:

  • The 180-day commitment makes pruning very difficult to do without causing early deletion fees. Simply pruning every 180 days doesn’t work because packs created one day ago could still need to be rewritten.
  • Unless parent snapshots are simply not used, restic needs to be able to fetch tree objects on-demand, which necessitates data packs being readily available.
  • All of the added complexity restic brings does not provide a benefit on a coldline storage mechanism; simple differential or incremental backups are much more effective and easier to manage, and don’t come with a long list of caveats. The advantages of restic basically disappear when using coldline storage.
  • Restic, by design, is built to work on multiple storage backends as long as they all provide a certain minimum set of features. Supporting one storage vendor’s specific features adds code complexity for the benefit of only people who use that service, and in the Glacier/GDA case the amount of complexity would be incredibly high to prevent things like early deletions, for fairly minimal gain.

Edit: To give perhaps the best illustration of how unwieldy this would be, consider the workflow when you want to restore a single file.

  1. You ask restic to restore the file /a/b/c/d from a snapshot.
  2. Restic doesn’t have the requested snapshot’s root tree object so it goes to fetch it. This fetch fails because the containing pack hasn’t been restored to S3. Restic complains about the missing pack.
  3. You go to the S3 console and restore that pack (or script it with the CLI; see the sketch below), wait 12 hours for the data to become available, then re-run the restore.
  4. Restic is now able to look at the snapshot root directory and looks up the tree object ID for /a. This is in a different pack. You jump back to step 2, and repeat this process four more times until restic has the blob for the file.

It would take ~60 hours in the average case, assuming that all of the file’s data blobs are in the same pack!
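For what it’s worth, step 3 doesn’t have to happen in the console. A rough sketch of the same manual restore with the AWS CLI (bucket and key are placeholders) could look like this, which doesn’t make the 12-hour wait per pack any less painful:

```bash
# Sketch: request a restore of the one pack restic complained about, then
# poll until the temporary copy is available.
BUCKET=my-restic-bucket          # placeholder
KEY=repo/data/aa/aaaaaaaa        # placeholder for the missing pack

aws s3api restore-object --bucket "$BUCKET" --key "$KEY" \
  --restore-request '{"Days":2,"GlacierJobParameters":{"Tier":"Standard"}}'

# The Restore field switches to ongoing-request="false" once the copy is ready.
aws s3api head-object --bucket "$BUCKET" --key "$KEY" --query Restore
```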

If you have to restore a whole snapshot, you might as well issue a restore on all of the data objects in the repository, because you’re probably going to need ~50% of them anyway and it would take too long to manually trudge through this process.

Contrast this with a differential backup, where you restore and download at most two files and you’re done.

Thanks for your detailed response, @cdhowie!

I fully agree with you in terms of restoring. In that case, a full restore is probably the easiest (and only) option.

Based on your comments it looks like we have to rule out pruning entirely, so this basically leaves us with one-off backups or, at best, append-only backups in “blind flight” as the only options with Glacier/GDA. Although the same could be achieved using traditional tools such as tar + gpg, I guess there’s still a use case for restic if you already use it with other backends and don’t want to set up an additional workflow, keys, etc.

Right, so one has to weigh the cost of implementing a different backup system against the cost of having to initiate an S3 restore out of GDA for an entire repository any time anything needs to be restored from backup.

I’m not sure Glacier Deep Archive (GDA) should be used with Restic at all. I use both, for different purposes - Restic for versioned backups, GDA for archives. I outline how I use the Infrequent Access class of storage with Restic in this issue. This is still the way I plan to use Restic with S3.

I used to have my archives on Glacier proper, as opposed to in S3 in the Glacier tier. I’ve now moved those archives to S3 and the GDA tier. I use the “aws s3 sync” command line for this, keeping a copy on my PC. GDA is great for archives, and may work with some other forms of backup software, but without significant changes I wouldn’t think it would suit dynamic block-based repositories like restic’s.
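For the archive part, the workflow is nothing more than a plain sync into the Deep Archive storage class; a minimal sketch (bucket and paths are placeholders):

```bash
# Keep a local copy and mirror it into S3 with the Deep Archive storage class.
aws s3 sync ~/archives s3://my-archive-bucket/archives \
  --storage-class DEEP_ARCHIVE
```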

AWS has announced a new storage tier called “Glacier Instant Retrieval”. It’s almost as cheap as the other Glacier tiers, but retrieval time is milliseconds, which is much better for backups like restic. In the CLI you use the storage class GLACIER_IR.

Restic doesn’t seem to support storage tiers, but adding support for this could be good. For now, using S3 lifecycle rules to transition objects in the data folder to this storage class could save some money over the S3 Standard or IA classes.
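As a hedged example of what such a lifecycle rule could look like (the bucket name, key prefix and 30-day delay are assumptions, adjust them to your repository layout):

```bash
# Transition everything under the repository's data/ prefix to
# Glacier Instant Retrieval after 30 days.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "restic-data-to-glacier-ir",
      "Status": "Enabled",
      "Filter": { "Prefix": "repo/data/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "GLACIER_IR" }
      ]
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-restic-bucket \
  --lifecycle-configuration file://lifecycle.json
```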

If not, then rclone will :slight_smile: See their docs https://rclone.org/s3/#s3-storage-class
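A rough sketch of combining the two, assuming an rclone remote named s3remote pointing at S3 (note that this sets the storage class for everything restic uploads, not just the data/ folder):

```bash
# rclone's S3 backend picks the storage class for uploads; restic just uses
# rclone as its backend.
export RCLONE_S3_STORAGE_CLASS=GLACIER_IR   # or set storage_class in rclone.conf
restic -r rclone:s3remote:my-restic-bucket/repo backup /home
```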

In some cases I back up with restic locally and then use the AWS CLI to upload with the storage tier I want, but on my web server I have restic send data directly to S3 to reduce disk usage. It’s in the second scenario that it would be useful to be able to specify storage tiers.
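A minimal sketch of that first scenario (paths and bucket are placeholders; repository password handling is omitted):

```bash
# Back up to a local repository first, then push it to S3 in the desired
# storage class with the AWS CLI.
restic -r /srv/restic-repo backup /home
aws s3 sync /srv/restic-repo s3://my-restic-bucket/repo \
  --storage-class GLACIER_IR
```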

If the egress fees for AWS Glacier Instant Retrieval are the same as for the other tiers, then this would be too expensive.

Testing a 1 TB repository would cost around $100 if I’m not wrong, compared to near-free egress with competitors.

Can someone clarify whether egress fees can be avoided on AWS?

AWS does charge a lot for data egress: $0.09 per GB, which is $90 for 1 TB. You also pay fees for the API requests, which typically don’t add up to much. My restore tests tend to be targeted at a small number of files, and I don’t expect to ever have to retrieve a lot of data from S3 as I also have the data on local disks. For me it’s a last-resort backup. If you plan to test or retrieve large amounts of data, then S3 is probably not the best option.

As a comparison, Backblaze B2 costs $0.01 per GB to retrieve, so $10 for 1 TB. They charge $0.005 per GB for storage, compared with $0.004 per GB for S3 Glacier Instant Retrieval. B2 is probably a better option for most.

I use AWS because I’m very experienced with it, I understand it well, and I don’t store all that much data in S3 that I don’t have somewhere else like a backup disk.


This looks like a good fit for restic’s /data directory for backups that are almost never restored. Retrieval is instant using regular S3 APIs but expensive.

Does restic ever read from /data other than during restore and check --read-data?

Yes, during prune when restic downloads blobs to repack them. It would also delete from /data on that occasion.


I see. That could be expensive with Glacier then.

Is there any way it could make sense to use Glacier Instant Retrieval with restic? The storage price is around 40% of S3 Infrequent Access, so it’s quite attractive, as long as the high retrieval costs and the 90-day minimum storage duration are acceptable.

There are lots of ideas about how to work around the limitations of cold storage; you can find some of them in this thread and in Discussion: How to support cold storages? · Issue #3202 · restic/restic · GitHub.

All in all, it’s still rather cold storage, at least from a pricing perspective, so the same general principles apply as for any other cold storage: use it for (additional) one-off backups which you’d only access in exceptional cases. Use regular storage for daily use.

That’s interesting about prune rewriting packs. A 30-second look at my S3 restic data folder suggests that a lot of packs are left untouched, and my AWS bill shows $0.00 for Glacier Instant Retrieval, which suggests the cost is non-zero but less than one cent.

I guess if you’re backing up content that’s mostly static, Glacier Instant Retrieval is fine. If what you’re backing up has a lot of deletes, then maybe another tier is better, or rarely/never prune the repo.

Even S3 Glacier Instant Retrieval has retrieval costs and a minimum storage duration. All of those costs are dwarfed by the S3 download charges, though, so restic could get expensive to use with S3 if it downloads a lot of data to repack it. That said, I see prune has a --max-unused switch which can be told how much unused data to tolerate, and therefore how much gets repacked, including none at all.
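For example (the repository URL is a placeholder), something like the following should stop prune from downloading and rewriting packs just to reclaim unused space; it then mostly only deletes packs that are entirely unreferenced:

```bash
# Tolerate any amount of unused space, so prune avoids repacking for that reason.
restic -r s3:s3.amazonaws.com/my-restic-bucket/repo prune --max-unused unlimited
```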

As for retrieval costs, S3 Intelligent-Tiering might even be useful, because you get the lower storage costs (after 30/90 days) but don’t pay any retrieval costs or minimum storage duration.

Under /data, restic saves pure data packfiles, but also tree packfiles which contain the metadata of all trees/directories.

Tree packfiles are accessed quite a lot, but they are cached. This, however, means that you have to download them (into the cache) if you lose your cache or if you back up from different machines to a single repository.
Also, make sure that the cache is really used, and don’t use the --no-cache option!

About prune, there is already the --repack-cacheable-only option, which prevents prune from repacking data packfiles.
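Combining that with --max-unused, a prune invocation for a repository whose data packfiles sit in a cold storage class might look roughly like this (the repository URL is a placeholder; deletes of fully unreferenced packs can still happen):

```bash
# Only repack cacheable (tree) packfiles and tolerate unused space in data packs.
restic -r s3:s3.amazonaws.com/my-restic-bucket/repo prune \
  --repack-cacheable-only --max-unused unlimited
```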

That said, I’d like to advertise rustic again. There, support for cold storage is already implemented, including multiple access to a repository, --repack-cacheable=true as the default for hot/cold repos, and the added possibility of allowing prune to repack pack files only after they reach a given age.


I am not sharing the repository between servers currently so the cache is not a problem.

Thanks for the tip about rustic. This looks really interesting, and I like what you are doing with the config file too. This might be best asked over on your GitHub discussions, but what combination of settings would you recommend for S3 Infrequent Access and/or S3 Glacier Instant Retrieval?

I use S3 as my third-tier backup, run daily. I back up to an internal disk, then mirror that to S3. I also back up to another disk that’s close by. If I have to restore, I would tend to restore most data from one of the local disks, then the nightlies from S3. If both of my local disks are destroyed, I suspect I’ll be happy to pay the S3 charges.

There are probably ways to reduce egress charges. For example, spin up an EC2 instance and use an S3 gateway endpoint to download the files. From there you get 100 GB of free VPC egress per month or 1 TB of free CloudFront egress per month, if you can work out how to get your data out that way.
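A rough sketch of that idea, assuming an EC2 instance in the same region as the bucket (same-region S3-to-EC2 transfer is typically not charged, especially via a VPC gateway endpoint); the repository URL, target path and credential handling are placeholders:

```bash
# Run the restore on the EC2 instance itself, then move the data onwards
# via whatever free egress allowance you can use.
restic -r s3:s3.amazonaws.com/my-restic-bucket/repo \
  restore latest --target /mnt/restore
```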

What would be your incremental data backup solution for Glacier?