S3 Deep Glacier, any experience with costs?

I’m considering the option of using restic with S3, but I’m afraid of the access costs.
Is it a good idea, from a cost standpoint, to use restic at all?
If anyone could share their experience and their bill for insight, that would really help me make up my mind.
I’d be using it to back up my personal pics and videos, once a month or so, just pushing more and more incremental stuff, never (hopefully) to be downloaded again, as it is my DR solution.
Approximately 300 GB, with around 2 GB of new data per month or so…
How much does restic need to GET and PUT? How much download traffic does it generate, and how much does it change the stored files?
Regards!

I’ve been using filebase.com, wasabi.com and Backblaze B2 for my repos. I have large repos (several TiB) in “append-only” mode (I’ve never run prune).
They all work well with Restic and are far simpler and cheaper when it comes to estimating costs and keeping them under control. I like Filebase and Wasabi because they don’t have egress or API costs, so I can use my repos as deduplicated storage repositories for documents, media files, etc.


My understanding is that objects in S3 Glacier and Deep Glacier classes are in general not accessible in real time. This model does not work well with restic, which expects real time access to repository contents.

I learned this the hard way a couple of years ago when I unintentionally set up a rule that automatically transitioned objects to Glacier. Once transitioning began, restic operations on the corresponding repository (understandably, in retrospect) failed completely because it could not access the transitioned objects. Possibly things have changed since then, but that was my experience.

FWIW, restic works fine with objects in S3 Standard Infrequent Access. It’s not as cheap as the Glacier classes, but it is cheaper than Standard and (the last time I checked) competitive with Backblaze B2, which I also use. I transition repo objects to SIA automatically after 60 days and although I do incur some charges associated with early access when I prune they are small, typically a few cents per month.

I’m not sure how to address your questions around cost, as the answers will depend strongly on exactly what you do and how you manage your repositories…a prune operation will generally result in much higher data usage than a backup operation, for example. That said, just to give you an idea, my backup situation is quite similar to yours in terms of size and monthly activity, and I spend about $4/month on S3. I run daily backups and a monthly forget/prune. That’s with most of my data having transitioned to SIA; it would of course be more if my data lived in Standard.

There are other factors that you may or may not consider important. One is speed. For me, S3 is much faster (maybe 5x faster) than B2. This may just be my location (the AWS data center I use is much closer to me than the B2 center), but it is what it is. Another is durability. AWS promises higher durability than B2, and depending on your level of paranoia you might care. (I should emphasize that I use both services and that I’ve never lost a bit with either; it’s just a question of how many 9’s you want to pay for.)

As @underhillian correctly pointed out, restic is not (yet) ready to (fully) work with S3 Glacier or Deep Glacier.

However, there are already setups that do work.

First, regarding your question about changes to stored files: restic never changes any file on your storage backend, as the file name equals the SHA256 of its contents. So to “modify” content, restic always has to delete the file and save a new one.
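You can see this for yourself on a repository stored on local disk; here is a minimal sketch (the repo path is just a placeholder):

    # Pick any pack file under data/ and hash it; the SHA256 should equal
    # the file name (adjust /path/to/repo to your repository).
    f=$(find /path/to/repo/data -type f | head -n 1)
    echo "$f"
    sha256sum "$f"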

Secondly, you have to know that restic saves the contents of the files you back up in the /data/ directory of your storage backend. There is also quite a bit of metadata, like the directory tree structure (also saved in /data/), an index saved in /index/, and so on. Usually the metadata contributes only a very small fraction of the size the repository needs on your storage backend. Moreover, almost all metadata is cached locally.
So if you only access your repository from a single machine and manage to

  1. only store the /data dir in S3 Deep Glacier
  2. never lose your local repository cache

basically all restic commands that don’t access the contents of files stored in your repository will work. They only need to download minimal parts and do some LISTs on your storage dirs. The PUTs obviously depend on what you are doing, especially on how much data you add to your repository.
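To restrict the transition to /data only, an S3 lifecycle rule with a prefix filter can be used. A rough sketch, assuming the repository sits at the root of a bucket called my-restic-repo and you want a short delay before transitioning (bucket name, prefix and the 10 days are all placeholders); save this as lifecycle.json:

    {
      "Rules": [
        {
          "ID": "restic-data-to-deep-archive",
          "Status": "Enabled",
          "Filter": { "Prefix": "data/" },
          "Transitions": [
            { "Days": 10, "StorageClass": "DEEP_ARCHIVE" }
          ]
        }
      ]
    }

and apply it with:

    aws s3api put-bucket-lifecycle-configuration \
      --bucket my-restic-repo \
      --lifecycle-configuration file://lifecycle.json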

Commands that work in this scenario (there is a small sketch after the list):

  • backup
  • check without --read-data
  • ls, diff, find, stats
  • forget and prune with --repack-cacheable-only or --max-unused=unlimited (not included in 0.11.0 and before, but in the latest betas)
  • snapshots, tag
  • copy if this is the destination repository
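For example, a monthly cycle in this setup could look roughly like this (repo URL, paths and retention policy are made up; AWS credentials and the repo password are expected in the environment):

    export RESTIC_REPOSITORY=s3:s3.amazonaws.com/my-restic-repo
    restic backup ~/pictures ~/videos
    restic check                                  # without --read-data
    restic forget --keep-monthly 12
    restic prune --max-unused unlimited           # needs the latest beta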

Commands that do not work in this scenario:

  • restore, mount
  • cat, dump
  • check --read-data
  • rebuild-index
  • copy if this is the source repository

If you need any of these commands, you must first restore the needed files from Deep Glacier to standard S3 in order to access them. Unfortunately, restic doesn’t tell you which ones, so you will most likely end up restoring everything to standard S3. The same applies if you lose your local cache and need access to the files required to rebuild it.
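For reference, restoring a single object from Deep Archive back to readable S3 looks roughly like this (bucket, pack ID, retention days and retrieval tier are placeholders; Deep Archive retrievals can take many hours):

    bucket="my-restic-repo"           # placeholder
    pack_id="<64-char pack ID>"       # placeholder
    aws s3api restore-object \
      --bucket "$bucket" \
      --key "data/${pack_id:0:2}/${pack_id}" \
      --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}}'
    # Check whether the temporary copy is available yet:
    aws s3api head-object --bucket "$bucket" --key "data/${pack_id:0:2}/${pack_id}"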

I plan to work on enhancing the support for cold storage like S3 Glacier, but realistically it will take at least a year before we get “acceptable” support.

With the latest beta, this is no longer true. I also suspect that if you use it, it will decrease your bill, as it needs much less traffic. I can only encourage you to play around with the new --max-unused option. Especially when set to unlimited, pruning will not need to read any data pack file from your repository. Note that there is now also a --dry-run option which allows you to test different values for the options before actually running the prune.
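A quick way to compare settings without touching the repo (the values below are just examples):

    restic prune --dry-run --max-unused 5%
    restic prune --dry-run --max-unused unlimited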

Yeah, that was pretty stupid on my side: of course I can’t do it against Deep Glacier directly; it should be IA.
I think I will perform small trials and see how it goes.
Maybe one month I perform a backup and incur 5 bucks of cost, but given the frequency at which I sync to the cloud, I won’t be doing so every single day, perhaps once a month or so…
What I was afraid of was those api calls.
With S3 you pay for PUT and GET calls, for the stored volume (which is the lowest and most controllable variable), and for the download volume.
Seeing how restic creates a lot of small files and a lot of folders… is there a way to control this and maybe make restic handle its files as fewer but bigger files? I ask this from my standpoint, where I don’t really know the inner workings of restic; don’t take it as criticism, I’m sure there’s a reason for it to be this way.

Thanks for the suggestion.

If I understand this and the --repack-cacheable-only options correctly (I may not), they essentially allow me to tell restic to reduce data traffic at the cost of allowing the quantity of unused data in the repo to grow.

I’m not sure I see the long-term cost benefit of a non-zero --max-unused value…I can see that prune traffic will be reduced temporarily until the repo reaches the --max-unused limit, but after that restic prune will have to use just as much traffic as before in order to maintain the repo within the --max-unused limit. At that point, I’ll be paying for roughly the same level of traffic as before but for unneeded data storage as well. What am I missing?

This is under development, see https://github.com/restic/restic/pull/2750. But I cannot say if or when it will be included in an official release.

  1. There are lots of other improvements in the latest betas - for example, prune no longer needs to access each data pack file (from /data) twice - which improve speed and reduce traffic a lot, even when using --max-unused=0.

  2. You specify an upper limit. prune will always remove files under /data that are completely unused, independently of your choice of this upper limit. You can even specify a % value, so the limit effectively grows with your repo size. The default value is 5%.

  3. prune uses this upper limit to optimize which data pack files are downloaded. Just imagine you have a file under /data with a size of 4 MB where only 200 bytes are unused. prune from 0.11.0 will download and repack this. In the beta with --max-unused > 0 there is a pretty high chance that this data pack file will never be touched.
    In this way, I think the new algorithm reduces traffic so much that the tradeoff of using a bit more storage can save money in the long run. However, the “sweet spot” of course depends on the individual costs of your storage backend, so you can tune the parameter. The default value of 5% seems pretty reasonable to me, but it might be interesting for other users with the same backend to find out which value is optimal…

See also https://github.com/restic/restic/pull/2718 and the references therein.

@beer4duke did a proof of concept to list the required packs with a --dry-run switch. Obviously the UX isn’t perfect, but it was easy enough for me to use on large tape storage with restic. Just the list of packs alone was enough to work with as a two-step process.

The fork is beer4duke/restic on GitHub. Internally we have an evolution of this that we are using, and a colleague has said he will push it to GitHub soon.

I work with a lot of researchers and educators that want a simple to use but capable backup product, and they’re happy to use cold storage knowing there’s a delay for retrieval. Restic is well liked, and adding the capability of using storage classes and other cold storage methods would be a boon.

It would be great if your PR would not only print out the pack files that are needed for the restore, but also return them from the restorer codebase. We might then get a more-or-less uniform way for the restic commands to deal with such a case. The actions could be:

  1. return the list of needed pack files in JSON format
  2. try to access all of those pack files without error checking (for some cold storages this is enough to do the warm-up)
  3. run user-defined commands for all these pack files directly from restic (I don’t know if we really need this, as it can also be scripted based on 1.)
  4. do the warm-up directly for specific backends (S3 Glacier?)

See

where 1. and 2. are implemented for the prune command. The warm-up code in cmd/restic/warmup.go from this PR could also be used to implement 2. for the restore command.
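Until something like that lands, the list from 1. can already be scripted by hand. A minimal sketch of a warm-up loop (bucket name, retrieval settings and packs.txt are all placeholders; packs.txt is assumed to contain one pack ID per line, e.g. taken from the dry-run output):

    bucket="my-restic-repo"   # placeholder
    while read -r id; do
      aws s3api restore-object \
        --bucket "$bucket" \
        --key "data/${id:0:2}/${id}" \
        --restore-request '{"Days": 3, "GlacierJobParameters": {"Tier": "Bulk"}}'
    done < packs.txt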

That’s my point: seeing how cold storage is just a “last resort” option for DR and should be used only in case of total disaster (I compare it to having my house burn down completely), it’s unlikely that you will ever have to resort to this kind of storage.
From the 3-2-1 rule of backups, this would be my “1 backup offsite” part.

I’m using Backblaze B2 storage with restic. There are 3 separate repos totaling ~5 TB. I also run prune/forget once a month. The cost is 20-30 EUR per month.

Wasn’t B2 60/year? Maybe I’m wrong

Backblaze’s backup product is different from B2, which is a cloud storage service similar to S3. Pricing is $0.005 per GB-month of storage and $0.01 per GB of egress traffic, no minimums and no commitment.
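As a back-of-the-envelope check against the ~5 TB / 20-30 EUR figure above (storage only, ignoring egress and transaction fees):

    echo "5000 * 0.005" | bc    # roughly $25/month for ~5 TB at $0.005 per GB-month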

Yes, 60/year is their “set it up and forget it” backup solution, which has nothing to do with restic. You just install their software, which backs things up. Backblaze B2 is their cloud storage. But maybe Backblaze’s backup solution is exactly what you need, and you don’t have to use restic at all.

Until Restic explicitly supports Glacier / Deep Archive storage, I suggest not using it. I have my data in the IA class, with indexes and such in the Standard class, since with the IA class you pay per access. I’m not sure whether the indexes and such need to be in the S3 Standard class, but they’re small, so it doesn’t matter.

AWS storage isn’t especially cheap; you’d probably be better off with B2. I use S3 because I’m an AWS specialist, but if I wanted to store a lot of data with Restic I’d still go with B2. I do keep data in the S3 Deep Archive class, but not with Restic: those are effectively incremental backups, updated every six months, while frequently modified data is backed up daily to S3 using Restic and other methods.

Thanks for the comments guys.
I finally went with B2. My whole restic repo is already on that service.
What I actually do is back up locally with restic to a repo on a Raspberry Pi via SFTP, all inside my home.
Then I use rclone to upload the repo to B2.
I tested mounting that repo, first with rclone and then with restic, and was able to restore a single file, so everything looks A-OK.
Definitely cheaper than S3, now that I think about it, and the best part is that there are no hidden costs: WYSIWYG.
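In case it helps anyone copying this setup, the two steps look roughly like this (host, paths and the rclone remote/bucket names are placeholders):

    # 1) Back up over SFTP to the repo on the Raspberry Pi:
    restic -r sftp:pi@raspberrypi.local:/srv/restic-repo backup ~/pictures ~/videos

    # 2) On the Pi, mirror that repo to B2 with rclone:
    rclone sync /srv/restic-repo b2:my-bucket/restic-repo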


I’ve been using GDA (Glacier Deep Archive) with restic for almost a year. For my workload, this approach dramatically reduced my storage costs, but I recognize that a restore will be more expensive and some workloads might not achieve all the storage cost savings I did.

This is my approach:

  • I use restic to back up to a locally-hosted repository every night.
  • After the backup is complete, I use aws s3 sync to sync the files to an S3 bucket. I use the --delete flag to remove files from the bucket if they disappear from my local repo. This is very efficient from a bandwidth and cost perspective (a sketch of these two steps follows the list).
  • The s3 bucket has lifecycle rules that move files in the data/ directory to GDA after 10 days. This delay reduces GDA accesses and deletions for files that live in my repo only briefly. (The migration does not impact future sync operations, nor do those operations trigger restores of GDA data.)
  • I prune and check my local repo weekly. These actions are automatically synced to the cloud.
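A minimal sketch of the nightly job mentioned above (paths and bucket name are placeholders):

    # Back up to the local repository:
    restic -r /srv/restic-repo backup /home

    # Mirror the local repo to S3, deleting objects restic has removed locally:
    aws s3 sync /srv/restic-repo s3://my-restic-bucket --delete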

Pros of this approach:

  • Very low cost cloud storage
  • Very reliable backup process
  • Very high data durability (w/copies locally and in the cloud)
  • Prunes and restores are fast and cheap because they are performed against a local repo

Cons:

  • Requires local storage devices large enough to hold the repository
  • If the local repository is lost, restoring from GDA will be expensive
  • If the local repository is corrupted (e.g. ransomware, restic bugs), those corruptions will be synced to the cloud.