S3 Deep Glacier, any experience with costs?

I’m considering the option of using restic with S3, but I’m afraid of the access costs.
Is it a good idea, from a cost standpoint, to use restic at all?
If anyone could share their experience and their bill for insight, that would really help me make up my mind.
I’d be using it to back up my personal pics and videos, once a month or so, just pushing more and more incremental stuff, never (hopefully) to be downloaded again, as it is my DR solution.
Approximately 300 GB, with around 2 GB of new data per month or so…
How much does restic need to GET and PUT? How much download traffic does it generate, and how much does it change the stored files?
Regards!

I’ve been using filebase.com, wasabi.com and Backblaze B2 for my repos. I have large repos (several TiB) in “append-only” mode (I’ve never run prune).
They all work well with Restic and are far simpler and cheaper when it comes to estimating costs and keeping them under control. I like Filebase and Wasabi because they don’t have egress or API costs, so I can use my repos as deduplicated storage repositories for documents, media files, etc.


My understanding is that objects in S3 Glacier and Deep Glacier classes are in general not accessible in real time. This model does not work well with restic, which expects real time access to repository contents.

I learned this the hard way a couple of years ago when I unintentionally set up a rule that automatically transitioned objects to Glacier. Once transitioning began, restic operations on the corresponding repository (understandably, in retrospect) failed completely because it could not access the transitioned objects. Possibly things have changed since then, but that was my experience.

FWIW, restic works fine with objects in S3 Standard Infrequent Access. It’s not as cheap as the Glacier classes, but it is cheaper than Standard and (the last time I checked) competitive with Backblaze B2, which I also use. I transition repo objects to SIA automatically after 60 days and although I do incur some charges associated with early access when I prune they are small, typically a few cents per month.

I’m not sure how to address your questions around cost, as the answers will depend strongly on exactly what you do and how you manage your repositories…a prune operation will generally result in much higher data usage than a backup operation, for example. That said, just to give you an idea, my backup situation is quite similar to yours in terms of size and monthly activity, and I spend about $4/month on S3. I run daily backups and a monthly forget/prune. That’s with most of my data having transitioned to SIA; it would of course be more if my data lived in Standard.

There are other factors that you may or may not consider important. One is speed. For me, S3 is much faster (maybe 5x faster) than B2. This may just be my location (the AWS data center I use is much closer to me than the B2 center), but it is what it is. Another is durability. AWS promises higher durability than B2, and depending on your level of paranoia you might care. (I should emphasize that I use both services and that I’ve never lost a bit with either; it’s just a question of how many 9’s you want to pay for.)

As @underhillian correctly pointed out, restic is not (yet) ready to (fully) work with S3 Glacier or Deep Glacier.

However, there are already setups that do work.

First, regarding your question about changes to stored files: restic never changes any file on your storage backend, as the file name equals the SHA256 of its contents. So to “modify” content, restic always has to delete the file and save a new one.
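You can see this for yourself on a repository stored on local disk; here is a minimal sketch (the repo path is just a placeholder):

    # Pick any pack file under data/ and hash it; the SHA256 should equal
    # the file name (adjust /path/to/repo to your repository).
    f=$(find /path/to/repo/data -type f | head -n 1)
    echo "$f"
    sha256sum "$f"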

Secondly, you have to know that restic saves the contents of the files you back up in the /data/ directory of your storage backend. There is also quite a bit of metadata, like the directory tree structure (also saved in /data/), an index saved in /index/, and so on. Usually the metadata contributes only a very small fraction of the size the repository needs on your storage backend. Moreover, almost all metadata is cached locally.
So if you only access your repository from a single machine and manage to

  1. only store the /data dir in S3 Deep Glacier
  2. never lose your local repository cache

basically all restic commands that don’t access the contents of files stored in your repository will work. They only need to download minimal parts and do some LISTs on your storage dirs. The PUTs obviously depend on what you are doing, especially on how much data you add to your repository.
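To restrict the transition to /data only, an S3 lifecycle rule with a prefix filter can be used. A rough sketch, assuming the repository sits at the root of a bucket called my-restic-repo and you want a short delay before transitioning (bucket name, prefix and the 10 days are all placeholders); save this as lifecycle.json:

    {
      "Rules": [
        {
          "ID": "restic-data-to-deep-archive",
          "Status": "Enabled",
          "Filter": { "Prefix": "data/" },
          "Transitions": [
            { "Days": 10, "StorageClass": "DEEP_ARCHIVE" }
          ]
        }
      ]
    }

and apply it with:

    aws s3api put-bucket-lifecycle-configuration \
      --bucket my-restic-repo \
      --lifecycle-configuration file://lifecycle.json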

Commands that work in this scenario (there is a small sketch after the list):

  • backup
  • check without --read-data
  • ls, diff, find, stats
  • forget and prune with --repack-cacheable-only or --max-unused=unlimited (not included in 0.11.0 and before, but in the latest betas)
  • snapshots, tag
  • copy if this is the destination repository
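For example, a monthly cycle in this setup could look roughly like this (repo URL, paths and retention policy are made up; AWS credentials and the repo password are expected in the environment):

    export RESTIC_REPOSITORY=s3:s3.amazonaws.com/my-restic-repo
    restic backup ~/pictures ~/videos
    restic check                                  # without --read-data
    restic forget --keep-monthly 12
    restic prune --max-unused unlimited           # needs the latest beta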

Commands that do not work in this scenario:

  • restore, mount
  • cat, dump
  • check --read-data
  • rebuild-index
  • copy if this is the source repository

If you need any of these commands, you must first restore the needed files from Deep Glacier to standard S3 in order to access them. Unfortunately, restic doesn’t tell you which ones, so you will most likely end up restoring everything to standard S3. The same applies if you lose your local cache and need access to the files required to rebuild it.
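For reference, restoring a single object from Deep Archive back to readable S3 looks roughly like this (bucket, pack ID, retention days and retrieval tier are placeholders; Deep Archive retrievals can take many hours):

    bucket="my-restic-repo"           # placeholder
    pack_id="<64-char pack ID>"       # placeholder
    aws s3api restore-object \
      --bucket "$bucket" \
      --key "data/${pack_id:0:2}/${pack_id}" \
      --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}}'
    # Check whether the temporary copy is available yet:
    aws s3api head-object --bucket "$bucket" --key "data/${pack_id:0:2}/${pack_id}"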

I plan to work on enhancing the support for cold storage like S3 Glacier, but realistically it will take at least a year before we get “acceptable” support.

With the latest beta, this is no longer true. I also suspect that if you use it, it will decrease your bill, as it needs much less traffic. I can only encourage you to play around with the new --max-unused option. Especially when set to unlimited, pruning will not need to read any data pack file from your repository. Note that there is now also a --dry-run option which allows you to test different values for the options before actually running the prune.
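A quick way to compare settings without touching the repo (the values below are just examples):

    restic prune --dry-run --max-unused 5%
    restic prune --dry-run --max-unused unlimited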

Yeah, that was pretty stupid on my side: of course I can’t do it against Deep Glacier directly; it should be IA.
I think I will perform small trials and see how it goes.
Maybe one month I perform a backup and incur 5 bucks of cost, but given the frequency at which I sync to the cloud, I won’t be doing so every single day, perhaps once a month or so…
What I was afraid of was those api calls.
With S3 you pay for PUT and GET calls, for the stored volume (which is the lowest and most controllable variable), and for the download volume.
Seeing how restic creates a lot of small files and a lot of folders… is there a way to control this and maybe make restic handle its files as fewer but bigger files? I ask this from my standpoint, where I don’t really know the inner workings of restic; don’t take it as criticism, I’m sure there’s a reason for it to be this way.

Thanks for the suggestion.

If I understand this and the --repack-cacheable-only options correctly (I may not), they essentially allow me to tell restic to reduce data traffic at the cost of allowing the quantity of unused data in the repo to grow.

I’m not sure I see the long-term cost benefit of a non-zero --max-unused value…I can see that prune traffic will be reduced temporarily until the repo reaches the --max-unused limit, but after that restic prune will have to use just as much traffic as before in order to maintain the repo within the --max-unused limit. At that point, I’ll be paying for roughly the same level of traffic as before but for unneeded data storage as well. What am I missing?

This is under development, see https://github.com/restic/restic/pull/2750. But I cannot say if or when it will be included in an official release.

  1. There are lots of other improvements in the latest betas - for example, prune no longer needs to access each data pack file (from /data) twice - which improve speed and reduce traffic a lot, even when using --max-unused=0.

  2. You specify an upper limit. prune will always remove files under /data that are completely unused, independently of your choice of this upper limit. You can even specify a % value, so the limit effectively grows with your repo size. The default value is 5%.

  3. prune uses this upper limit to optimize which data pack files are downloaded. Just imagine you have a file under /data with a size of 4 MB where only 200 bytes are unused. prune from 0.11.0 will download and repack this. In the beta with --max-unused > 0 there is a pretty high chance that this data pack file will never be touched.
    In this way, I think the new algorithm reduces traffic so much that the tradeoff of using a bit more storage can save money in the long run. However, the “sweet spot” of course depends on the individual costs of your storage backend, so you can tune the parameter. The default value of 5% seems pretty reasonable to me, but it might be interesting for other users with the same backend to find out which value is optimal…

See also https://github.com/restic/restic/pull/2718 and the references therein.

@beer4duke did a proof of concept to list the required packs with a --dry-run switch. Obviously the UX isn’t perfect, but it was easy enough for me to use on large tape storage with restic. Just the list of packs alone was enough to work with as a two-step process.

The fork is beer4duke/restic on GitHub. Internally we have an evolution of this that we are using, and a colleague has said he will push it to GitHub soon.

I work with a lot of researchers and educators that want a simple to use but capable backup product, and they’re happy to use cold storage knowing there’s a delay for retrieval. Restic is well liked, and adding the capability of using storage classes and other cold storage methods would be a boon.

It would be great if your PR would not only print out the pack files that are needed for the restore, but also return them from the restorer codebase. We might then get a more-or-less uniform way for the restic commands to deal with such a case. The actions could be:

  1. return the list of needed pack files in JSON format
  2. try to access all of those pack files without error checking (for some cold storages this is enough to do the warm-up)
  3. run user-defined commands for all these pack files directly from restic (I don’t know if we really need this, as it can also be scripted based on 1.)
  4. do the warm-up directly for specific backends (S3 Glacier?)

See

where 1. and 2. are implemented for the prune command. The warm-up code in cmd/restic/warmup.go from this PR could also be used to implement 2. for the restore command.
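Until something like that lands, the list from 1. can already be scripted by hand. A minimal sketch of a warm-up loop (bucket name, retrieval settings and packs.txt are all placeholders; packs.txt is assumed to contain one pack ID per line, e.g. taken from the dry-run output):

    bucket="my-restic-repo"   # placeholder
    while read -r id; do
      aws s3api restore-object \
        --bucket "$bucket" \
        --key "data/${id:0:2}/${id}" \
        --restore-request '{"Days": 3, "GlacierJobParameters": {"Tier": "Bulk"}}'
    done < packs.txt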

That’s my point: seeing how cold storage is just a “last resort” option for DR and should be used only in case of total disaster (I compare it to having my house burn down completely), it’s unlikely that you will ever have to resort to this kind of storage.
From the 3-2-1 rule of backups, this would be my “1 backup offsite” part.

I’m using Backblaze B2 storage with restic. There are 3 separate repos totaling ~5 TB. I also run prune/forget once a month. The cost is 20-30 EUR per month.

Wasn’t B2 60/year? Maybe I’m wrong

Backblaze’s backup product is different from B2, which is a cloud storage service similar to S3. Pricing is $0.005 per GB-month of storage and $0.01 per GB of egress traffic, no minimums and no commitment.
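As a back-of-the-envelope check against the ~5 TB / 20-30 EUR figure above (storage only, ignoring egress and transaction fees):

    echo "5000 * 0.005" | bc    # roughly $25/month for ~5 TB at $0.005 per GB-month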

Yes, 60/year is their “set it up and forget it” backup solution, which has nothing to do with restic. You just install their software, which backs things up. Backblaze B2 is their cloud storage. But maybe Backblaze’s backup solution is exactly what you need, and you don’t have to use restic at all.

Until Restic explicitly supports Glacier / Deep Archive storage, I suggest not using it. I have my data in the IA class, with indexes and such in the Standard class, since with the IA class you pay per access. I’m not sure whether the indexes and such need to be in the S3 Standard class, but they’re small, so it doesn’t matter.

AWS storage isn’t especially cheap; you’d probably be better off with B2. I use S3 because I’m an AWS specialist, but if I wanted to store a lot of data with Restic I’d still go with B2. I do keep data in the S3 Deep Archive class, but not with Restic: those are effectively incremental backups, updated every six months, while frequently modified data is backed up daily to S3 using Restic and other methods.

Thanks for the comments guys.
I finally went with B2. My whole restic repo is already on that service.
What I actually do is back up locally with restic to a repo on a Raspberry Pi via SFTP, all inside my home.
Then I use rclone to upload the repo to B2.
I tested mounting that repo, first with rclone and then with restic, and was able to restore a single file, so everything looks A-OK.
Definitely cheaper than S3, now that I think about it, and the best part is that there are no hidden costs: WYSIWYG.
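In case it helps anyone copying this setup, the two steps look roughly like this (host, paths and the rclone remote/bucket names are placeholders):

    # 1) Back up over SFTP to the repo on the Raspberry Pi:
    restic -r sftp:pi@raspberrypi.local:/srv/restic-repo backup ~/pictures ~/videos

    # 2) On the Pi, mirror that repo to B2 with rclone:
    rclone sync /srv/restic-repo b2:my-bucket/restic-repo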


I’ve been using GDA (Glacier Deep Archive) with restic for almost a year. For my workload, this approach dramatically reduced my storage costs, but I recognize that a restore will be more expensive and some workloads might not achieve all the storage cost savings I did.

This is my approach:

  • I use restic to back up to a locally-hosted repository every night.
  • After the backup is complete, I use aws s3 sync to sync the files to an S3 bucket. I use the --delete flag to remove files from the bucket if they disappear from my local repo. This is very efficient from a bandwidth and cost perspective (a sketch of these two steps follows the list).
  • The s3 bucket has lifecycle rules that move files in the data/ directory to GDA after 10 days. This delay reduces GDA accesses and deletions for files that live in my repo only briefly. (The migration does not impact future sync operations, nor do those operations trigger restores of GDA data.)
  • I prune and check my local repo weekly. These actions are automatically synced to the cloud.
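A minimal sketch of the nightly job mentioned above (paths and bucket name are placeholders):

    # Back up to the local repository:
    restic -r /srv/restic-repo backup /home

    # Mirror the local repo to S3, deleting objects restic has removed locally:
    aws s3 sync /srv/restic-repo s3://my-restic-bucket --delete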

Pros of this approach:

  • Very low cost cloud storage
  • Very reliable backup process
  • Very high data durability (w/copies locally and in the cloud)
  • Prunes and restores are fast and cheap because they are performed against a local repo

Cons:

  • Requires local storage devices large enough to hold the repository
  • If the local repository is lost, restoring from GDA will be expensive
  • If the local repository is corrupted (e.g. ransomware, restic bugs), those corruptions will be synced to the cloud.