S3 and Glacier for long-term archives

I’m using restic to back up our log DB (graylog/elasticsearch/mongo). I would like to have 3 months of daily backups available on S3, and two years of daily backups available via Glacier.

I can create a policy on S3 that will automatically move files older than 3 months to Glacier. All Glacier files remain available via the S3 interface, so restic won’t know the difference; the ones in Glacier will just be slower to access and have different pricing.
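For reference, this is the kind of lifecycle rule I mean, sketched with boto3 (the bucket name and the 90-day threshold are just placeholders for my setup, not anything restic-specific):

```python
import boto3

s3 = boto3.client("s3")

# Transition every object older than 90 days to Glacier.
# "my-restic-logs" is a placeholder bucket name.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-restic-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-packs",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```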

I don’t know enough about restic’s repo file format to know whether it would make sense to just set this policy on S3 and let restic do its thing. I’m guessing it would access older files during pruning and similar operations. So assuming I can’t just set the policy and forget about it, I’m thinking I need to do something like the following (a rough sketch follows the list):

  1. create a new bucket/repo
  2. log to it for 3 months
  3. close it down & open a new repo for the next quarter
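Roughly what I have in mind for the rotation, as an untested sketch (the bucket/prefix naming, the backup path, and the idea of driving restic from a cron’d Python script are all just assumptions on my part; the restic commands themselves are the standard init/backup):

```python
import datetime
import os
import subprocess

def quarterly_repo(today: datetime.date) -> str:
    """Build an S3 repo URL that changes every quarter, e.g. .../logs-2018q1.
    Bucket and prefix names are placeholders."""
    quarter = (today.month - 1) // 3 + 1
    return f"s3:s3.amazonaws.com/my-restic-logs/logs-{today.year}q{quarter}"

# AWS credentials and RESTIC_PASSWORD are expected to already be in the environment.
env = dict(os.environ, RESTIC_REPOSITORY=quarterly_repo(datetime.date.today()))

# "restic init" exits non-zero if the repo already exists, which is fine for this sketch.
subprocess.run(["restic", "init"], env=env)
subprocess.run(["restic", "backup", "/var/lib/graylog-backup"], env=env, check=True)
```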

Or would it make more sense to just copy the repo via the S3 tools, run a final prune on that copy and archive it, while always keeping the main repo in place?
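That is, something along these lines to clone the repo objects before the final prune (untested sketch; the bucket names are placeholders, and since pack files are small a plain server-side copy_object should be enough):

```python
import boto3

s3 = boto3.client("s3")

def copy_repo(src_bucket: str, dst_bucket: str) -> None:
    """Server-side copy of every object from one bucket to another.
    Bucket names are placeholders."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket):
        for obj in page.get("Contents", []):
            s3.copy_object(
                CopySource={"Bucket": src_bucket, "Key": obj["Key"]},
                Bucket=dst_bucket,
                Key=obj["Key"],
            )

copy_repo("my-restic-logs", "my-restic-logs-2018q1-archive")
```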

When I need to restore the logs to a particular day, I just choose the appropriate repo and use restic restore on a new volume before spinning up a temporary graylog instance.
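The restore step itself would be nothing fancy, e.g. (repo URL, snapshot ID and target path below are all placeholders):

```python
import os
import subprocess

# Restore the snapshot closest to the day I need onto a fresh volume.
env = dict(os.environ, RESTIC_REPOSITORY="s3:s3.amazonaws.com/my-restic-logs-2017q2-archive")
subprocess.run(
    ["restic", "restore", "1a2b3c4d", "--target", "/mnt/restore"],
    env=env,
    check=True,
)
```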

Thoughts much appreciated,

Monty

Why not just run two backups: one to S3 and one to Glacier? Then you can do as you wish with each of them.

The thing is that Glacier here is S3. You activate Glacier by setting a lifecycle policy that moves files from S3 to Glacier once they reach a certain age, but everything is still accessed through the S3 interface.

But I could have two backups simultaneously to S3, with one being the main permanent one and the second being the archival one. And at the end of the quarter I would close the archival one, prune it properly, and open a new archive for the next quarter.

I’m failing to see what you’re trying to accomplish with this setup. Either way, since every snapshot can potentially share pieces of every blob in the repository, there’s not much point in moving three-month-old files to Glacier. You’d need very specific and intricate knowledge of what’s stored in the snapshots to even begin to make such guesses :slight_smile:

I like the idea of separating the backups, so that in the unlikely event something goes wrong in one of them, you still have the other one intact.

We haven’t had any reports from people trying to do what you’ve got planned, but as far as I can tell it might work with restic (>= 0.8.0), which has a local cache of important files so that they are not accessed via S3. It will just take a very long time to access any files, since the latency of requesting a file from Glacier is between 1 and 5 minutes.

It’d be interesting to try this for a while (maybe with the transition configured at a week) and report back whether it’s usable at all. With the local cache it might work… and maybe it’s even usable in practice.

Interesting. Is there any way to estimate how many files restic would create? It may be that I’d want to first access every file in that particular archive to prime the retrieval of them all, wait a while, and then run restic. Or even copy them all to a local volume and just point restic there once everything has been unarchived?
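For the priming step I was picturing a bulk restore request over every key in the archive bucket, something like this (sketch only; the bucket name, retention days and retrieval tier are just guesses on my part):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def prime_retrieval(bucket: str, days: int = 7) -> None:
    """Ask S3 to pull every Glacier-archived object back into S3 for `days` days.
    Bucket name and tier are placeholders."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            try:
                s3.restore_object(
                    Bucket=bucket,
                    Key=obj["Key"],
                    RestoreRequest={
                        "Days": days,
                        "GlacierJobParameters": {"Tier": "Standard"},
                    },
                )
            except ClientError:
                pass  # object not in Glacier, or a restore is already in progress

prime_retrieval("my-restic-logs-2017q2-archive")
```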

The pack file size is set to 4-16 MiB, so you can calculate a rough number of files from the repository size.
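For example (made-up numbers): a 500 GiB repository at an average pack size of ~8 MiB works out to roughly 500 × 1024 / 8 ≈ 64,000 pack files.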

That’s probably the better approach.

Hmm, I guess by copying to a local volume I would have to copy everything, whereas if I’m asking for a specific point in time (say a year ago), restic would know which files are needed and wouldn’t have to access the newer ones.

In our case, having to wait a day or two for access to such an old snapshot is acceptable. I’ll have to do some tests and calculations to make sure I’m not off by orders of magnitude.
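If I do end up going the local-volume route, the copy itself is simple enough once everything has been unarchived, e.g. (bucket name and target directory are placeholders; the idea is just to preserve the key layout so restic can treat it as a plain local repository):

```python
import os
import boto3

s3 = boto3.client("s3")

def download_repo(bucket: str, target_dir: str) -> None:
    """Download every object of the (already unarchived) repo into a local
    directory, keeping the key layout. Bucket and target_dir are placeholders."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            dest = os.path.join(target_dir, obj["Key"])
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(bucket, obj["Key"], dest)

download_repo("my-restic-logs-2017q2-archive", "/mnt/restore-repo")
```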