Caching in restic - large retrieve times on some cloud storages

alexweiss · November 29, 2019, 8:47am

Hi there,

I’m just starting to try out restic as it seems like a perfect solution for my personal backups.

However, I have an issue with my cloud storage and wonder why the caching in restic is not solving the problem.

I’m using a storage which is cheap but has a really bad SLA on retrieve times. In my case it’s OVH Cloud Archive, but it seems that AWS S3 Glacier has similar issues. They guarantee that you get back your objects, but you may have to wait up to 12 hours to get a single object.

Now, about caching:
It seems to me that restic caches snapshot files, index files and data files corresponding to metadata. This is great and speeds up the usual backup quite a lot. However, the config file and the key files are not cached but always retrieved from the storage. Of course we are not talking about big files and with usual storage backends this shouldn’t be a performance issue. However, with OVH Cloud Archive the backup regularly times out when the config file and the used key file have large retrieve times…

So, is there a reason why the config-file and the key files are excluded from the cache?
Or the other way round: Is there some reason which prevents adding config-file and key-files to the cache?

Cheers,
Alex

ifedorenko · November 29, 2019, 1:32pm

Restic assumes the same repository can be accessed by multiple systems concurrently and does not cache the config and key files to avoid data corruption. For example, if repository config changes, using cached config file during backup may result in invalid/unusable files written to the repository. And lock files cannot be cached at all.

rawtaz · November 29, 2019, 2:01pm

@alexweiss Why do you not keep your backups on a storage that doesn’t have a latency of 12 hours? Is it really that important to save a few bucks, or what’s the point of the current approach?

It would be cool (but IMO hardly a need to have) if restic could store config and locks in one place and the actual data and other things in another place

alexweiss · November 29, 2019, 4:57pm

Hi,
thank you very much for your answers!
@ifedorenko Thanks for the explanation!
However, I still do not see why keys are not cached as the filename is the SHA256 hash and therefore changed keys get different filenames. Or is the trouble that you can’t remove keys if they are still cached somewhere?
According to Repo design I don’t see how the config file would change for an existing repo. ATM only the chunker polynomial is saved and changing this would kill the deduplication…?!?

@rawtaz The latency is the trade-off for cheap storage. We are talking about a factor of about 5 for storage costs. And the cost depends on the size. So the saving depends on the size and might be some bucks for a small size and even quite some bucks for a bigger size.
In my case I would like to save the money, and the trade-off of having a 12 hour latency for restoring would be ok. However, having a 12 hour latency for backing up is not ok…

Cheers, Alex

764287 · November 29, 2019, 5:55pm

OVH Cloud Archive charges for incoming and outgoing traffic. Depending on your use case and the amounts of data this can add up really quickly.

IMHO OVH Cloud Archive is suitable for a safety copy of a restic repository but not for a primary repository which is accessed frequently. Some of the reasons why such a storage type is not suitable for restic have been discussed in this thread about S3 Glacier Deep Archive.

ifedorenko · November 30, 2019, 2:52am

I don’t understand enough about restic key usage to comment why they are not cached, but config file contains version attribute, which I believe is meant for forward compatibility should repository format change.

David · December 3, 2019, 3:16pm

@alexweiss Here’s my approach to using Amazon’s ultra-low-cost Amazon Glacier Deep Archive while maintaining reasonable performance in most situations.

I have a nightly backup process that performs a backup to low-cost local storage nightly, and performs a weekly prune/forget. Every day after that backup, my backup process then performs an aws s3 sync to synchronize my local repository to AWS S3. This is a delta-based synchronization, so it runs very quickly, finding and copying changed files only.

My S3 bucket has lifecycle rules defined on it, so that any files in the data/ directory are migrated to AWS Glacier Deep Archive after seven days. (I don’t bother migrating the index, keys or snapshots to Glacier… I just keep them in S3.)
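A lifecycle rule like the one described can be written down as a small configuration document. The following is only a sketch in Python that builds such a document; the rule name is made up for illustration, and the prefix and the seven days are taken from the description above.

```python
import json


def deep_archive_rule(prefix: str = "data/", days: int = 7) -> dict:
    """Build an S3 lifecycle configuration that moves objects under
    `prefix` to Glacier Deep Archive `days` days after creation."""
    return {
        "Rules": [
            {
                "ID": "restic-data-to-deep-archive",  # hypothetical rule name
                "Status": "Enabled",
                # Only pack files under data/ are transitioned; index, keys
                # and snapshots stay in the standard storage class.
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": days, "StorageClass": "DEEP_ARCHIVE"}
                ],
            }
        ]
    }


config = deep_archive_rule()
print(json.dumps(config, indent=2))
```

The printed JSON could, for example, be saved to a file and applied to a bucket with `aws s3api put-bucket-lifecycle-configuration`; the bucket name and file location depend on your setup.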

This has been very effective for me: Costs are far lower than my previous provider (wasabi.com), and restores are performed from local storage so I get great performance. I expect that I will never need to retrieve from Glacier unless I experience a catastrophic failure onsite (of both my primary storage and my backup storage).

alexweiss · December 3, 2019, 7:54pm

@David Thanks for sharing your solution!
Actually the idea of using S3 lifecycle rules is very charming - in your case you could even use S3 for normal backup, as all regularly used files are either cached by restic or stay in normal S3 storage. I see the trouble with prune, but maybe pruning is not first priority for real low-cost storage…

However, I don’t like the idea of using a local repository and syncing it to a remote storage. In my opinion this is kind of equivalent to using a cache and caching everything. I’d prefer the caching option because:

ideally you can fine-tune which parts you really need cached (and hence you can save local storage if you need to)
the cache is a (partial) copy of the (remote) repository which is written independently and after the (remote) repository. That guarantees that your (remote) repository stays in good shape in some weird situations. I already had the situation that a broken USB hub destroyed my backup data but didn’t prevent the broken backup from syncing to a remote storage. Ouch!

alexweiss · December 3, 2019, 8:24pm

About the caching implementation: I just dug a little bit into the code and did some debugging and finally found some answers to my original questions:

The cache is organized by the repo ID
Hence the key files and config files can’t be cached as they are needed to determine the repo ID
Standard operation for e.g. lock files, key files, index files, snapshot files is to get a list of all files and then access the files by their SHA256 ID. The listing is done in the real backend (as you never know if the cache is complete), and then all or only the needed files are read. The actual reading is done either from the cache or (if the file is not present or its type is not cached) from the backend. BTW: most ultra-cheap storages (S3 Glacier, OVH Cloud Archive) don’t have any “latency” issue with listing…
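The read path described above can be sketched roughly as follows. This is not restic's code: the class, the method names and the set of cached types are assumptions for illustration, and the "backend" is just a dictionary.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (file type, file ID)

# Assumption for this sketch: only these file types are kept in the cache.
CACHED_TYPES = {"snapshot", "index"}


class ReadThroughCache:
    """Toy model: listing always hits the backend, reading prefers the cache."""

    def __init__(self, backend: Dict[Key, bytes]) -> None:
        self.backend = backend
        self.cache: Dict[Key, bytes] = {}
        self.backend_reads = 0  # counts slow reads from the backend

    def list_ids(self, file_type: str) -> List[str]:
        # The cache may be incomplete, so the list comes from the backend.
        return sorted(fid for (ftype, fid) in self.backend if ftype == file_type)

    def load(self, file_type: str, file_id: str) -> bytes:
        key = (file_type, file_id)
        if key in self.cache:
            return self.cache[key]
        # Not cached, or a type that is never cached: go to the backend.
        self.backend_reads += 1
        data = self.backend[key]
        if file_type in CACHED_TYPES:
            self.cache[key] = data
        return data


backend = {("snapshot", "aa"): b"snapshot data", ("key", "bb"): b"key data"}
repo = ReadThroughCache(backend)
for _ in range(2):
    repo.load("snapshot", "aa")  # second read is served from the cache
    repo.load("key", "bb")       # key files hit the backend every time
print(repo.backend_reads)
```

In this toy model the snapshot is fetched from the backend once, while the key file is fetched on every access, which is the behaviour that hurts on a high-latency storage.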

During a backup operation, the following data is read:

keys and config (from the backend, see above)
locks (from the backend)
snapshot, index, maybe metadata (from cache or backend)

My proposal for improving the situation with hours of read-latency is:

Add an option to organize the cache by the repo string (given e.g. by the -r parameter) to make caching of keys and config possible
Add options which allow choosing what is cached. For the low-cost-storages I have in mind this would include keys, config and lock files. Maybe an option to cache all data files would be also nice to cover the setting of @David…

I will open an issue on GitHub and start working on the implementation.

David · December 3, 2019, 8:29pm

I think of it differently. I don’t consider the local copy a cache, I think of the remote copy as an offsite backup of my repository.

Experts and IT Operations standards overwhelmingly recommend the “3-2-1 strategy” for backup. In this strategy all valuable data should have 3 copies: 2 onsite and 1 offsite. The approach I described fits that strategy, with the local repository as the second onsite copy, and the cloud repository as the offsite copy.

tomwaldnz · December 6, 2019, 2:48am

One downside to Glacier Deep Archive is that you pay for at least six months of storage regardless of how long the object is stored; for Glacier it is three months. That means if you back up data that is deleted within that time you’re over-paying for it. However, given that Deep Archive is 23X cheaper than standard and 12X cheaper than IA class it probably makes sense. I’m currently putting my data into IA class, but I might move it out to Glacier Deep Archive class.

If you regularly create then delete large amounts of data then you might be best off transitioning to IA class immediately, then moving to Glacier / Deep Archive class after maybe a month. If you create but rarely delete then you might as well push it to Glacier / Deep Archive fairly quickly.
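To make the trade-off concrete, here is a small back-of-the-envelope sketch. It uses only the two figures mentioned above (a six-month minimum and a price roughly 23 times lower than standard) and expresses cost relative to one month of standard storage. Request, retrieval and transition charges are deliberately ignored, so treat the numbers as rough.

```python
def billed_cost(months_stored: float, relative_price: float, min_months: float) -> float:
    """Storage cost for one unit of data, relative to one month of
    standard storage. Short-lived objects are billed for at least
    `min_months` months."""
    return max(months_stored, min_months) * relative_price


STANDARD = dict(relative_price=1.0, min_months=0)
# Figures from the post: about 23x cheaper, six months minimum.
DEEP_ARCHIVE = dict(relative_price=1 / 23, min_months=6)

for months in (1, 3, 12):
    std = billed_cost(months, **STANDARD)
    gda = billed_cost(months, **DEEP_ARCHIVE)
    print(f"{months:>2} months: standard {std:.2f}, deep archive {gda:.2f}")
```

Under these assumptions, even an object deleted after a single month costs 6/23 of one standard month in Deep Archive, so the minimum duration only hurts relative to what you would have paid for Deep Archive without it, not relative to standard storage.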

I keep versions of objects in my Restic S3 bucket, to protect against something like a virus overwriting the data. I delete old versions after a year.

cdhowie · December 6, 2019, 3:04am

If you use any storage that is not immediately available (such as Glacier or GDA) then make sure you run every backup with --force since some tree objects in the parent snapshot may be in transitioned (and hence inaccessible) packs.

In fact, many commands would be affected. Here is my understanding of how this would affect each restic command (commands not listed are not affected):

Command	Effect of inaccessible packs
backup	Must run with `--force` since trees in the parent may be inaccessible.
check	Cannot use at all because this needs every pack header.
diff	Will fail when any snapshot references an object that is in a transitioned pack.
dump	Will fail when any object along the path to the named file is in a transitioned pack.
find	Will probably fail to run at all or may produce incomplete output.
forget	Cannot use `--prune` for the reasons given below for the prune command.
ls	Will probably fail with most snapshots when an inaccessible tree is needed.
migrate	Will fail if it needs to make any changes to transitioned packs.
mount	Will succeed but I/O errors may result when reading from the mounted virtual filesystem as inaccessible objects are needed.
prune	Cannot use at all because this needs every pack header.
rebuild-index	Cannot use at all because this needs every pack header.
recover	Cannot use at all because this needs every pack header.
restore	Can use only with data that has not been transitioned.
stats	Will likely fail if any snapshot named has transitioned objects.

In particular: a huge caveat to restore is that it can’t give you a list of packs it needs because it must first crawl all the trees to locate all of the objects. It will fail on a specific pack. You can go restore that one pack, then it will fail on another pack, and so on.

The only way to restore (or use most of these commands) with any confidence that anything will work would be to initiate a restore of every transitioned object in the repository. This could be very expensive.

If you can live with the above limitations, you might be able to use Glacier or GDA tiers. For me it’s not even close to worth the hassle.

Note that this all assumes you are only transitioning objects in the data directory/prefix. You must not transition any other objects or restic likely won’t even run against the repository.

alexweiss · December 6, 2019, 8:09pm

If I understood it right, restic does cache all data files (=packs) that contain a tree object. Or did I get something wrong?

Therefore I guess your list can be changed if you assume that you have a complete cache where all packs containing tree objects are included.

Under that assumption only packs with data blobs are inaccessible and I would change the table as follows:

Command	Effect of inaccessible data-only packs
backup	should fully work
check	should work when using --with-cache
find	should fully work
ls	should fully work
stats	should fully work

Moreover, I’m not sure whether the prune problem is mainly due to the fact that prune always includes something like rebuild-index…

Here I also disagree. I think it would be possible to produce that list of needed packs. However, I agree that the current implementation assumes that everything can be accessed immediately and would probably fail like you describe it.
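A rough sketch of how such a list of needed packs could be computed, assuming all tree blobs are readable (for example from the cache) and the index tells which pack holds which blob. The dictionaries below are hypothetical stand-ins, not restic's data structures, and this is not how restic's restore is implemented.

```python
from typing import Dict, List, Set

# Hypothetical stand-ins:
#   trees: tree ID -> {"subtrees": [tree IDs], "data": [data blob IDs]}
#   index: blob ID -> ID of the pack that contains the blob
Tree = Dict[str, List[str]]


def packs_needed(root: str, trees: Dict[str, Tree], index: Dict[str, str]) -> Set[str]:
    """Walk a snapshot's tree and collect every pack a restore would read."""
    needed: Set[str] = set()
    seen: Set[str] = set()
    stack = [root]
    while stack:
        tree_id = stack.pop()
        if tree_id in seen:
            continue  # shared subtrees are visited only once
        seen.add(tree_id)
        needed.add(index[tree_id])        # pack holding the tree itself
        node = trees[tree_id]
        for blob_id in node["data"]:
            needed.add(index[blob_id])    # packs holding file contents
        stack.extend(node["subtrees"])
    return needed


trees = {
    "t-root": {"subtrees": ["t-docs"], "data": ["d1"]},
    "t-docs": {"subtrees": [], "data": ["d2", "d3"]},
}
index = {
    "t-root": "pack-tree", "t-docs": "pack-tree",
    "d1": "pack-a", "d2": "pack-b", "d3": "pack-a",
}
print(sorted(packs_needed("t-root", trees, index)))
```

With such a list in hand, all required packs could be requested from the archive tier in one go instead of discovering them one failure at a time.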

Again, I disagree.
The other objects must be either immediately accessible or in the cache. For snapshots and indices caching is already implemented. For config, keys and locks I’m trying to change the implementation, see my comment above (#9).

cdhowie · December 6, 2019, 10:07pm

The cache is not something that is intended to make restic operate correctly if files are not accessible, it is only for performance IMO. I would strongly caution you against depending on the presence of the cache for restic to operate.

In particular, this technique is likely to break down whenever multiple cache directories are used (such as when backing up to the same repository from multiple machines).

Prune does multiple things. I merely listed the first that will cause it to fail. It must also crawl every tree referenced by any snapshot and rewrite all packs that contain a mixture of used and unused objects. Aside from check --read-data, prune probably reads the most data of any restic command.

alexweiss · December 7, 2019, 7:03am

I fully agree.
Thank you for mentioning this! Of course we have to state that the case of multiple machines using the same repository means you have to get some information directly from the backend because it can’t be known.

This is a good point. In fact, considering it that way, I believe we have another issue with restic. In my OVH case the backend directly tells you how long you have to wait until you get your data. So if restic just waited the given amount of time before retrying to get the data from the backend, it would operate successfully and correctly. (I don’t know exactly how it is with S3 Glacier.)
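The idea of waiting for the time the backend announces could look roughly like this. It is only a sketch: `NotReadyError` and `load_with_wait` are made-up names, and how a real backend reports the waiting time is an assumption here.

```python
import time
from typing import Callable


class NotReadyError(Exception):
    """Hypothetical error: the backend says the object is not retrievable yet."""

    def __init__(self, retry_after_seconds: float) -> None:
        super().__init__(f"retry after {retry_after_seconds} s")
        self.retry_after_seconds = retry_after_seconds


def load_with_wait(
    fetch: Callable[[], bytes],
    max_attempts: int = 5,
    sleep: Callable[[float], object] = time.sleep,
) -> bytes:
    """Call `fetch`; if the backend asks us to wait, wait that long and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except NotReadyError as err:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(err.retry_after_seconds)
    raise ValueError("max_attempts must be at least 1")


# Fake backend: the first request is refused with a one-hour hint.
calls = {"count": 0}


def fake_fetch() -> bytes:
    calls["count"] += 1
    if calls["count"] == 1:
        raise NotReadyError(3600)
    return b"config"


waits = []  # record the waits instead of really sleeping
data = load_with_wait(fake_fetch, sleep=waits.append)
print(data, waits)
```

The `sleep` parameter is injectable only so the example runs instantly; a real client would actually sleep, and would probably also cap the total waiting time.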

Of course that correct and successful operation would maybe take a lot of time. Here caching can improve this situation a lot, but only for suitable circumstances (e.g. single machine backup).

So I think we have two issues in restic:

Make restic work correctly and successfully with backends that have an ultra-high latency - and maybe even use the information (if present) the backend gives if you have to wait for content.
Improve the caching in restic so that performance is acceptable also for this special case - at least for regular operations like backup, check, forget and maybe prune.

For the second point I believe restic is not really that far from getting it done.
From your answer I just realized that the first point is maybe even more important. Thanks for pointing this out! I hope I find some time so I can also get some insight into this.

From a general point of view I do believe that supporting ultra cheap but ultra high latency backends would be a big plus for restic. And I also believe that it can be done and I would like to work on this issue.

I hope you appreciate my intentions and the effort trying to work on this issue.
As already written above I think it is best to open an issue and discuss implementation details there.

cdhowie · December 7, 2019, 4:43pm

I think the bigger issue is that the cache could be invalidated at any time. Depending on the cache mechanism is dangerous, IMO. That’s not the purpose of the cache.

It would be better to use Glacier/GDA for an offsite copy – make your backups to a local disk and then rclone to an S3 bucket and use lifecycle policies on that copy. The offsite copy is for recovering from a catastrophic disaster (fire, flood, tornado, power surge, etc.).

alexweiss · December 7, 2019, 7:02pm

Can you please explain what you mean by “invalidated”?

I see it like this: For all files (except config) the filename/ID equals the SHA-256 of the content. So, if you find a file with the searched ID in your cache, you know you got the file with the right content.

(Technically it could also be a hash collision. However, as this is very, very, very, very, very unlikely - and also all of restic’s fundamentals rely on not having hash collisions - I’ll neglect this purely theoretical option.)

IMO, the fact that restic relies on SHA-256 and also uses it for the filenames is one of the nicest things in restic.

David · December 11, 2019, 3:24am

As described above, this is exactly what I do and what I recommend, although I use aws s3 sync instead of rclone.

This approach (backup local, sync to aws) supports the 3-2-1 backup architecture.

David · December 11, 2019, 3:26am

Cache invalidation is a standard function of most cache systems. It is designed to support cache coherence.

alexweiss · December 11, 2019, 5:30am

@cdhowie, @David
Thank you for mentioning cache invalidation and contributing to the topic!
My point, however, was not that I do not know what cache invalidation is. I have the feeling there is a misunderstanding here.
Therefore I would like to clarify:

As described in @David’s links, usual cache systems have a problem: You never know if a file in the cache was changed in the “main” location by parallel processes. This is known as “cache invalidation” and is a really tough problem to tackle. The problem, precisely, is how to assure that you don’t get wrong data when reading a file from cache instead of the “main” storage.
In restic however, all files are named by the SHA-256 hash of the content. This implies that if a file is changed, its file name changes as well.
Therefore in restic there is no operation which changes the content of present files. If a file needs to be changed, a new file is created with new content and a new filename (and the old file is deleted).
This means, if a file is in the cache and also present on the “main” storage then the contents are always identical.
In other words: There is no cache invalidation in restic.
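The naming argument can be illustrated with a few lines of Python. This is only a sketch of the principle (file name equals the SHA-256 of the content); the function names and the sample contents are made up.

```python
import hashlib


def file_id(content: bytes) -> str:
    """Name of a file under content addressing: the SHA-256 of its content."""
    return hashlib.sha256(content).hexdigest()


def cached_copy_is_valid(name: str, cached_content: bytes) -> bool:
    """A cached file can be checked against its own name, without
    asking the main storage."""
    return file_id(cached_content) == name


original = b"index, first version"
changed = b"index, second version"

name = file_id(original)
print(cached_copy_is_valid(name, original))  # content matches its name
print(cached_copy_is_valid(name, changed))   # different content, same name
print(file_id(changed) == name)              # changed content gets a new name
```

What the hash cannot tell you is whether a file still exists in the main storage, which is why the listing still has to come from the backend.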

BTW: This mechanism also gives restic the ability to work very nicely on media which are typically used for backups, for example “WORM” (write once, read many) or append-only media.
And, back to topic, it gives restic the ability to successfully perform some restic operations on “no file-read” media, if you ensure that some of the data is in the cache.