Does restic cache contain non-encrypted backed up data?

Oui · October 18, 2020, 9:19am

On my system, the directory I want to back up using restic contains important data which is stored encrypted. On the other hand, my home directory is unencrypted. Thus, I don’t want the important data to end up in my home directory unencrypted. By default, restic’s cache directory is ~/.cache/restic. What does restic store there? Encrypted or unencrypted backed up data? Unfortunately, the docs don’t clarify this.

fd0 · October 18, 2020, 1:59pm

Welcome to the forum!

It stores copies of files present in the backend (on the server). Everything stored in the cache directory is encrypted (as of restic <= 0.10.0). Please be aware that this was done to reduce code complexity and may change, as restic’s threat model defines that the local host is trusted.

Here’s the design doc. It says:

Snapshot, Data and Index files are cached in the sub-directories snapshots, data and index, as read from the repository.

Is there anything we can do to clarify this further?

doscott · October 18, 2020, 2:43pm

Are you running restic as “you” or as root? If you are running as root, the cache directory is in root’s home directory (/root), and this is probably encrypted.

Oui · October 19, 2020, 9:55am

It doesn’t matter. /root is not encrypted by default on GNU/Linux.

Oui · October 19, 2020, 10:07am

Thanks for your clarification.

“Please be aware that this was done to reduce code complexity and may change, as restic’s threat model defines that the local host is trusted.”

I infer from this that a future version of restic might start using other unencrypted persistent (in contrast to RAM) parts of my file system to store sensitive data. Well, that’s unfortunate for me.

“Snapshot, Data and Index files are cached in the sub-directories snapshots, data and index, as read from the repository.”
“Is there anything we can do to clarify this further?”

Now that I’ve read this thread, I interpret the quoted sentence as saying that in the cache directory, restic stores files called “snapshots”, “data” and “index”, and that their exact copies are stored in the remote repository. Since I know restic doesn’t store unencrypted sensitive data in the remote repository, now I know it doesn’t do that in the cache directory either.

However, before reading your answer, I wasn’t sure how to interpret that sentence. It might help if you rewrote it more clearly and also made the words “Snapshot”, “Data”, and “Index” hyperlinks to their definitions. Also, I think “as read from the repository” contradicts how restic actually works, but I am not sure. I would guess restic backup generates Snapshot, Data, and Index files locally and then uploads them to the remote repo. But this part says it reads them from the remote repo and puts them into the cache.

fd0 · October 24, 2020, 7:52am

Regarding a future version storing sensitive data unencrypted elsewhere on your file system: this may indeed happen, as the threat model permits it. We’ll try to avoid it, though.

Please be aware that it’s also possible to disable the cache (which will make things slower) or configure its location. You could, for example, create an encrypted file system with a random password, use that as the cache directory for a run of restic, then remove it again (and forget the password).
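The two options mentioned above (disabling the cache or relocating it) can be sketched as a small Python wrapper around the restic command line. This is only an illustration: the helper names and the `encrypted_base` parameter are hypothetical, and the sketch assumes an encrypted file system is already mounted at that path. It does not create or destroy the encrypted volume itself. The only restic options used are `--no-cache` and `--cache-dir`.

```python
import subprocess
import tempfile


def restic_command(args, cache_dir=None, no_cache=False):
    """Build a restic argv list with the chosen cache behaviour."""
    cmd = ["restic"]
    if no_cache:
        cmd.append("--no-cache")            # do not use a local cache at all
    elif cache_dir is not None:
        cmd += ["--cache-dir", cache_dir]   # use this directory instead of the default
    return cmd + list(args)


def run_with_throwaway_cache(args, encrypted_base):
    """Run restic once with a fresh cache directory that is removed afterwards.

    encrypted_base is assumed (hypothetically) to be a directory on an
    encrypted file system that the caller has already mounted.
    """
    with tempfile.TemporaryDirectory(dir=encrypted_base, prefix="restic-cache-") as cache:
        return subprocess.run(restic_command(args, cache_dir=cache), check=True)
```

For example, `run_with_throwaway_cache(["backup", "/data"], "/mnt/enc")` would run one backup with a cache that only ever exists under `/mnt/enc` (a made-up mount point).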

Your interpretation of that sentence is correct, that’s how it works right now.

Regarding “as read from the repository”: ah, interesting! Indeed, restic locally creates the files in a temporary directory, then they are directly uploaded to the backend. For some files (index, snapshots) it stores a copy in the cache so it doesn’t have to download the files again next time it is run.

alexweiss · October 24, 2020, 8:03am

Isn’t using a new (and empty) cache dir for each run just equal to using --no-cache? I actually don’t remember where restic would use a repo file twice within one run. AFAIK the cache only speeds up future runs of restic.

EDIT: I’m asking because we were having that discussion about eventually consistent backends within the rework of prune. I would feel that restic accessing a file twice might suffer from similar problems. This is why I think this situation should be avoided…

EDIT2: Thinking a bit, there might be some commands that need to read tree packs more than once, maybe restic find? So let’s state that, at least for the backup command, a local cache is only needed to access files from the last backup runs.

fd0 · October 30, 2020, 9:05am

No, it’s different: with --no-cache restic won’t use a cache, so everything (even tiny bits of data) is loaded directly from the backend. For a local repository that mostly won’t hurt performance, but for a remote repository, using a new cache for each run will result in much better performance than --no-cache. Is that clearer?

It’s not the case that restic never uses a repo file twice within one run. Sometimes, data is used multiple times due to the way the deduplication algorithm works (e.g. a tree object that’s used several times in a backup), so that’s where the cache really helps. I’m not sure how common this situation is in practice, though, at least not for backup. For restore, I can imagine that there are many data blobs reused between files.

For backup, the cache helps a lot for a high-latency remote repository (such as B2), where it is quite expensive to fetch a file. Without a cache, restic will request parts of a file stored in the backend multiple times. With the cache, the whole file will be loaded on the first request, and then all future bits needed from that file are loaded from the local copy.
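The difference described above can be shown with a toy model in Python. It is not restic’s implementation: `CountingBackend`, `WholeFileCache` and `count_requests` are made-up names, and the “backend” is just a dictionary that counts round trips. The point is only that partial reads hit the backend every time without a cache, while a whole-file cache pays for one request per file.

```python
class CountingBackend:
    """Toy stand-in for a remote backend that counts round trips."""

    def __init__(self, files):
        self.files = files          # maps file name -> bytes
        self.requests = 0

    def read_range(self, name, offset, length):
        self.requests += 1          # every partial read is one round trip
        return self.files[name][offset:offset + length]

    def read_all(self, name):
        self.requests += 1          # fetching the whole file is also one round trip
        return self.files[name]


class WholeFileCache:
    """Fetch the whole file on first use, then serve all reads locally."""

    def __init__(self, backend):
        self.backend = backend
        self.local = {}

    def read_range(self, name, offset, length):
        if name not in self.local:
            self.local[name] = self.backend.read_all(name)
        return self.local[name][offset:offset + length]


def count_requests(reader, backend, reads):
    """Perform the given (name, offset, length) reads and return backend round trips."""
    for name, offset, length in reads:
        reader.read_range(name, offset, length)
    return backend.requests
```

With three partial reads of the same file, the uncached reader makes three backend requests and the cached reader makes one, which is where a high-latency backend benefits.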

I think you’re right about the backup command.

alexweiss · October 30, 2020, 9:28am

Ah, I didn’t remember that restic always caches the whole pack file. Then of course it does matter for high-latency backends. Thanks for clarifying!

cdhowie · October 30, 2020, 6:19pm

How much of a difference would you expect a cache to make for a local repository? I’ve never really thought about this, and now that you’ve said this it occurs to me that I might be able to free up quite a bit of disk space if there isn’t much of a penalty for not using a cache with local repositories.

alexweiss · October 31, 2020, 7:12am

If access to the cache is not faster than access to the repository (e.g. if both are on the same physical medium), then you may as well omit the cache…

If the cache is on a faster medium, the benefit depends very much on the actual speed difference and your specific use case. And for a local backend, it will also depend a lot on the OS file system caching…
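A rough way to check the “same medium” case is to compare the device IDs of the cache and repository directories. The Python sketch below uses a hypothetical helper name, and it is only an approximation: `st_dev` identifies the file system, so two partitions on one physical disk will report different IDs even though they share the medium.

```python
import os


def same_device(path_a, path_b):
    """Return True if both paths are on the same file system (same st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

If this returns True for the cache directory and a local repository, the cache is unlikely to be faster than the repository itself.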

In general, index files are only read once and then kept in memory, so there shouldn’t be much gain. I also think that snapshot files are in general only read once, but I’m not 100% sure about this.

For tree blobs, the access pattern matters. E.g. during backup, restic reads the complete tree of the parent snapshot. So if there is not much change compared to the parent, faster access to tree blobs will give a noticeable speed-up. If there is a lot of change, the backup time will be dominated by the upload of the new data…