Is a safe `forget` command possible when the append-only client was compromised?

luc · November 16, 2021, 11:26pm

Ransomware encrypts your files.
You notice and want to restore a backup. Luckily, you thought ahead and have this append-only server, so the attacker could not delete anything.
Unfortunately, because disk space is not infinite, you had your server set to run a forget+prune every month. The attacker (in true Kerckhoffs’s style: knowing your system but not your keys) created fake snapshots mere minutes before your cron job ran, causing your forget rule to treat the real backups as stale and remove them!

The threat model warns about some trick using the forget command:

Note: It is not recommended to ever run forget automatically for an append-only backup to which a potentially compromised host has access because an attacker using fake snapshots could cause forget to remove correct snapshots.

—References — restic 0.16.3 documentation

One approach to solving this problem would be to simply include --keep-within 14d to make an attacker wait at least 14 days between compromising your system and successfully deleting your backups. I use my computer more frequently than that, so I would notice if my files were encrypted in the meantime. However, the documentation makes short work of that attempted mitigation:

--keep-within duration keep all snapshots which have been made within the duration of the latest snapshot
[…]

All calendar related --keep-* options work on the natural time boundaries and not relative to when you run the forget command.

—Removing backup snapshots — restic 0.16.3 documentation

Not sure what a natural time boundary is (are there unnatural time boundaries?) but the rest is pretty unambiguous: the attacker can add snapshots that were supposedly made in the year 2300 and forget would happily remove all your other snapshots as being too ancient. Or if you try to keep N weekly, they would add >N bogus historic ones.
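To make the attack concrete, here is a minimal Python sketch of a keep-within rule that measures its window from the latest snapshot, as the quoted documentation describes. It is a simplified model and not restic's code; the function name `naive_keep_within` and all dates are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def naive_keep_within(snapshot_times, duration):
    """Keep every snapshot within `duration` of the latest snapshot.

    Hypothetical model: the reference point is the newest timestamp in
    the list, whatever it claims to be.
    """
    latest = max(snapshot_times)
    return [t for t in snapshot_times if t >= latest - duration]


utc = timezone.utc
real = [datetime(2021, 11, day, tzinfo=utc) for day in (1, 8, 15)]
fake = datetime(2300, 1, 1, tzinfo=utc)  # timestamp chosen by the attacker

# Without the fake snapshot, all three real snapshots fall inside 14 days.
kept_before = naive_keep_within(real, timedelta(days=14))

# With the fake snapshot as reference, every real snapshot looks ancient.
kept_after = naive_keep_within(real + [fake], timedelta(days=14))

print(len(kept_before), "kept before the attack")
print(len(kept_after), "kept after the attack")
```

In this model a single snapshot dated in the year 2300 is enough to make the rule drop every real snapshot.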

Is there any rule possible to keep snapshots for at least, let’s say, 14 days after they were added? Or does one need to write a custom script that does sanity checks on snapshot list (while the server is off to prevent race conditions) to use the append-only feature effectively?
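As a rough idea of what such a custom sanity check could look like, here is a Python sketch. It assumes the server can supply a trustworthy arrival time for each snapshot (for example the modification time of the snapshot file as written by the server), which is an assumption and not something restic provides by itself. The `Snapshot` type, both function names, and the thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Snapshot(NamedTuple):
    snapshot_id: str
    claimed_time: datetime  # from the snapshot metadata, so client-controlled
    arrival_time: datetime  # recorded on the server, assumed trustworthy


def find_suspicious(snapshots, tolerance=timedelta(days=1)):
    """Return snapshots whose claimed time disagrees with their arrival time."""
    return [s for s in snapshots
            if abs(s.claimed_time - s.arrival_time) > tolerance]


def protected_ids(snapshots, now, min_age=timedelta(days=14)):
    """Return IDs of snapshots that arrived less than `min_age` ago."""
    return {s.snapshot_id for s in snapshots if now - s.arrival_time < min_age}


utc = timezone.utc
now = datetime(2021, 11, 16, 12, 0, tzinfo=utc)
snapshots = [
    Snapshot("real-old", datetime(2021, 10, 1, tzinfo=utc),
             datetime(2021, 10, 1, tzinfo=utc)),
    Snapshot("real-new", datetime(2021, 11, 10, tzinfo=utc),
             datetime(2021, 11, 10, tzinfo=utc)),
    Snapshot("fake", datetime(2300, 1, 1, tzinfo=utc),
             datetime(2021, 11, 16, 11, 55, tzinfo=utc)),
]

suspicious = find_suspicious(snapshots)
protected = protected_ids(snapshots, now)

print([s.snapshot_id for s in suspicious])
print(sorted(protected))
```

A wrapper script could refuse to run forget at all while `suspicious` is non-empty, and otherwise make sure nothing in `protected` is removed, which would give the 14-day window to notice a compromise.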

And is this the attack meant by the threat model? I checked the diff, the pull request thread and review comments, and the ticket that triggered adding a threat model in the first place, but no attack is mentioned concretely. After a while, I figured this must be it, but since I didn’t realize that append-only is not safe with forget until the threat model hinted at it, there might be more I don’t realize yet.

(mods: please remove the `backticks` around those links. The stupid forum software doesn’t allow me to post more than two links, even if it’s all the same domain. Also, it sends spam by default to try and get me onto the forum (“Activity Summary”). I’m signed up for way too many discourse forums out there now, each with a new login, for this to be a convenient default setting…)

rawtaz · November 16, 2021, 11:44pm

This is discussed here: Warn about future-dated snapshots on restic check · Issue #3498 · restic/restic · GitHub

luc · November 16, 2021, 11:59pm

I’m not sure I understand. There isn’t any discussion in that link? Just the initial post that seems to conclude the same as me (albeit in more general terms, no concrete example as above). It links to a pull request, saying “As discussed [there]”, but then there is also no discussion there either: it just implements additional options with the same flaw as existing options. Aha, but this pull request has a link to another ticket (3414)! But it’s also just about those options, not at all about the forget command being dangerous. This ticket links to a forum post but that also has no discussion about it being unsafe/dangerous/confusing (and a few other keywords I ctrl+f’d for).

Thanks for the link, as this indeed seems to request a change similar to the one I think we should probably make (there is always more to read, isn’t there). It does not really answer the question, though, of whether a safe forget rule is already possible (one just has to be careful in designing it), and whether this is really the attack that the threat model speaks of or whether there is more to consider when designing a solution (be it as part of mainline restic or a custom script).

torfason · November 17, 2021, 11:07am

Thank you for the analysis @luc. I’m the one who implemented the --keep-*-within options and submitted the issue you referred to. At the time, when discussing these new features, we realised that incorrectly dated snapshots could be dangerous, both with these and the older keep-policies.

However, we had in mind only snapshots with inadvertently incorrect dates. As you point out, a sophisticated attacker could exploit this in certain circumstances to make it even more dangerous.

My takeaways:

- This indicates even more strongly that having automated forget+prune policies on append-only servers is not safe.
- Perhaps it is possible to address this by going beyond the solution suggested in the GitHub issue, but I am not sure exactly what would be a good way to do so.
- It’s worth mentioning this analysis in the issue, so that it can at least be considered while implementing that feature.

luc · November 17, 2021, 10:19pm

Hey, happy coincidence (or is it? ^^) that you noticed this topic.

I actually wrote a comment for the ticket on github last night, but I wasn’t sure about it and it was 2am so I decided to leave it for today. Just made some minor edits and added it to the ticket, so it is not actually in response to what you wrote in case it might seem that way. Curious what others think of it, if I may be overlooking something (I’m not very familiar with the project internals so I very well might be).

rawtaz · November 18, 2021, 1:05pm

All of this boils down to the fact that one can set a custom date/time on snapshots. As long as that feature exists, which it will continue to do, it is what it is. Calling the forget or prune commands dangerous isn’t really fair, I think; they just do what they should be doing and they do it right. The attack you describe is indeed real though.

Anyway, changing e.g. --append-only to not support setting a custom timestamp is probably not a practical solution - first of all it only has an effect in rest-server and not the other backends, and second the rest-server would need to be able to know what the restic client is doing in the snapshot files, which it doesn’t (it’s just shuffling files back and forth for the restic client).

We already discussed the plan to warn the user if a snapshot is too far into the future. A further option would be to introduce a way where the user can disable forgetting of snapshots that are beyond a certain datetime difference or similar - that is, so that restic refuses to forget such snapshots entirely.

Of course some people would probably not realize that they should perhaps apply that feature to prevent “accidental” forgetting of snapshots, but at the same time those that actually do have such a targeted threat model would arguably look at the options of the forget command before running it. So honestly I don’t think that would be a problem, and I think making this feature the default could potentially be too backwards breaking. But it’s worth considering anyway (having it disabled by default and allowing it explicitly, I mean).

torfason · November 18, 2021, 3:45pm

I agree that a fix is complicated, so for now, the key takeaway is to be careful with forgets.

W.r.t. changing --append-only only having an effect in rest-server, it seems to me like that is the only solution that explicitly* claims to support a threat model where the machine running restic may be compromised, as long as the machine running rest-server is not.

(*) I say explicitly because an alternative way to try to isolate would be for machines to back up to a shared location (medium-secure), while a highly-secure machine copies from the medium-secure location, but the highly-secure machine makes sure to never delete any files from its copy of the repo. Here again, if an attacker managed to sneak in malicious fake snapshots (adapted to the specific forget policy), they could trick the highly-secure machine into forgetting all the real snapshots and then pruning all the data into oblivion. It’s a tricky problem …

MichaelEischer · November 18, 2021, 8:05pm

forget --keep-within $time will ignore snapshots which have a date in the future when determining the latest snapshot. That is, a future-dated snapshot will not be used as the reference snapshot. Therefore, that option will prevent all snapshots between now and now-$time (and maybe even older ones) from being deleted. Future-dated snapshots are also kept.
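The behaviour described above can be modelled in a few lines of Python. This is a sketch of the description only, not restic's actual implementation; the function name `keep_within` and the explicit `now` parameter are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def keep_within(snapshot_times, duration, now):
    """Model of keep-within that never uses a future-dated reference.

    Future-dated snapshots are kept, but the window is measured from the
    latest snapshot that is not in the future.
    """
    past = [t for t in snapshot_times if t <= now]
    future = [t for t in snapshot_times if t > now]
    if not past:
        return future
    cutoff = max(past) - duration
    return [t for t in past if t >= cutoff] + future


utc = timezone.utc
now = datetime(2021, 11, 16, tzinfo=utc)
real = [datetime(2021, 11, day, tzinfo=utc) for day in (1, 8, 15)]
fake = datetime(2300, 1, 1, tzinfo=utc)  # timestamp chosen by the attacker

kept = keep_within(real + [fake], timedelta(days=14), now)
print(len(kept), "snapshots kept")
```

Because the reference snapshot can never be later than `now`, the cutoff is never later than `now - duration`, so everything inside that window survives regardless of what the attacker adds.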

We’d need a way to attach a trusted timestamp to snapshots, and that’s complicated. An alternative would be to implement something along the lines of the S3 Object Lock feature, where a client is able to prevent files from being deleted for some time. By locking a snapshot (and with proper integration in restic) that would prevent an attacker from deleting the correct snapshots.
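To illustrate the locking idea, here is a toy Python model of a store in which every object carries a retain-until time and deletion is refused before that time. It only sketches the concept; it does not use the S3 API, and the class `LockedStore` and its methods are hypothetical.

```python
from datetime import datetime, timedelta, timezone


class LockedStore:
    """Toy object store with a per-object retention lock."""

    def __init__(self):
        self._retain_until = {}

    def put(self, key, now, retention):
        # The lock is derived from the store's clock, not from any
        # timestamp inside the uploaded object.
        self._retain_until[key] = now + retention

    def delete(self, key, now):
        if now < self._retain_until[key]:
            raise PermissionError(f"{key} is locked")
        del self._retain_until[key]

    def __contains__(self, key):
        return key in self._retain_until


utc = timezone.utc
t0 = datetime(2021, 11, 16, tzinfo=utc)
store = LockedStore()
store.put("snapshots/abc", t0, timedelta(days=14))

try:
    store.delete("snapshots/abc", t0 + timedelta(days=1))
    early_delete_refused = False
except PermissionError:
    early_delete_refused = True

# After the retention period has passed, deletion goes through.
store.delete("snapshots/abc", t0 + timedelta(days=15))
print(early_delete_refused, "snapshots/abc" in store)
```

With such a lock in place, a forget run tricked by fake snapshots could still select the real snapshots for removal, but the store would refuse to delete them until the lock expires.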

luc · November 20, 2021, 1:04am

That sounds like a solution to the question! Thanks

MichaelEischer · November 20, 2021, 12:24pm

@luc I think I should add that my explanation of the --keep-within behavior only applies to restic since version 0.12.1. All earlier versions will fall into the trap and delete too many snapshots.