Ransomware check

David · June 1, 2020, 8:53pm

Hi.

I use restic nightly to back up my most precious data.

I perform a local backup, which I keep onsite, and then I synchronize the repository to an offsite S3 bucket (using ‘aws s3 sync’). This approach complies with the CERT “3-2-1 backup strategy” recommendation: 2 onsite copies, 1 offsite copy.

I worry about ransomware corrupting my repository. Although I perform a restic check weekly on my repository, I would be devastated if I synced a corrupted repository to the cloud. So I have been noodling over the question: How do I know whether my repository has been corrupted before I sync it to the cloud?

Options I’ve considered so far:

1. Check the return code from ‘restic backup’. Assume that if it completes successfully, the repo is OK and sync it.

Concern: Does a successful return code actually indicate that my repository is OK? I suspect not.

2. Perform ‘restic check --read-data’ before I sync to the cloud

Concern: Slow as hell. Really, really slow. I already do a ‘restic check’ regularly to verify that the repo is sane, and the --read-data version must be checking more than I am worrying about to be this slow.

3. Perform ‘sha256sum’ on each file within the repository and compare it to the filename. If they all match, sync.

This feels like a reasonable compromise: it gives good assurance that the repo is intact without the enormous time investment of restic check --read-data. It runs about 10 times faster, and should catch any external (non-restic-induced) corruption in key, snapshot, index or data files.
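A minimal sketch of option 3, assuming the layout described above: files under data, index, snapshots and keys are named after the SHA-256 of their content, while the top-level config file is not and is therefore skipped. The function names and the directory list are illustrative, not part of restic.

```python
import hashlib
from pathlib import Path

# Assumption: these are the directories whose files are named after the
# SHA-256 of their content. The top-level "config" file is not checked.
HASH_NAMED_DIRS = ("data", "index", "snapshots", "keys")


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large pack files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(repo_root):
    """Return repo-relative paths whose file name differs from their content hash."""
    root = Path(repo_root)
    bad = []
    for sub in HASH_NAMED_DIRS:
        base = root / sub
        if not base.is_dir():
            continue
        for path in sorted(base.rglob("*")):
            if path.is_file() and sha256_of(path) != path.name:
                bad.append(path.relative_to(root).as_posix())
    return bad
```

An empty result means every checked file still matches its name. It says nothing about whether files are missing or whether the repository is structurally consistent.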

Is this the best approach? Any other recommendations?

Nev · June 2, 2020, 12:59pm

Hi David,

I have a similar setup, and a similar concern. Don’t know if my current solution is the best, but I think it addresses the risk sufficiently for me. I’m interested to see what other people do.

I take advantage of the fact that restic never alters repo files after creation. So when I copy to an off-site s3 bucket, I use rclone in the following way (I would guess similar options are available with other copying methods):

Typically run it with -v copy --immutable. This won’t delete any files on s3 that are missing from the source, and will refuse to overwrite an existing file, throwing an error if the files don’t match (as would be the case if the source files got encrypted with ransomware). Worst-case, post-attack this would mean that any snapshots younger than the previous s3-copy operation would be unusable (which they would be, anyway).
After a prune, run it with -v sync --immutable. This will still error on changed files, but will also delete any files on s3 that were removed from the source during the prune. So check a dry-run output first to be completely reassured that your entire repository isn’t about to be deleted.
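To make the "never overwrite, never delete" behaviour concrete, here is a small local simulation of what the copy step is described as doing above. It is not rclone's implementation; copy_immutable and ImmutableViolation are hypothetical names, and it works on two local directories only.

```python
import filecmp
import shutil
from pathlib import Path


class ImmutableViolation(Exception):
    """A file that already exists at the destination differs at the source."""


def copy_immutable(src, dst):
    """Copy files missing from dst; never overwrite or delete anything in dst.

    Returns the relative paths that were newly copied. New files are still
    copied when a conflict exists, but the call then raises ImmutableViolation
    listing every file whose content changed at the source.
    """
    src, dst = Path(src), Path(dst)
    copied, changed = [], []
    for path in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = path.relative_to(src)
        target = dst / rel
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(rel.as_posix())
        elif not filecmp.cmp(path, target, shallow=False):
            # Same name, different bytes: exactly what ransomware would cause.
            changed.append(rel.as_posix())
    if changed:
        raise ImmutableViolation(f"refusing to overwrite: {changed}")
    return copied
```

Because restic only ever adds new files between prunes, a healthy repository should never trigger the exception, so any error here is worth investigating before doing anything else.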

For non-malicious corruption risk, I use par2 error-correction files. While I could find such corruption with hash-checking, the par2 approach means I’d have a chance to correct it after the event.

I then trust my fortnightly restic check --read-data will pick up problems in enough time to recover.

rawtaz · June 2, 2020, 1:32pm

You do a full integrity check of the source repository before syncing it, see restic check. This will tell you if there’s corruption, but it will of course not tell you if someone or something e.g. deleted snapshots. For that you have to do other checks that are outside of the scope of restic.

On option 1 (checking the return code of ‘restic backup’): the return code is about the individual backup run, and has nothing to do with the integrity of the repository in general.

On option 2 (‘restic check --read-data’): this is a complete integrity check, and is the most reliable way to verify that your repository isn’t corrupted. From the docs:

Just check checks structural consistency and integrity, e.g. snapshots, trees and pack files.
check --read-data also checks integrity of the actual data that you backed up.

On option 3 (comparing sha256sum output with the file names): this is pretty much what check without --read-data does, but check also looks at the structure of the files in the repository, so I don’t think there’s much value in comparing hashes yourself.

I would put the repository on ZFS and make use of its zero-cost snapshots after each backup, such that if a repository was corrupted I could restore it to an older version.

sniner · June 2, 2020, 6:27pm

If ZFS is not an option, perhaps Btrfs is one. Btrfs has the same abilities in this respect. Just make sure you take read-only snapshots with btrfs sub snap -r ... to prevent changes.
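As a rough illustration of the snapshot-after-backup idea, the sketch below only builds the command lines for a timestamped read-only snapshot. The dataset name, subvolume path, snapshot directory and the "restic-" label prefix are all hypothetical; running the commands needs the appropriate privileges.

```python
from datetime import datetime, timezone


def snapshot_label(now):
    """Build a sortable label such as restic-20200602T182700Z from an aware datetime."""
    return now.astimezone(timezone.utc).strftime("restic-%Y%m%dT%H%M%SZ")


def zfs_snapshot_cmd(dataset, now):
    """Command for a ZFS snapshot of the dataset holding the repository."""
    return ["zfs", "snapshot", f"{dataset}@{snapshot_label(now)}"]


def btrfs_snapshot_cmd(subvolume, snapshot_dir, now):
    """Command for a read-only (-r) Btrfs snapshot of the repository subvolume."""
    target = f"{snapshot_dir}/{snapshot_label(now)}"
    return ["btrfs", "subvolume", "snapshot", "-r", subvolume, target]
```

Either list can be passed to subprocess.run right after a successful backup, for example zfs_snapshot_cmd("tank/restic", datetime.now(timezone.utc)).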

David · June 2, 2020, 9:15pm

Nev wrote: “I take advantage of the fact that restic never alters repo files after creation.”

Lots of really interesting ideas in this post. Thanks a lot - I need to put my thinking cap on about this.

bdillahu · June 2, 2020, 9:31pm

Just in the “for what it’s worth” department…

I used to do about the same in terms of a local restic run and then an rclone to offsite (I use B2 from Backblaze). Worked fine and I never saw an issue.

I had ongoing minor concerns like you about copying corrupt data and messing up my failsafe… Finally just went with doing two distinct restic runs - one to local, and one to B2 directly. Timing and expiration policies not quite the same (I keep on-site backup a lot longer than cloud for cost reasons), but I figure if the worst happens, hopefully both runs didn’t corrupt things at the same time.

I expected it to be a bad performance hit, but find it really hasn’t been that much. I do pay a little more in transaction fees, but with B2 anyhow, it’s not significant for my use case.

David · June 2, 2020, 9:41pm

Thanks for the feedback. It helped me clarify my thoughts:

On ‘restic check --read-data’: agreed that this is the most reliable approach. But it’s off the table for daily use due to poor performance.

On the claim that the sha256sum pass only repeats what plain check does: cannot agree. The sha256sum pass will certainly tell me if the contents of data files have been modified, which restic check won’t always do unless --read-data is specified. This is precisely why I am considering it (and because it is 8x faster than --read-data).

Perhaps the sweet spot for balancing performance and reliability is as follows: Perform a daily restic check followed by the sha256sum check of each file. If both pass, sync it to the cloud. That will validate both structural integrity and data integrity in about 20% of the time of check --read-data. I could also perform check --read-data less frequently (maybe monthly).

On the ZFS snapshot idea: a useful suggestion, but it doesn’t really solve my core goal of identifying whether the repository is healthy before I sync it to the cloud. It does provide me with a nice option for reverting to a previous version of the repository once corruption is found, but it doesn’t help me find it.

In summary: We all agree that restic check --read-data is the most reliable approach. Unfortunately, it’s too slow for me to use on a daily basis. I think a daily restic check plus a sha256sum pass might be functionally close enough, and so much faster, that it becomes my go-to daily checkup, with a less frequent check --read-data.
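The daily routine summarised above could be wired together roughly as follows. This is only a sketch: run_gate and daily_plan are hypothetical helpers, verify_names.py stands in for whatever script performs the sha256sum pass, and restic is assumed to find its password through the environment (for example RESTIC_PASSWORD_FILE).

```python
import subprocess


def run_gate(checks, sync):
    """Run each check in order and run the sync only if every check exits 0.

    checks is a list of (name, argv) pairs. Returns the name of the first
    failing step ("sync" if the sync itself fails), or None on full success.
    """
    for name, argv in checks:
        if subprocess.run(argv).returncode != 0:
            return name  # stop here: the repository is not synced
    if subprocess.run(sync).returncode != 0:
        return "sync"
    return None


def daily_plan(repo, bucket_url):
    """Commands for the daily routine: structural check, hash pass, then sync."""
    checks = [
        ("restic check", ["restic", "-r", repo, "check"]),
        # Hypothetical script that compares each file's SHA-256 with its name.
        ("sha256 pass", ["python3", "verify_names.py", repo]),
    ]
    sync = ["aws", "s3", "sync", repo, bucket_url]
    return checks, sync
```

Calling run_gate(*daily_plan("/srv/restic-repo", "s3://example-bucket/restic-repo")) returns None only when both checks and the sync succeeded; both arguments are placeholder values.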

David · June 2, 2020, 9:48pm

Running two independent restic backups has occurred to me as well, although I haven’t considered it fully. This strategy has a lot of good qualities. I like the idea of two independent repositories for the same reasons you mention.

Like you, I would expect it to have a substantial cost/performance hit. Currently, I move my data files to Glacier Deep Archive, which doesn’t really work for prunes and checks unless you are syncing from a local repo. I’d need to consider alternative storage solutions.

Thanks for the idea!

rawtaz · June 2, 2020, 9:53pm

You are correct, what I wrote should have been “check with --read-data”. Sorry about that. The main point I wanted to make was that just checking sha256sum won’t tell you that the repo is structurally intact.

rawtaz · June 2, 2020, 9:55pm

I second this very very much. I too run multiple backups to different repositories, and wouldn’t consider syncing one single backup to an offsite repository.

David · June 2, 2020, 10:02pm

Thanks for the further input!

If you don’t mind some followup questions:

How do you choose to manage restic prune and restic check for remote repositories?

Do you perform them from the same host that does the restic backup? (presumably, a low-bandwidth connection to the cloud repository), or do you spin up a machine in the cloud that’s closer to the storage?
How often do you perform a restic prune and restic check on the remote repository?

bdillahu · June 2, 2020, 10:22pm

I perform them from the same machine… low/semi-low bandwidth (200Mbps down/11Mbps up).

I run a ‘restic forget --prune’ weekly. I don’t explicitly run a check, unless I do it manually for some reason.

I was thinking that a forget/prune kind of includes the actions of a check, but I may be wrong there.

I’m also excitedly watching this PR (https://github.com/restic/restic/pull/2718) with the hope that it gets implemented soon, as it holds the promise of making this process much better.

Currently the prune takes WAY too long, but for my environment is still (on the edge of) acceptable.

rawtaz · June 2, 2020, 10:23pm

One repository is my own server hosted in a location I have access to remotely - on this one I run check and prune locally on that server. This repo is stored on ZFS which I also scrub regularly.

The other repository is a VPS in the cloud and there I run check and prune remotely from the client that I back up, because I don’t want to give the host the password to the repository. This repo is stored on Ceph by the hosting provider.

Keeping it simple, just manually when I feel it’s time to do some maintenance (which also means it’s being done when it won’t interfere with any backup runs etc).

dionorgua · June 3, 2020, 9:32am

It depends very much on your own threat model regarding repository corruption. restic never deletes files during a backup, so between prunes the repository is effectively ‘append-only’. So you can do the following:

After every backup, upload only new files to the remote mirror (rclone copy --immutable). Optionally run check or a sha256 pass first.
After forget --prune, run check --read-data and, if it passes, perform a full sync (rclone sync --immutable).

Even if something goes wrong during a particular backup and no check/sha256 pass was done, at least you’ll be able to restore previous snapshots from the cloud. The --immutable switch protects against syncing “bit rot” such as silent file corruption.
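The two cases can be expressed as a small command planner. This is a sketch under assumptions: mirror_commands is a hypothetical helper, the repository path and remote name are placeholders, and the extra --dry-run pass before the real sync follows the advice given earlier in the thread rather than this post.

```python
def mirror_commands(repo, remote, pruned):
    """Return the commands to run, in order, to update the remote mirror.

    pruned=False: ordinary backup, so only new files are uploaded.
    pruned=True: files were removed locally, so verify fully, preview the
    deletions, then sync. Each command should only run if the previous one
    succeeded (and, for the dry run, looked sane to a human).
    """
    if not pruned:
        return [["rclone", "copy", "--immutable", "-v", repo, remote]]
    return [
        ["restic", "-r", repo, "check", "--read-data"],
        ["rclone", "sync", "--immutable", "--dry-run", "-v", repo, remote],
        ["rclone", "sync", "--immutable", "-v", repo, remote],
    ]
```

For example, mirror_commands("/srv/restic-repo", "s3remote:example-bucket/restic-repo", pruned=False) yields the single copy command used after a normal nightly run.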

But I agree that having two different destinations is more robust. The question is bandwidth and prune performance.

chrestomanci · June 5, 2020, 7:04pm

If you are serious about protecting yourself from ransomware, then you need to understand the threat. These days viruses and the like that arrive via spam are not the main problem. Instead there are a number of cyber-crime gangs that specialise in targeted attacks on medium-sized companies, charities and public bodies.

These gangs will remotely hack into your systems, get admin access on your network, map it out, and learn what is important, where your backups are, everything. Then, usually during a weekend when you are home with your family, they trigger the final attack. As well as encrypting everything, they will delete your backups (using the cloud credentials they found while mapping out your system). These gangs typically ask for million-dollar ransoms, and often get them.

If you are responsible for an organisation that might be targeted in that way, then you need to make sure that there is at least one set of backups that is either fully offline, or which cannot be deleted using any credentials that can be found anywhere on your network. For example, once your nightly backup has copied all your data to S3, you could have a second account (with passwords not stored electronically) that syncs the data to a second S3 bucket that the criminals won’t be able to touch. (You also need to harden your network to prevent the criminals getting in in the first place.)