Secure Backup Strategy - Requesting Comments


#1

Hello all,

I’ve been working to identify a good backup strategy for a while now and have settled on restic as the best bet for achieving it. Unfortunately, implementing a secure repository isn’t something restic handles. Using restic-server would help, but it only supports storing repositories on the local host.

Using rclone’s restic REST API implementation, I believe we can achieve off-site repositories that are secure against host compromise. Below is a design that uses rclone to implement this idea. As I’m sure I’ve missed something, please provide comments.

restic Secure Backup Strategy

To achieve a backup strategy that prevents a host compromise from also compromising backups, we’ll implement a multi-repository backup in which one repository is completely controlled by the host (local) and the other by a central host (off-site). The central host will be responsible for the maintenance of the off-site backups of all hosts.

The Problem

The purpose of backing up data is to protect it against loss. In this case, loss can be defined as the inaccessibility of the desired data, not the undesired exposure of it. Such inaccessibility can occur as a result of hardware failure, software failure, user error, or malicious intent.

restic covers the simple case of hardware failure, software failure, and user error well by maintaining a versioned repository of critical data in an off-host location. In the event that something catastrophic happens on or to the host, a user can simply restore the data from the off-host location. All the user needs is restic, access to the repository, and any of the passwords associated with it.

When we consider malicious intent, however, restic provides little protection. restic’s encryption sufficiently protects the repository from access by an unauthorized party. But when it comes to an attacker that wants to destroy or hold data ransom, restic may not help.

As attackers become more sophisticated, they are putting more effort into the thoroughness of their attacks. Even automated attacks have started to consider the presence of backups and attempt to disable or destroy them. In the simple case of a host backing up to a repository, even remote, where it is able to delete or overwrite the contents, an attacker could do the same. restic alone has no ability to prevent this.

Some solutions presented to protect against this risk are to use restic-server in append-only mode or to restrict the S3 permissions of the backup user’s API access keys to prevent deleting files. With restic-server, the challenges are that the repository is stored locally and/or you need to maintain the backup storage. With the S3 API, while you can restrict the ability to delete files, you cannot restrict the ability to overwrite them. To deal with maliciously overwritten files in S3, enabling bucket versioning is often suggested. That may mitigate the risk, but offers an onerous restore process if an attacker does overwrite the repository files in S3.

Regardless, in both cases, an attacker with access to a single host (the host running restic-runner or the host with access keys to the S3 bucket), can make restoration difficult or impossible, by either destroying or corrupting the repository.

Local Backups

Local backups will be performed to either a local disk or a local network device. They will be performed by a scheduler on the local host that will also manage retention and cleanup. Because of the non-monetary cost of I/O to local storage (i.e. no per-API call or data transfer billing) and the deduplication behavior of restic, these can be performed as frequently as desired.

Off-site Backups

Off-site backups will be performed through an rclone instance serving the restic REST API. rclone will restrict clients to append-only, protecting the repository from host compromise. Each host will still be responsible for initiating its off-site backup (though this may change), but will not handle retention or cleanup. Retention and cleanup will be handled by a separate host that can access the off-site repositories directly.

A second password will be added to every off-site repository that can be used for repository maintenance and data recovery. This will allow each host to use a unique password without needing to track them all.
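As a rough sketch of the off-site setup described above, the helpers below assemble the two command lines involved: the append-only rclone serve restic instance and the restic key add call for the second password. The remote name, listen address, htpasswd path, and repository URL are placeholders I made up for illustration; the sketch only relies on rclone's `--append-only`, `--addr`, and `--htpasswd` flags and on restic's `key add` subcommand.

```python
import shlex


def rclone_serve_cmd(remote, addr, htpasswd):
    """Command line for an append-only restic REST endpoint backed by `remote`."""
    return [
        "rclone", "serve", "restic",
        "--append-only",          # clients may add repository data but not delete it
        "--addr", addr,           # listen address of the REST endpoint
        "--htpasswd", htpasswd,   # per-host HTTP credentials
        remote,
    ]


def restic_key_add_cmd(repo_url):
    """Command line that adds a second (management) password to a repository."""
    return ["restic", "-r", repo_url, "key", "add"]


if __name__ == "__main__":
    # Every value below is a hypothetical placeholder.
    serve = rclone_serve_cmd("offsite:backups", "0.0.0.0:8080", "/etc/rclone/htpasswd")
    key_add = restic_key_add_cmd("rest:https://backup.example.com:8080/host1")
    print(shlex.join(serve))
    print(shlex.join(key_add))
```

Building the commands as lists rather than strings keeps them safe to hand to `subprocess.run` without shell quoting issues.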

Justification

If an individual host is compromised, the attacker will have all of the knowledge required to access both repositories. Though the attacker would be able to modify and delete data from the local repository, they could only read and add data to the off-site repository. While the attacker could destroy any local copy of the password, they cannot modify the passwords stored in the off-site repository. Without needing to keep a central copy of each host password, we can still restore data from the off-site backups using the management password.

In the event that the central host, password, or off-site storage is compromised, the attacker would not have access to the local repositories of each host. While the attacker could destroy the off-site repositories, all hosts could still back up to and restore from local repositories.

In the event that there is a software issue or hardware issue that causes one of the repositories to become unavailable, the other should remain healthy.

Risks/Concerns

Data Duplication: Fortunately, local storage is fairly inexpensive, but we would be storing backups twice. This is in line with the 3-2-1 backup strategy, anyway.

Backup Run-Time and Resource Usage: As we’ll be running backup operations twice to independent repositories, the backup time and resource cost is effectively doubled. This is annoying.

Data Security: Each repository, both local and off-site, will have its own password and encryption key. However, if the central management is compromised, the attacker will gain the ability to read the backup data of all hosts (via the management password). This could be mitigated by segregating the central management of repositories by data classification and protecting each accordingly.

Application/Protocol Design and Implementation Issues: These happen and can range from data corruption to compromise. One possible scenario is an attacker acquiring the management password from a host by reading the key database of its off-site repository, then using a flaw in the encryption to reveal the management password. Since rclone does not provide the ability to restrict which host can access which repository, the attacker would then gain the ability to read data from any repository that uses the same management password. This could be mitigated by using separate rclone instances and directories/buckets for each host. While it would limit an attacker’s range, it requires additional management and overhead.


#2

I don’t understand the problem you’re solving. The data is encrypted so an attacker per se can’t access it, but if I understand you, you are concerned that an attacker could delete keys so as to prevent you from accessing your data. Couldn’t you just recreate the public key? (Obviously, the attacker has no access to the private key.)

Perhaps you could write up the problem that you are solving, then people could comment on the proposed solution.


#3

You’re correct, I didn’t provide a problem statement. This was just a dump of my internal documentation in case someone else finds it useful. I’ll add that in.


#4

I’ve added a problem statement to the original post.


#5

No, just run rclone serve restic on an isolated machine with a low attack surface (e.g. an EC2 t3.nano with suitably restricted security groups) against a cloud storage backend like S3.

If by “onerous” you mean “run this one command” then okay:


Either of these approaches is more than enough to address your concerns, is incredibly simple, and doesn’t add all of the complexity that you propose here.

Both together (append-only server with S3 versioning, and the EC2 having an IAM role suitably configured not to allow version deletion) would be even better.
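For illustration, a role policy along those lines might look like the sketch below, written as a Python dict so it can be dumped to JSON. The bucket name is a placeholder, and the exact statement set is an assumption about what "suitably configured" would mean here, not a vetted policy.

```python
import json


def backup_writer_policy(bucket):
    """IAM policy sketch: allow reads and writes, never allow version deletion."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    objects_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                # On a versioned bucket, s3:DeleteObject only adds a delete
                # marker; restic needs it to remove its lock files.
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": objects_arn,
            },
            {
                "Sid": "DenyVersionDeletion",
                "Effect": "Deny",
                # Permanently removing a version, suspending versioning, or
                # editing lifecycle rules would all undermine recovery.
                "Action": [
                    "s3:DeleteObjectVersion",
                    "s3:PutBucketVersioning",
                    "s3:PutLifecycleConfiguration",
                ],
                "Resource": [bucket_arn, objects_arn],
            },
        ],
    }


if __name__ == "__main__":
    # "example-restic-bucket" is a hypothetical bucket name.
    print(json.dumps(backup_writer_policy("example-restic-bucket"), indent=2))
```

Note that overwrites via `s3:PutObject` remain possible under this sketch; with versioning enabled they create a new version rather than destroying the old one.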


#6

That’s almost exactly what I was proposing. However, just because it appears to have a low attack surface doesn’t mean that it’s completely secure. If that is the only backup and someone does gain access to it, you could lose your backups. It is fair to say, though, that you would still have the hosts that the backups originally came from. Whether or not that is a solution depends on the purpose of your backups.

While s3-pit-restore seems like a solution, and probably is for most people, it still doesn’t mitigate the risk. Most people don’t have to worry about a motivated attacker, but some do. If the attacker were motivated, they may slowly corrupt the files you have in S3 over time. Say an attacker corrupts a random file when they gain access. Then, a couple of weeks later, they corrupt files that were just created. You could happily run most restic operations without any indication of corruption. Unless you run a prune, a check with --read-data, or a restore that happens to read the corrupt chunk, you wouldn’t know. Say you’ve discovered the corruption and now want to go back in time. What point-in-time do you go to? Do you still have those versions? Setting up versioning without a lifecycle policy is unlikely.

I’m not saying everyone needs this. It’s mostly for the paranoid and high-risk people and organizations. For my use case, the suggestions I had found for wrapping a restic repository with security were not sufficient. I believe this does meet my requirements, but would like to know where it still falls short of making restic backups as secure and resilient as possible. I also appreciate less complex alternatives. Second to this, I would suggest S3 with versioning and policy to deny the API keys from deleting versions. I’m not sure having an rclone instance in there provides any benefit.

Keep the discussion going!


#7

I feel like restic-aware slow corruption would be less likely to take the form of “slowly corrupt data files” and more likely to be “slowly replace snapshots with snapshots that contain nothing/garbage” as that would be harder to detect. check --read-data wouldn’t even detect that, and a prune would happily discard the good data that’s not used anymore.

If you want to use restic, I’m not sure how much more secure you could get than an append-only server on an EC2 that doesn’t have any ports open except the restic server port, and with a whitelisted set of IPs. Compromise would then require either a vulnerability in rclone (certainly possible) or compromise of the AWS account (at which point you’ve already lost; the attacker can just delete the S3 bucket, unless you use a different AWS account for the bucket).

For what it’s worth, I do have one local backup that is synced to B2 (so I guess 2-1-1?) but this is more because I don’t want to pay B2 egress rates to restore unless my local backup is destroyed.

The production servers for my company use the same system with a dedicated backup server, whose storage is in a RAID10. So I would guess that qualifies as 3-2-1. It’s still primarily about avoiding egress fees when restoring backups, but the IAM policies still help to avoid an attacker being able to destroy the backups.


#8

I must be misunderstanding the documentation. Can you explain what the purpose of check --read-data is?

Your point about an AWS account being compromised is valid assuming the account is not restricted. While I wouldn’t recommend having a single account with that level of access to both resources, I’m also not sure it matters. If an attacker can get to either, they can corrupt or destroy the off-site backups, which could impact multiple hosts, the entire site, or the entire organization.

It’s also valid to say that we can greatly decrease the external attack surface of the restic REST server by using whitelists and firewalls. Whether implemented in AWS, Digital Ocean, on-premise, or otherwise, we’re still talking about the same thing. It doesn’t matter where it is, the goal is to create a secure off-site backup via an append-only restic REST service, but still allow repository maintenance. However, when that host fails or is compromised, it shouldn’t disrupt the organization’s ability to operate.


#9

This command ensures that all of the data in the repository can be read, and that the data’s SHA256 sum matches its ID, in addition to other checks, such as making sure that all object references actually exist. Basically, the operation asks, “can I successfully and fully restore every snapshot in the repository?”

This does not ask, "do all snapshots actually contain useful backups?"

This will not detect cases where a snapshot has been rewritten to contain nothing, for example – as long as the new snapshot has the correct ID according to its contents and references an empty tree already in the repository, restic check won’t see any problems because the data is valid.

This is the difference between corrupting an existing file (check would notice that) and replacing snapshots with new, valid snapshots that don’t contain a useful backup (check would see those as valid snapshots).
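A toy model may make the distinction concrete. The sketch below is not restic's real repository format; it only assumes, as described above, that every stored object is named by the SHA256 of its contents, and it uses made-up JSON for snapshots and trees.

```python
import hashlib
import json


def put(store, data):
    """Store bytes under the hex SHA256 of their contents and return that ID."""
    oid = hashlib.sha256(data).hexdigest()
    store[oid] = data
    return oid


def check(store, snapshot_ids):
    """Integrity only: every object matches its ID and every referenced tree exists."""
    for oid, data in store.items():
        if hashlib.sha256(data).hexdigest() != oid:
            return False
    for sid in snapshot_ids:
        if sid not in store:
            return False
        tree = json.loads(store[sid])["tree"]
        if tree not in store:
            return False
    return True


if __name__ == "__main__":
    store = {}
    good_tree = put(store, json.dumps({"files": ["db.sql", "home.tar"]}).encode())
    empty_tree = put(store, json.dumps({"files": []}).encode())
    good_snap = put(store, json.dumps({"tree": good_tree}).encode())
    print(check(store, [good_snap]))

    # Attacker swaps in a valid snapshot that points at the empty tree.
    bad_snap = put(store, json.dumps({"tree": empty_tree}).encode())
    print(check(store, [bad_snap]))

    # Attacker flips one byte in a stored object instead.
    store[good_tree] = b"X" + store[good_tree][1:]
    print(check(store, [good_snap]))
```

The first two checks pass and only the third fails: the substituted snapshot is perfectly valid by hash, it just doesn't back anything up.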

And this is why append-only mode is of critical importance. With this mode enabled, an attacker replacing a snapshot with a useless-but-valid snapshot won’t be able to remove the good snapshot from the offsite repository.

Not with appropriate restrictions and versioning on the S3 bucket. Then it would require a compromise of the account that owns the bucket; compromising the account that owns the EC2 would, at most, allow the attacker to sabotage the REST server system. This would disable (bad) or possibly silently corrupt (worse) new backups, but they would not be able to permanently erase existing backups.

There are bulletproof ways to detect this situation, however – one would be to build an AMI for a configured and ready-to-go REST server, and have the machine be recycled daily using offsite scripts. AMIs cannot be altered once created, so you have confidence that the image is pure.

Of course, if an attacker was able to compromise the EC2, then the AMI is probably vulnerable and it could be easily compromised again. It is necessary in any backup system to have periodic sanity checks performed by a human to make sure that there are good backups being taken and that backups can be restored. (Aside from attackers, this is also a good way to verify that the system is even working correctly to begin with. Bugs or misconfigurations can cripple an existing backup system.)


#10

I see where we are misunderstanding each other. If I was to attack a restic repository with the intention of corrupting it and doing so silently, I’d go after the data blobs/packs, not the snapshots. I’ll admit that most S3 providers make it fairly easy to detect this sort of attack, but you’d have to be aware of the risk and specifically monitor for it. I prefer prevention over detection where feasible.


Let us loop back around to the purpose of both append-only and versioning.

Versioning

I think we agree that the point of implementing versioning on the bucket storing the restic repository is to allow for recovery in the event that an underlying object (i.e. pack, snapshot, key) is deleted or modified in the bucket. The thought is that you could identify when the change was made and roll back the change to recover the original contents of that object, restoring the repository. As you’ve pointed out, this could be done across an entire bucket.

The problem I see with this is that of lifecycle policies. Though not entirely infeasible, and even required in some industries, most individuals and companies don’t want to keep every backup ever created. To be financially responsible, as you do pay monthly for every bit stored in cloud storage, they would rather define some sort of retention. As such, a bucket with versioning will likely have a lifecycle policy defined on it that will clean up old versions and deleted versions of objects.
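To make the retention concern concrete, here is a sketch of the kind of lifecycle rule I have in mind, expressed as a Python dict in the shape S3's lifecycle configuration expects. The 30-day figure and the rule ID are arbitrary examples, and the helper `check_runs_in_time` is just my own illustration of the constraint: whatever detects tampering has to run more often than old versions expire.

```python
def noncurrent_expiry_rule(days):
    """Lifecycle configuration that purges old object versions after `days`."""
    return {
        "Rules": [
            {
                "ID": f"expire-noncurrent-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                # Overwritten or deleted versions are removed after `days`.
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
                # Clean up delete markers once no versions remain behind them.
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            }
        ]
    }


def check_runs_in_time(check_every_days, noncurrent_days):
    """True if an integrity check always runs before a good version can expire."""
    return check_every_days < noncurrent_days


if __name__ == "__main__":
    config = noncurrent_expiry_rule(30)  # 30 days is an arbitrary example
    print(config["Rules"][0]["NoncurrentVersionExpiration"])
    print(check_runs_in_time(7, 30))   # weekly check, 30-day retention
    print(check_runs_in_time(90, 30))  # quarterly check, 30-day retention
```

With a quarterly check and 30-day expiry, a corrupted object's last good version could be gone long before anyone notices.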

Let us consider 2 cases: 1) a data object is deleted and 2) a data object is corrupt

When a data object is deleted, backup, restore, forget, prune, diff, etc. can all run without any indication. While check will find that the object no longer exists and throw an error, it’s also a bit API call heavy. However, running check within the interval of deleted object cleanup would provide an indicator that there is an issue.

When a data object is modified (e.g. changing a single bit anywhere but the metadata at the tail), all of the above operations are still likely to execute successfully. However, check will also execute without any indication that a pack has been tampered with. To detect that sort of tampering, the hash of each blob would have to be verified (e.g. check --read-data). Since doing this regularly across a bucket is expensive, it may not occur before the old version is removed by a lifecycle policy. I’m also not sure if prune verifies the blob hashes as it repacks them. Assuming that it doesn’t, the corruption could move forward through repository maintenance.

Append-only

Using an append-only server is necessary to have repository security when faced with a backend that doesn’t provide similar protections. However, with versioning enabled on a bucket and no intention of removing old objects, it seems pointless. If we rely on versioning to handle the case where an object is overwritten or deleted, what protection is provided by the append-only server? I’m failing to see the necessity in the presence of versioning, but would like to be sure that I understand.

In my proposal, we would forgo versioning and instead use the append-only server. Versioning on the bucket could be enabled, but would be nothing more than a delayed deletion (oh, Recycle Bin. I’ve always disabled you too.)


#11

That would be a bit silly, given that restic check --read-data will detect your attack, whereas it won’t detect substituting valid-but-useless snapshots for good ones.

restore will notice if the data object is part of a file being restored.

That depends where you run the operation. If you’re running it outside of AWS then you have to pay egress data rates. If you run it inside AWS you have to pay compute rates. The compute rates will likely wind up being much cheaper, and you don’t have to provision an EC2 except when checking. You could even use spot instances to check, dramatically lowering the cost of the check operation.

Note that, even considering all of these points, your 3-2-1 design doesn’t remove these vulnerabilities.

  • If your local copies get corrupted by an attacker, that corruption is replicated to S3. Both copies are corrupted.
    • Of course, one can run restic check --read-data on the local copy on a regular basis since it’s ~free.
    • But without versioning of the local data, you can’t recover a corrupt pack. S3 with versioning has an advantage here.
  • If the S3 copy gets corrupted and the mtime/size doesn’t change on S3, the S3 copy will remain corrupted until the pack is deleted or rewritten. (Since the rewrite would happen on the local copy, this would heal the S3 repository.)
    • This is only a problem if the local copies get destroyed; however, the purpose of having an off-site backup is to recover if the local copies are lost. In that case, the off-site backup would also be no good.

So I would submit that you still need to scrub the S3 copy regularly regardless of whether it is your primary or secondary repository copy.


#12

Maybe, that depends on a lot of factors. I feel like the absence of data in backup reports would raise suspicion. How often do you run restic check --read-data?

That is a good point, but it is vendor specific. I’m not sure it makes sense to store restic backups in AWS S3 when Wasabi and B2 are so much cheaper. I also don’t know if restic would play well with Glacier, but Glacier would change a lot about this anyway.

I’m sorry if I wasn’t clear. The local and off-site repositories would be independent of each other, not one a replica of the other. If we were to just replicate a repository, your points are correct.


You’ve given me some new options. An alternate design might be to set up a pseudo append-only bucket policy with versioning, as you’ve suggested, instead of using an append-only server. To deal with retention and corruption, a VM with no/low fees to/from the chosen storage vendor (e.g. AWS S3 and EC2, Wasabi and Packet) can be launched on an interval. The VM would revert versioned objects where they were overwritten, run restic check --read-data, run restic forget --prune, then clean up any objects versioned from deletes. This would simplify the backup infrastructure to the off-site repository and better secure it. The maintenance VM wouldn’t require inbound connectivity nor would it be running all the time, removing that attack surface.


#13

If you’re keeping a record off-host of which snapshots you expect to see, and regularly comparing them, then yes, this situation could also be detected.
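A minimal sketch of that comparison, assuming the off-host record is simply a list of snapshot IDs and that the current state comes from the JSON array printed by restic snapshots --json (each entry carrying an "id" field). The sample IDs are fabricated for the example, and how the record is stored is left open.

```python
import json


def compare_snapshots(expected_ids, snapshots_json):
    """Return (missing, unexpected) snapshot IDs relative to the off-host record."""
    seen = {snap["id"] for snap in json.loads(snapshots_json)}
    expected = set(expected_ids)
    missing = sorted(expected - seen)      # recorded earlier, gone now: investigate
    unexpected = sorted(seen - expected)   # new since the record was last updated
    return missing, unexpected


if __name__ == "__main__":
    # Fabricated IDs standing in for real 64-character snapshot IDs.
    record = ["aaa111", "bbb222"]
    listing = json.dumps([
        {"id": "aaa111", "paths": ["/srv"]},
        {"id": "ccc333", "paths": ["/srv"]},
    ])
    missing, unexpected = compare_snapshots(record, listing)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

Unexpected IDs are normal after each new backup and just need to be added to the record; a missing ID that was not removed by your own forget run is the signal worth alerting on.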

Often enough that corrupt data would not have been purged from S3 by lifecycle management rules.

It doesn’t know how to talk to Glacier directly, and that wouldn’t work well anyway since you have to wait 1-5 minutes for retrieval jobs, and that’s at the most expensive retrieval tier. You would be better off storing in S3 and using lifecycle management rules to transition only files under the data/ prefix to Glacier after some number of days. I believe for this to work, you would have to disable the use of parent snapshots for backups, since restic needs to read the tree objects under data/ to see what might have changed.

However, since Glacier has a 90-day minimum for stored items, pruning at all has the potential to incur extra charges.

Basically, a very bare minimum of functionality would work in real-time and the rest would require expensive retrievals and/or early deletions.

(B2 is only $0.001/GB-month more expensive for storage than Glacier, anyway… and egress is significantly cheaper.)

Aha, so you’re running restic backup twice then, and the repositories have totally different master keys and snapshot IDs? That would definitely be sufficient to keep corruption from being synced, since no syncing is happening.

I’d change the order of these operations:

  1. Revert corrupted objects.
    • Note that all you have to do to detect this is fetch each file in the repository and compare its SHA256 sum to its filename. If the sum doesn’t match, look for a prior version to restore and run the same test on that version. If the sum matches then the file is intact and any prior versions would be redundant anyway.
    • This is safe to do on a repository that is being written to, because S3 uploads are all-or-nothing; no partially-completed uploads will be visible.
  2. restic forget --prune
  3. restic check --read-data
    • Swapping these allows check to do less work, since it doesn’t have to verify the integrity of data we’re going to discard anyway.
  4. Delete all prior versions of all objects.
    • This may not be safe to do on a repository during a backup. If an attacker manages to corrupt a pack that was uploaded after step 1 was completed but before step 4 begins, you would delete the good version. The window for such an attack is not large, however, as steps 2 and 3 require an exclusive lock so no backups would be taking place until step 3 completes.
    • To be honest, I would recommend skipping this step unless it will save you a substantial amount of money. Letting the lifecycle rules run their course is safer. A bug in your “remove all prior versions” script could easily destroy the entire repository.
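The test in step 1 can be sketched as follows. This assumes the versions of one object have already been fetched as bytes, newest first; the fetching itself (listing and downloading S3 object versions) is left out, and the function name is my own. Files that are not named after their hash, such as config, would need to be skipped.

```python
import hashlib


def newest_intact_version(name, versions):
    """Index of the newest version whose SHA256 equals `name`, or None.

    `versions` holds the object's contents as bytes, newest first.
    0 means the current version is fine; a larger index is the version to
    restore; None means no stored version matches the file name.
    """
    for index, data in enumerate(versions):
        if hashlib.sha256(data).hexdigest() == name:
            return index
    return None


if __name__ == "__main__":
    original = b"pack file contents"
    name = hashlib.sha256(original).hexdigest()
    tampered = b"pack file c0ntents"

    print(newest_intact_version(name, [original]))            # current is intact
    print(newest_intact_version(name, [tampered, original]))  # restore version 1
    print(newest_intact_version(name, [tampered]))            # nothing to restore
```

Once an intact version is found the search stops, since any older versions of the same name would be redundant.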
