As a result of a recent failure I’m overhauling my backup restore process. I thought others might find it interesting or useful, and might be able to help improve it.
The Problem
Until recently I thought I had quite a good backup and restore test system. My PC and servers are backed up using restic, and sometimes another method, to multiple destinations - local hard drive, external hard drive, offsite external hard drive, and AWS S3. I have scripts on my computer that I run to test restoring a few files from those repos.
My Raspberry Pi 4 home automation server failed recently - the six-month-old M.2 SSD died, which became obvious when I restarted it. This server is backed up nightly to S3 using restic and I can see the snapshots appearing in S3 if I look. “No problems!” I thought. “I’ll get a new disk and restore from my backups”.
When I went to restore my files I ran into a problem… I didn’t have a copy of the restic repo password anywhere. I have passwords for many other repos, but not this one. Fortunately, after a few tries the server booted and stayed up for about a minute before it crashed, which was long enough for me to retrieve the repo password. If I hadn’t managed this I’d have lost months of work - docker compose files, docker app configuration (Home Assistant, PostgreSQL, Pi-hole, nginx, etc), AppDaemon apps, DHCP allocations, etc.
Looking at my backups I also discovered that my web server backups had stopped working. I had edited the script that cron calls and made a syntax error. There was no error reporting.
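One way to avoid a silent failure like this is to have cron call a small wrapper that checks the exit status of the backup script and reports anything non-zero. The sketch below is only an illustration: the function name `run_and_report` and the `notify` hook are my own inventions, and the default just prints to stderr (which cron will normally email if `MAILTO` and a mail transfer agent are set up).

```python
import subprocess
import sys


def _to_stderr(message):
    # Default reporter: cron usually mails anything a job prints.
    print(message, file=sys.stderr)


def run_and_report(command, notify=_to_stderr):
    """Run a backup command and call notify() with details if it fails.

    Returns True on success, False on any failure.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except OSError as exc:
        # Raised when the script is missing or not executable.
        notify(f"backup could not start: {command!r}: {exc}")
        return False
    if result.returncode != 0:
        # A shell syntax error also ends up here, as a non-zero exit.
        notify(
            f"backup failed (exit {result.returncode}): {command!r}\n"
            f"{result.stderr.strip()}"
        )
        return False
    return True


# Example call (hypothetical path, substitute the script cron runs today):
# sys.exit(0 if run_and_report(["/usr/local/bin/backup.sh"]) else 1)
```

The `notify` argument could be swapped for anything that sends an email or a push notification; the important part is that a failure produces some output rather than nothing.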
Approach
For this new approach I will assume all my computers have been stolen or broken, I’ve lost my phone, and I need to restore my PC, my home servers, and my main web server. I won’t use existing scripts or anything else. I’ll do this annually.
Initial Process
To start with I’m going to go through all my computers and servers and make sure I have every restic repo location and password documented in my password store.
I run my password store on my web server, so if I lose my phone, my PC, and my server I may lose access to my passwords. To mitigate this I will keep the repo list and passwords in a KeePass vault stored on an external hard drive I keep at a friend’s house.
Regular Check Process
I have a regular reminder every three months to check the repos to make sure backups are appearing as they should, in each destination.
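The quarterly check could be partly automated by looking at the age of the newest snapshot in each repo. Below is a minimal sketch that takes the text produced by `restic snapshots --json` and decides whether the repo looks stale. The function names and the two-day threshold are assumptions of mine, not anything restic provides; the only thing relied on from restic is the `time` field in each snapshot entry.

```python
import json
import re
from datetime import datetime, timedelta, timezone

# Timestamp with optional fractional seconds and either "Z" or a numeric offset.
_TS = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})$"
)


def parse_restic_time(value):
    """Parse a snapshot timestamp, keeping at most microsecond precision."""
    match = _TS.match(value)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    base, fraction, offset = match.groups()
    # Pad or truncate the fraction to exactly six digits for datetime.
    micros = (fraction or "").ljust(6, "0")[:6]
    offset = "+00:00" if offset == "Z" else offset
    return datetime.fromisoformat(f"{base}.{micros}{offset}")


def latest_snapshot_age(snapshots_json, now=None):
    """Return the age of the newest snapshot, or None if there are none."""
    snapshots = json.loads(snapshots_json)
    if not snapshots:
        return None
    newest = max(parse_restic_time(s["time"]) for s in snapshots)
    now = now or datetime.now(timezone.utc)
    return now - newest


def is_stale(snapshots_json, max_age=timedelta(days=2), now=None):
    """True if the repo has no snapshots or the newest one is too old."""
    age = latest_snapshot_age(snapshots_json, now)
    return age is None or age > max_age
```

Feeding it the JSON output for each repo on the list would have flagged the broken web server backups within a couple of days, instead of waiting for the next manual check.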
Restore Process
Every year, on a new PC / VM, I will set up everything required to test my restores. I’ll install the AWS CLI, restic, and go through the list of repos to restore a couple of recent files from each repo.
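For the restore test itself, something like the following could walk the repo list. It is a rough sketch under a few assumptions: restic is on the PATH, the repo password is supplied through the `RESTIC_PASSWORD` or `RESTIC_PASSWORD_FILE` environment variable, and AWS credentials are already in the environment. The helper names are hypothetical.

```python
import os
import subprocess
import tempfile


def restore_command(repo, target_dir, include_path, snapshot="latest"):
    """Build the restic command that restores one path from a snapshot."""
    return [
        "restic", "-r", repo, "restore", snapshot,
        "--target", target_dir, "--include", include_path,
    ]


def restored_files(target_dir):
    """Return sorted (relative_path, size) pairs for files under target_dir."""
    found = []
    for root, _dirs, names in os.walk(target_dir):
        for name in names:
            full = os.path.join(root, name)
            found.append(
                (os.path.relpath(full, target_dir), os.path.getsize(full))
            )
    return sorted(found)


def run_restore_test(repo, include_path, runner=subprocess.run):
    """Restore include_path from the latest snapshot into a temp directory.

    Returns the list of restored files; an empty list means the test failed.
    """
    with tempfile.TemporaryDirectory() as target:
        result = runner(restore_command(repo, target, include_path))
        if result.returncode != 0:
            return []
        return restored_files(target)


# Example call (bucket name and path are placeholders):
# run_restore_test("s3:s3.amazonaws.com/example-bucket", "/home/pi/docker")
```

The `runner` parameter only exists so the function can be exercised without a real repo. An empty result for any repo on the list means either the restore failed or the password or location I have documented is wrong, which is exactly what this annual test is meant to catch.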
More Copies
I’m also going to save a copy of the important files from my server to my PC occasionally, extracted from restic and zipped. That way if I manage to break things in a weird way I should at least have an older copy of the files.
I don’t trust all my data to any one tool. As well as restic providing a backup, I also upload my most important files to S3 in a versioned bucket once a week. That way I can also access them remotely if I need to.
Opinions
Does anyone have any suggestions to improve this process?