Critique my backup strategy

flea · March 11, 2020, 10:33am

Hi all!
And thanks to Alexander @fd0 for a great piece of software. I love the simplicity of it. Commands are simple to understand. Different backends are a breeze. Backups are fast.

Just wanted to check if my backup strategy is sound. Would you guys do anything differently?

Weekly: OFFSITE backup to B2 on every Monday at 2:23 am
Weekly: LOCAL backup to a different local machine on every Tuesday at 2:23 am
Bi-monthly: prune and check indexes OFFSITE B2 on the first Wednesday of every other month at 2:23 am
Bi-monthly: prune and check indexes LOCAL backup on the second Wednesday of every other month at 2:23 am
Every six months: verify all files with check --read-data OFFSITE backup on the first Thursday of every six months at 2:23 am
Every six months: verify all files with check --read-data LOCAL backup on the second Thursday of every six months at 2:23 am

If someone wants to copy my strategy, my crontab is attached below.

MAILTO="flea@my.domain"
SHELL=/bin/bash
#
# weekly OFFSITE backup to B2 on every Monday at 2:23 am (using variables in .restic-env)
23 2 * * 1 . /home/flea/.restic-env; /usr/bin/restic backup /zpool/Silo/; /usr/bin/restic forget --keep-hourly 24 --keep-daily 7 --keep-monthly 24
#
# weekly LOCAL backup to local machine on every Tuesday at 2:23 am (using variables in .restic-env-local)
23 2 * * 2 . /home/flea/.restic-env-local; /usr/bin/restic backup /zpool/Silo/; /usr/bin/restic forget --keep-hourly 24 --keep-daily 7 --keep-monthly 24
#
# prune OFFSITE B2 on the first Wednesday of every other month at 2:23 am (using variables in .restic-env)
23 2 1-7 */2 * if [ `date +\%u` = 3 ]; then . /home/flea/.restic-env; /usr/bin/restic prune; /usr/bin/restic check; fi
#
# prune LOCAL backup on the second Wednesday of every other month at 2:23 am (using variables in .restic-env-local)
23 2 8-14 */2 * if [ `date +\%u` = 3 ]; then . /home/flea/.restic-env-local; /usr/bin/restic prune; /usr/bin/restic check; fi
# 
# check OFFSITE backup on the first Thursday of every six months at 2:23 am (using variables in .restic-env) 
23 2 1-7 */6 * if [ `date +\%u` = 4 ]; then . /home/flea/.restic-env; /usr/bin/restic check --read-data; fi
#
# check LOCAL backup on the second Thursday of every six months at 2:23 am (using variables in .restic-env-local)
23 2 8-14 */6 * if [ `date +\%u` = 4 ]; then . /home/flea/.restic-env-local; /usr/bin/restic check --read-data; fi
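
A note on the `date +\%u` guard used in these entries: in common cron implementations, when both the day-of-month and day-of-week fields are restricted, the job runs when either one matches, so "first Wednesday" cannot be expressed with the fields alone. That is why the day range goes in the cron fields and the weekday is tested in the command (with `%` escaped as `\%` inside a crontab). The sketch below shows the same test as a standalone function. `is_nth_weekday` is a hypothetical helper name, and it assumes GNU `date` for the `-d` and `%-d` options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if DATE is the Nth occurrence of WEEKDAY
# (1 = Monday ... 7 = Sunday, as printed by `date +%u`) in its month.
# Assumes GNU date, which accepts `-d YYYY-MM-DD` and the `%-d` format.
is_nth_weekday() {
    local date_str=$1 nth=$2 weekday=$3
    local dow dom
    dow=$(date -d "$date_str" +%u)    # ISO weekday, 1..7
    dom=$(date -d "$date_str" +%-d)   # day of month without zero padding
    # The Nth occurrence of any weekday falls on days (N-1)*7+1 .. N*7.
    [ "$dow" -eq "$weekday" ] &&
        [ "$dom" -ge $(( (nth - 1) * 7 + 1 )) ] &&
        [ "$dom" -le $(( nth * 7 )) ]
}

# Example: 11 March 2020 was the second Wednesday of the month.
if is_nth_weekday 2020-03-11 2 3; then
    echo "second Wednesday"
fi
```

Note that `%u` never prints 0; Sunday is 7. With `%w` instead, Sunday is 0.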

ProactiveServices · March 11, 2020, 8:34pm

In my automation I only ever run forget/prune after a successful --read-data check, or the last of a --read-data-subset check, just to be safe. I also run a restic unlock before operations, then check for existing locks before proceeding, and run a cache --cleanup infrequently, usually after a forget/prune run. Depending on the repo I run a regular check at the end of every day, or week.

flea · March 11, 2020, 9:59pm

Cool! Thanks! Care to share exactly how you automated things? On what schedule? In what order? And maybe most important, why that strategy? Just started out with restic, and trying to wrap my head around best practices.

ProactiveServices · March 11, 2020, 10:48pm

I always go for a conservative approach: check the exit code every time you call restic, and on failure bomb out/send email/warn in logs. It’s critical to know if something goes wrong.
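
As a rough illustration of the exit-code idea (not the poster's actual script), a small wrapper can run each step, log the result, and hand the exit code back to the caller. The function name `run_step` and the log path are assumptions made up for this sketch, and `true`/`false` stand in for the real restic calls.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run one command, log the outcome, and pass the
# exit code back so the caller can decide whether to continue.
LOG_FILE=${LOG_FILE:-/tmp/backup-wrapper.log}   # assumed log location

run_step() {
    local label=$1; shift
    local rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "$(date '+%F %T') OK   $label" >> "$LOG_FILE"
    else
        echo "$(date '+%F %T') FAIL $label (exit $rc)" >> "$LOG_FILE"
    fi
    return "$rc"
}

# Stand-in commands; in real use these would be the restic calls.
run_step "demo success" true
run_step "demo failure" false || echo "stopping: previous step failed" >&2
```

When run from cron, anything the job prints is mailed to the `MAILTO` address, so writing the failure message to stderr doubles as a simple alert.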

Similarly, log successes, but be sure not to swamp yourself. Knowing that it usually takes 75 minutes for a backup is useful if one day it suddenly jumps to three hours for no reason. Disk failing? Network crapping out? Something may be amiss. Also if a backup takes two seconds…that might be a bad thing.
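
One possible way to catch the "suddenly three hours" or "only two seconds" cases is to time each run and compare it against an expected window. This is only a sketch: `duration_ok` is a hypothetical helper, the bounds are illustrative, and `sleep 1` stands in for the backup command.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if a run time (in seconds) lies
# within the expected window. The bounds used below are illustrative.
duration_ok() {
    local elapsed=$1 min=$2 max=$3
    [ "$elapsed" -ge "$min" ] && [ "$elapsed" -le "$max" ]
}

start=$(date +%s)
sleep 1                      # stand-in for the real backup command
elapsed=$(( $(date +%s) - start ))

# Expect between 1 second and 2 hours in this toy example.
if duration_ok "$elapsed" 1 7200; then
    echo "run time ${elapsed}s is within the expected range"
else
    echo "WARNING: run time ${elapsed}s is unusual" >&2
fi
```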

After a backup, run a check if your repo size and link allows. If you’re backing up every hour then perhaps run a check on the last run for the day. If the repo is large and/or the link slow, perhaps once a week.

Before I run any forget or prune on the repo I prefer to complete a check --read-data and then only proceed if there are no errors, and I also run a check afterwards. Again, if a check --read-data takes too long for your use case, perhaps use the subset check to break down the process, and run maintenance after the last subset completes OK.

As it stands for some of my local repo groups I have the following rough schedule. All operations are logged and any failures are also emailed to me.

Hourly backup from 07:00 to 01:00.
At midday, the log of the hourly backup is emailed to me as a “heartbeat” so I know things haven’t silently given up.
At 02:00 a check is performed. If today is a Sunday then a check --read-data-subset is run instead, for x/4 depending on which Sunday of the month it is (Fifth Sundays just do a standard check).
If the fourth Sunday’s check --read-data-subset is OK, I run a forget then a prune, then a check.
All Sunday maintenance sends a heartbeat email.

This way the overall process is not bogged down by drawn-out verification, but the repo data is gradually read and checked over four weeks so any corruptions are recognised fairly early. This methodology helped alert me to a broken blob on the day it appeared and allowed me to repair the repo early.
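
The Sunday rotation described above could be sketched roughly as follows. This is a guess at the logic, not the poster's script: the helper names are hypothetical, and `%-d` assumes GNU `date` (it avoids zero-padded values such as `08`, which shell arithmetic would reject as invalid octal).

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a day of the month to "which occurrence of
# this weekday is it" (1..5). Days 1-7 give 1, days 8-14 give 2, etc.
occurrence_in_month() {
    local dom=$1
    echo $(( (dom - 1) / 7 + 1 ))
}

# Print the restic check arguments for a given Sunday. The fifth Sunday
# falls back to a plain check, as in the schedule described above.
check_args_for_sunday() {
    local n
    n=$(occurrence_in_month "$1")
    if [ "$n" -le 4 ]; then
        echo "check --read-data-subset=$n/4"
    else
        echo "check"
    fi
}

dom=$(date +%-d)   # GNU date: day of month without zero padding
echo "restic $(check_args_for_sunday "$dom")"
```

The script only prints the command it would run, so the rotation can be tried out without touching a repository.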

This is on a Linux system where a shell script, which accepts parameters, runs the restic calls and handles the logging and failure logic. Systemd service units call the script with the parameters for each type of operation (backup, heartbeat backup, check etc.), and systemd timers hold the calendaring logic.

I’d suggest start off simple for any area you’re unfamiliar with, and build up the methods as you go. Trying to do everything in one go will make it a painful experience!

flea · March 12, 2020, 9:47am

Thank you! That’s great. I will amend my strategy. Still doing the 3-2-1 strategy. But with more verification. How does this (much simpler schedule-wise) strategy look to you?

Mondays: LOCAL backup & check indexes at 2:23 am
Tuesdays: OFFSITE B2 backup & check indexes at 2:23 am
First Wednesdays of month: LOCAL check --read-data at 2:23 am
Second Wednesdays of month: OFFSITE check --read-data at 2:23 am
First Sundays of month: LOCAL forget; prune; check indexes at 2:23 am
Second Sundays of month: OFFSITE forget; prune; check indexes at 2:23 am

Thanks for your time!

Francis

rawtaz · March 12, 2020, 11:57am

Seems complicated to me. Here’s what I do:

Whenever I’ve done something on my computer that is worth backing up (pretty much every day of course), I fire off two backup scripts in my terminal. All those two scripts do is back up my ~ (with some exclusions) to two different external repositories. One is a FreeBSD with ZFS on it, and has never failed, the other is just a Linux with EXT4 on it. Prune and check I do every now and then when I feel like it.

Each backup takes like three minutes to run for normal changes, would take a few more minutes if I’m on a slow connection or changed many GBs of data. Why make it more complicated than this?

I have other users that aren’t terminal-savvy, and they have a little icon to click that starts the backup process. Equally easy for them.

flea · March 12, 2020, 12:30pm

Thanks @rawtaz!

Yeah, I get you! Simplicity is king.

This is important data that is updated very infrequently. One week of losses would not be a problem. But a complete loss of the data would be irreplaceable. I would probably forget to back it up if I did restic backup manually.

For me it’s easier to just set up a backup strategy in cron on my cobbled together Debian NAS with zfs. The NAS folder is backed up every week: 1) locally to an OSX server and 2) to B2.

restic check (only indexes) is run every week right after the backup (quite cheap verification). The restic check takes like a minute locally.

Every month I plan to do some more verification with restic check --read-data on both local and B2 repositories. After restic check --read-data I plan to do restic forget and restic prune, finishing off with a restic check. This might take some time, but it’s only about 150 GB of data. We will see.

I don’t want to rclone the local and B2 repositories but run them independently. If one is borked, the other backup might still work.

See any big holes in the strategy?

For those looking to steal the schedule:

#
# weekly LOCAL backup and check every Monday at 2:53 am (using variables in .restic-env-local)
53 2 * * 1 . /home/flea/.restic-env-local; /usr/bin/restic backup /zpool/Silo/; /usr/bin/restic check
#
# weekly OFFSITE backup and check to B2 on every Tuesday at 2:23 am (using variables in .restic-env)
23 2 * * 2 . /home/flea/.restic-env; /usr/bin/restic backup /zpool/Silo/; /usr/bin/restic check
#
# monthly full check of LOCAL backup on the first Wednesday of every month at 2:23 am (using variables in .restic-env-local)
23 2 1-7 * *  if [ `date +\%u` = 3 ]; then . /home/flea/.restic-env-local; /usr/bin/restic check --read-data; fi
#
# monthly full check OFFSITE on the second Wednesday of every month at 2:23 am (using variables in .restic-env)
23 2 8-14 * * if [ `date +\%u` = 3 ]; then . /home/flea/.restic-env; /usr/bin/restic check --read-data; fi
# 
# monthly forget, prune & check indexes of LOCAL backup on the first Sunday of every month at 2:23 am (using variables in .restic-env-local)
23 2 1-7 * *  if [ `date +\%u` = 7 ]; then . /home/flea/.restic-env-local; /usr/bin/restic forget --keep-hourly 24 --keep-daily 7 --keep-weekly 8 --keep-monthly 24; /usr/bin/restic prune; /usr/bin/restic check; fi
#
# monthly forget, prune & check indexes of OFFSITE backup on the second Sunday of every month at 2:23 am (using variables in .restic-env)
23 2 8-14 * * if [ `date +\%u` = 7 ]; then . /home/flea/.restic-env; /usr/bin/restic forget --keep-hourly 24 --keep-daily 7 --keep-weekly 8 --keep-monthly 24; /usr/bin/restic prune; /usr/bin/restic check; fi

ProactiveServices · March 12, 2020, 3:26pm

If I were you, and time permitted, I’d run the forget/prune after the read data check completes, as you know the repo is definitely consistent at that point in time. Otherwise that looks good to me

ProactiveServices · March 12, 2020, 3:28pm

Yep, but in the repos involved it has to be fully automated as the data is changed frequently and I’m obliged to ensure backup consistency. Some of my other backups aren’t quite as intensive or fully automated, as they can afford to be less stringent.

flea · March 12, 2020, 3:49pm

Thanks @ProactiveServices!

The reason I put a big time delay between restic check --read-data and forget, prune & check was to give me time to act, before forget and prune run, if the read-data check turned up errors. It gives me time to check disks, restore B2 snapshots, or just stop the forget & prune cron job to buy even more time.

What do you think about that strategy?

F

764287 · March 12, 2020, 4:45pm

You could use && to only execute the 2nd command if the 1st command returns no errors.
restic check --read-data && restic forget --prune
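
To see the short-circuit behaviour without touching a repository, here is a tiny sketch with stand-in commands. `step` is a hypothetical helper that just prints its name and returns the exit code it is given.

```shell
#!/usr/bin/env bash
# Stand-ins for the real commands so the behaviour can be seen without
# touching a repository. `step` is a hypothetical helper.
step() {
    local name=$1 rc=$2
    echo "running $name"
    return "$rc"
}

# First command succeeds, so the second one runs.
step check 0 && step forget-prune 0

# First command fails, so the second one is skipped.
step check 1 && step forget-prune 0
echo "exit status of the chain: $?"
```

By contrast, commands separated by `;` run one after another regardless of whether the previous one failed.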

flea · March 12, 2020, 5:54pm

Thanks @764287! Mind blown! LOL. That’s a great idea.

Just a thought: aren’t restic check --read-data and restic forget --prune the most expensive (timewise/cpu-wise/bandwidth) operations that restic can do? I’m worried it will take all week. But on the other hand I only have 150 GB of data to back up, so maybe it won’t be that bad?

ProactiveServices · March 12, 2020, 6:37pm

Best to get an idea of how long each operation currently takes, how you expect that to change, and plan accordingly. The reason I automate prunes after successful read data checks is that in the example I gave, the process is all automated. If I’m not able to look into a repo problem I don’t want to potentially cause further damage. I also know how easy it is to forget to keep up with some processes when other work or life gets in the way!

flea · March 12, 2020, 6:58pm

You’re right @ProactiveServices. You and @764287 convinced me to try the full automation.

I’m new to bash scripting so if you have the time: did I do the right thing with the && in the crontab?

I chained restic check --read-data && restic forget --prune && restic check

#
# weekly LOCAL backup and check every Monday at 2:53 am (using variables in .restic-env-local)
53 2 * * 1 . /home/flea/.restic-env-local; /usr/bin/restic backup /zpool/Silo/; /usr/bin/restic check
#
# weekly OFFSITE backup and check to B2 on every Tuesday at 2:23 am (using variables in .restic-env)
23 2 * * 2 . /home/flea/.restic-env; /usr/bin/restic backup /zpool/Silo/; /usr/bin/restic check
#
# monthly full check, then forget, prune & check of LOCAL backup on the first Wednesday of every month at 2:23 am (using variables in .restic-env-local)
23 2 1-7 * *  if [ `date +\%u` = 3 ]; then . /home/flea/.restic-env-local; /usr/bin/restic check --read-data && /usr/bin/restic forget -H 24 -d 7 -w 8 -m 24 --prune && /usr/bin/restic check; fi
#
# monthly full check, then forget, prune & check OFFSITE on the second Wednesday of every month at 2:23 am (using variables in .restic-env)
23 2 8-14 * * if [ `date +\%u` = 3 ]; then . /home/flea/.restic-env; /usr/bin/restic check --read-data && /usr/bin/restic forget -H 24 -d 7 -w 8 -m 24 --prune && /usr/bin/restic check; fi

ProactiveServices · March 13, 2020, 2:37pm

Looks OK. One thing I didn’t mention is to make sure all of your error conditions/handling work as expected, for example by renaming the repo and checking that your expected fallbacks/behaviour actually kick in.

flea · March 13, 2020, 2:53pm

Great advice @ProactiveServices. Will do testing before I leave cron alone to do its thing. Thanks for all your advice and help @ProactiveServices, @rawtaz, @764287.