Lock never goes away

LanceHaverkamp · May 1, 2019, 5:21pm

I have systemd doing a daily backup & forget, plus a monthly prune, to a B2 bucket.

I see messages about old locks, like this:
Fatal: unable to create lock in backend: repository is already locked exclusively by PID 22258 on localhost.localdomain by lance (UID 1000, GID 1000)
lock was created at 2019-05-01 00:00:26 (10h56m15.487366228s ago)
storage ID 4ad352c2

I don’t know why I’d have a lock that’s 11 hours old when I’m only doing incremental backups on a desktop. Do we have detailed documentation on how to deal with locks automatically, and how to make sure jobs still run after being blocked by a lock, so we can get to an end-user experience that’s set-it-and-forget-it easy?

betatester77 · May 3, 2019, 7:45am

I had a problem where a cron-based restic backup stopped working for several days because of an “orphaned” lock. I only noticed it by chance.

Not sure how to avoid this. I was thinking about running restic unlock via cron once a week, just to be sure.

LanceHaverkamp · May 5, 2019, 7:54pm

I’m still working on this, it seems to be a critical problem: My daily systemd timer runs once, then never runs again! Forcibly removing locks before each daily backup is a bad work-around since it might break other things…like a pruning job that could be running concurrently, on a less frequent timer.

Any idea if this is more likely a Restic problem or a B2 problem?

betatester77 · May 5, 2019, 8:11pm

I had this problem only once, with an sftp backend. Still, I ask myself what a solution could be. Backends can go down during a backup. In my case it may have been maintenance work on the backend server. Such things happen, and the lock remains while I feel safe with a daily cron job that can’t actually run.

In your case I don’t think it’s a restic problem. Does the same problem occur when you run the backup manually?

cdhowie · May 5, 2019, 9:19pm

Check that your B2 key has permission to delete files.

Also, consider having cronjob output emailed to you or a sysadmin mailing list, and also run restic backup with -q. This way the only output will be related to errors, and then you’ll get an email when something goes wrong.
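
As a rough sketch of that suggestion: a small wrapper that a cron job could call, printing nothing on success so that cron only sends mail when the backup fails. The function name backup_quietly and the backup path are made up for illustration, and the repository and password are assumed to be configured through restic's usual environment variables.

```shell
#!/bin/bash
# Hypothetical wrapper for a cron job; names and paths are placeholders.
# Assumes the repository and password are already set in the environment.

backup_quietly() {
  local target="$1"
  local output status
  # -q suppresses progress output, so anything captured is a warning or error.
  output="$(restic backup -q "$target" 2>&1)"
  status=$?
  if [ "$status" -ne 0 ]; then
    # cron mails whatever the job prints, so print only on failure.
    printf 'restic backup of %s failed (exit %d):\n%s\n' \
      "$target" "$status" "$output" >&2
  fi
  return "$status"
}

# Example call (placeholder path):
#   backup_quietly /home/lance
```

With MAILTO set in the crontab, a stuck lock would then show up as an email containing the "unable to create lock" message instead of going unnoticed.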

LanceHaverkamp · June 7, 2019, 4:01am

After manually removing the last lock, it correctly did a daily increment for 5 days, then it got stuck on a lock again. I have no idea if this is a Restic problem or a B2 problem. But as it is, there’s no way I can trust this setup for end-user off-site backups; it needs way too much hand-holding.

I can envision a solution where Restic looks to see if another instance of Restic is running, if not it could safely remove any lock, then proceed. But that’s not something I (or any other end-user) would know how to accomplish.
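
A minimal sketch of that idea as a shell function, for anyone who wants to try it: it only looks for a restic process on the local machine (using pgrep), so it says nothing about other hosts that share the repository. The function names are invented here, and this is not something restic does by itself.

```shell
#!/bin/bash
# Sketch only: clear stale locks when no restic process is running locally.
# Function names are hypothetical; other machines are NOT checked.

restic_is_running() {
  # pgrep -x matches the exact process name; exit 0 means at least one match.
  pgrep -x restic > /dev/null 2>&1
}

unlock_if_idle() {
  if restic_is_running; then
    echo "restic is still running here; leaving locks alone" >&2
    return 1
  fi
  # Plain 'restic unlock' removes only locks that restic considers stale.
  restic unlock
}

# Example: call before the daily backup.
#   unlock_if_idle && restic backup -q /home/lance
```

Because the check refuses to unlock while a local prune or backup is still running, it avoids the concurrent-job concern raised earlier in the thread, at least when all jobs run on one machine.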

sulfuror · June 7, 2019, 5:11pm

I wonder if that could be harmful. I use restic with sftp backend (one local and one remote) and never had any problems unless I’m doing some maintenance like forget and prune for one of them, and still, when the process ended all hosts go back to normal with usual backups. I’m guessing your problem is probably something like @cdhowie says. A stale lock is very rare if you’re just doing backups without doing other stuff like forget, prune or check.

At some point I had a similar problem, but it was because a backup started, didn’t finish, and then my cron executed a backup again. So I made my bash script create a “lock” file and check for it before doing anything else; if the “lock” file is present, the script exits. If you let backups run unattended, I think you’ll be better off making, or using an existing, shell script that checks for and creates the “lock” before executing, so you don’t end up with this problem again. It is pretty simple:

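#!/bin/bash
lock="/tmp/restic_lock"
# Refuse to start if a previous run left its lock file in place.
if [[ -e "$lock" ]] ; then
  echo "script is already running..."
  exit 1
fi
touch "$lock"
# Remove the lock file whenever the script exits.
trap 'rm -f "$lock"' EXIT
# Turn signals into a real exit so the script does not keep running unlocked.
trap 'exit 1' INT QUIT TERM
#######################
# REST OF YOUR SCRIPT #
#######################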

This way, when the script exits it will delete the “lock” file it created. If you’re sure you want to execute unlock first, you can do something like this:

if ! restic snapshots > /dev/null ; then
  restic unlock
fi

I don’t think this is ideal but if I’m not mistaken, snapshots is the only command that doesn’t do anything “important” that will return an error if there is a stale lock.
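
One possible variation on the lock-file script above, sketched under the assumption that flock(1) from util-linux is installed: instead of testing for the file and removing it in a trap, the script holds a kernel lock on it. The kernel drops that lock when the process dies for any reason (reboot, kill, oom-killer), so no leftover local lock file can block the next run. The function name acquire_lock and the use of file descriptor 9 are arbitrary choices for this example.

```shell
#!/bin/bash
# Sketch: hold a kernel-level lock for the lifetime of the script.
# Assumes flock(1) from util-linux; acquire_lock is a made-up name.

acquire_lock() {
  # Open the lock file on file descriptor 9, creating it if needed.
  exec 9> "$1" || return 1
  # -n: fail immediately instead of waiting if another process holds the lock.
  flock -n 9
}

# Example use at the top of a backup script:
#   acquire_lock /tmp/restic_lock || { echo "script is already running..."; exit 1; }
#   ... rest of your script ...
```

This only guards against two copies of the same script on one machine; it does nothing about restic's own locks in the repository.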

cdhowie · June 7, 2019, 5:15pm

FWIW, I haven’t had any issues with stale locks that didn’t turn out to be something else. In particular, I’ve found stale locks in the following situations:

The machine was in the middle of a backup when the hosting provider decided to reboot the machine unannounced for emergency maintenance.
The backup was against an S3 bucket with versioning enabled, and a transient S3 problem caused a deleted lock to appear to exist when listing objects, but fail to be retrieved when actually trying to get the object.
There was too much happening on the machine and restic ran out of memory and was subsequently killed by the kernel’s oom-killer, leaving a dangling lock and no completed backup.

LanceHaverkamp · June 7, 2019, 10:58pm

I think this is a time-of-day shutdown issue. I’m using a daily timer in systemd, and I’ve been doing this long enough to notice that it tends to run one second after midnight. So I’m closing this & marking it solved. I’ve gone back to trying to get backups done at shutdown, but that’s got issues as well.