Batch script to avoid stale locks & duplicate instances

#1

Hello,

I have been having some trouble perfecting my restic batch scripts. During a lengthy forget process, they will often end up running simultaneously if I include unlock logic. If I don’t include unlock logic, they will inevitably stop running when a VM instance is rebooted, leaving me with no backups for months until I check in on it and find a stale lock.

Does anyone have a good script they like to use to avoid these problems? Here is my current one which tends to stack up.

#!/bin/bash

cd /root
source /root/.restic-env
echo `date` >> /root/backupLog.txt

# this does not appear to actually work, multiple instances can get stacked on top of each other.
if pgrep "/usr/bin/restic" >/dev/null 2>&1 ; then
        echo already running elsewhere
        exit
else
        /usr/bin/restic unlock
fi

nice -10 /usr/bin/restic backup /var/svn >> /root/backupLog.txt 2>&1
nice -10 /usr/bin/restic forget --prune --keep-hourly 56 --keep-daily 90 --keep-monthly 12  >> /root/backupLog.txt 2>&1

#2

Hello! I had this problem with my bash script too, and solved it by having the script create its own “lock” file. My original logic was not so great. Since I have a couple of machines backing up to the same repository, my first step was to choose just one machine to handle forget and prune for that repository. All machines back up to the same repository, but only one of them performs the forget and prune, so they no longer all fail once one of them has started the process. The second step was to create the “lock” file mentioned above, so that if the script is still running it will not attempt to run again. My bash script has a function called rescript-lock that I call at the beginning of every other function:

function rescript-lock {
  if [ -e "$lock" ]; then
    echo "WARNING: [$repo] repo is already running..."
    echo "If you are sure $repo is not running, type"
    echo " "
    echo "  rescript $repo unlocker"
    echo " "
    echo "This will remove the lock for [$repo] repository."
    echo ""
    echo "Lock file info:"
    stat "$lock_dir/$repo.lock"
    exit
  fi
  touch "$lock"
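  # Remove the lock on exit; make signals exit so the script does not
  # keep running without its lock after the handler returns.
  trap 'rm -f "$lock"' EXIT
  trap 'exit 1' INT QUIT TERM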
}

I use trap so that the lock file is deleted if the process is interrupted. The function creates an empty file called repo_name.lock; if the file already exists when my cron job attempts to start the script again, it echoes a message saying the script is already running, shows the lock file info, and exits. If the file is not there, it proceeds.

Usually, there is no need to use unlock unless, of course, you’re having problems like this. Since I started using this method I haven’t had any problems with it. So you could easily start your script with:

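lock="/root/script.lock"
if [ -e "$lock" ]; then
  echo "WARNING: Backup is already running..."
  echo ""
  echo "Lock file info:"
  stat "$lock"
  exit 1
fi
touch "$lock"
# Remove the lock whenever the script exits; turn signals into a normal exit
# so the script does not carry on without its lock after the handler runs.
trap 'rm -f "$lock"' EXIT
trap 'exit 1' INT QUIT TERM
# Rest of the script below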

I hope this helps.
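
The "only one machine does forget and prune" part of this post could look roughly like the sketch below. The host name backup01 and the helper is_prune_host are made up for illustration; the restic flags in the usage comment are the ones from the script in #1.

```shell
#!/bin/bash
# Hypothetical name of the one machine allowed to run forget/prune.
prune_host="backup01"

# Succeeds only when the given host name ($1) equals the designated host ($2).
is_prune_host() {
  [ "$1" = "$2" ]
}

# Usage inside the backup script: every machine runs the backup,
# only the designated one goes on to forget/prune.
#
#   restic backup /var/svn
#   if is_prune_host "$(hostname -s)" "$prune_host"; then
#     restic forget --prune --keep-hourly 56 --keep-daily 90 --keep-monthly 12
#   fi
```

Keeping the comparison in a small function makes it easy to swap in a different rule later, for example reading the host name from a config file.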

#3

Hello @Sim,

My suggestion (which is what I do here) is to write the PID of the shell to the lock file. When the script starts and finds an existing lock file, it checks whether that PID is still alive (with kill -0); if not, the lock is stale, so the script can simply remove it, write a new one, and go right ahead.

Sorry, no code examples as I’m in a hurry right now, but it should be common enough to google.

Cheers,
– Durval.
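
For anyone landing here later, the PID-file idea from #3 might be sketched like this. It is only an illustration under some assumptions: the function name acquire_lock and the lock path are made up, and the script is assumed to run as root (as in #1), so kill -0 is allowed to probe any PID. As a side note on the script in #1: pgrep matches only the process name unless -f is given, so a pattern containing the full path /usr/bin/restic will not match a running restic, which would explain the stacking.

```shell
#!/bin/bash
# Hypothetical lock location; any path the script can write to will do.
lock="${LOCK_FILE:-/root/restic-backup.lock}"

# Try to take the lock at path $1. Returns 0 on success, 1 if another
# live instance holds it (or wins the race to create it).
acquire_lock() {
  local lock_file="$1" old_pid
  if [ -e "$lock_file" ]; then
    old_pid=$(cat "$lock_file" 2>/dev/null) || old_pid=""
    # kill -0 sends no signal; it only tests whether the PID exists.
    if [ -n "$old_pid" ] && kill -0 "$old_pid" 2>/dev/null; then
      echo "Backup already running as PID $old_pid" >&2
      return 1
    fi
    echo "Removing stale lock left by PID ${old_pid:-unknown}" >&2
    rm -f "$lock_file"
  fi
  # With noclobber the redirect fails if the file already exists, so two
  # instances starting at the same moment cannot both create it.
  ( set -o noclobber; echo "$$" > "$lock_file" ) 2>/dev/null || return 1
}

# Usage at the top of the backup script:
#
#   acquire_lock "$lock" || exit 1
#   trap 'rm -f "$lock"' EXIT
#   trap 'exit 1' INT QUIT TERM
#   # backup / forget commands follow
```

Two caveats: after a reboot the old PID could in principle have been reused by an unrelated process, in which case the lock would wrongly look live, and there is still a small window between removing a stale lock and creating the new one. This only guards the script itself; a stale lock inside the restic repository may still need restic unlock.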