Locking issues (stale locks)

wilhelmy · February 9, 2021, 12:20pm

Greetings!

Restic currently fails for me with 82 stale locks. Since I need to run a backup but still would like to investigate how these locks appear, I tried the following:

root@mysql-backup:~/locks# restic --no-lock cat lock fc3da18450f46bad9e3bc45c5ca262196d09d1f91a62972224c86fdf800f6555
repository f2e5bf7e opened successfully, password is correct
unable to create lock in backend: repository is already locked exclusively by PID 32298 on mysql-backup by root (UID 0, GID 0)
lock was created at 2021-02-08 19:18:53 (16h57m49.201173616s ago)
storage ID 0025539a
the `unlock` command can be used to remove stale locks

Aside from that, any ideas why there might be so many stale locks from yesterday that restic cannot remove by itself? The docs state that restic can automatically remove stale locks in case it’s running on the same host and the PID of the locking process does not exist anymore. Apparently that’s not entirely true?
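To make the documented rule concrete, here is a minimal Python sketch of "same host and the PID no longer exists". It is not restic's actual code: the `Lock` class, `pid_alive`, and `is_stale` are hypothetical names, and the liveness probe assumes a POSIX system. It only models the one condition quoted from the docs above.

```python
import os
import socket
from dataclasses import dataclass


@dataclass
class Lock:
    """Hypothetical stand-in for the fields a lock file records."""
    hostname: str
    pid: int
    exclusive: bool = False


def pid_alive(pid):
    """POSIX-only probe: signal 0 tests for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stale(lock, local_host=None, alive=pid_alive):
    """Return True only if the lock was taken on this host by a process that is gone."""
    if local_host is None:
        local_host = socket.gethostname()
    if lock.hostname != local_host:
        # A PID recorded on another machine cannot be checked from here.
        return False
    return not alive(lock.pid)
```

Under this rule, a lock written by a different host is never judged stale by the PID check, which may matter when several machines share one repository.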

Is it safe to run restic {backup,prune,forget} with --no-lock (the documentation states that locks are used mainly for performance), and if so, what are the implications?

Also, how can I find the offending exclusive lock and delete only that?

In case it matters, backups are stored on Backblaze via b2 API.

cdhowie · February 10, 2021, 5:06pm

No, this is not safe. Operations requiring an exclusive lock cannot run concurrently with any other operation without risking data loss. For example, running prune and backup at the same time is very likely to lead to a repository that references data it does not contain.

wilhelmy · February 10, 2021, 5:25pm

Thanks, this’ll probably already help me work around my problem. Is simultaneously running two backups on different machines going to cause locking trouble? I have randomized time offsets for different machines at the moment, but I fear they might collide.

It also seems to me that the lock-cleanup code isn’t completely right. I ran another test backup to a local test repo in /tmp yesterday with only a file and a directory in it, and after the four calls to “restic backup” in this paste I’m left with two locks. Not sure whether or not this is a known bug.

cdhowie · February 10, 2021, 5:32pm

Backups do not take exclusive locks, so no. See this table:

jarm0 · February 10, 2021, 6:43pm

Just out of curiosity: is it even a good idea to back up to the same restic repo from different machines? So far I have used a separate restic repo for each machine because I thought that backing up multiple machines into the same repo would cause problems. I see it as a problem especially in cases where files with the same paths exist on different machines with different contents. What is the use case for sharing one restic repo between more than one machine?

cdhowie · February 10, 2021, 7:30pm

Deduplication is repository-wide. If you are doing full-system backups and the machines all run a common/similar OS, the common OS files will be deduplicated even between machines, which can significantly reduce the amount of data you need to store. Obviously this extends to more than the OS files – any common data between the machines will be deduplicated.

On the other hand, more objects in a repository means larger restic caches on each machine and more RAM required to interact with the repository.

jarm0 · February 10, 2021, 7:50pm

I see. Do I understand correctly that if I’m backing up multiple Ubuntu 20 machines and their /etc directories, then files with the same content will be backed up only once? What does “same content” even mean in the eyes of restic? Same contents, I understand, but what if permissions are different? Timestamps? What happens if the content is different: how will restic back it up? What will be restored if I try to restore a file which was modified on two different systems on the same day and backed up by restic twice?

I don’t know, all these questions make me wonder if all this extra complexity is really worth the saved space. Of course it depends on the backed up content in the end.

cdhowie · February 10, 2021, 9:26pm

File contents are split into chunks. Each chunk is then deduplicated against all other chunks in the same repository.

Filenames, permissions, and timestamps are data of the directory containing the file, and are not considered part of the file contents. If permissions are different then the parent directory will generate a different tree object.

The same way. If a chunk is not found to already exist in the repository then it will be uploaded.

Restic will restore what you ask it to restore. If you back up two systems to restic then you will wind up with two different snapshot objects. When restoring you have to specify which snapshot you want to restore from.
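The points above can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: it cuts files into fixed-size chunks (restic's real chunker is content-defined, with variable chunk sizes), and the `Repo` class, `CHUNK_SIZE`, and the host names are made up for the example.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, so the demo data stays readable


def split(data, size=CHUNK_SIZE):
    """Cut data into fixed-size chunks (a simplification, see lead-in)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Repo:
    """Toy repository: chunks are stored once, keyed by their SHA-256 hash."""

    def __init__(self):
        self.chunks = {}   # chunk id -> chunk bytes
        self.uploads = 0   # number of chunks actually stored

    def store_file(self, data):
        """Store a file's contents and return the list of chunk ids describing it."""
        ids = []
        for chunk in split(data):
            chunk_id = hashlib.sha256(chunk).hexdigest()
            if chunk_id not in self.chunks:  # upload only chunks not seen before
                self.chunks[chunk_id] = chunk
                self.uploads += 1
            ids.append(chunk_id)
        return ids

    def read_file(self, ids):
        """Reassemble a file from its chunk ids."""
        return b"".join(self.chunks[i] for i in ids)


repo = Repo()
# One "snapshot" per machine. Path and mode live in the snapshot metadata,
# not in the chunk data, so differing permissions do not create new chunks.
snapshots = {
    "host-a": {"path": "/etc/motd", "mode": 0o644,
               "chunks": repo.store_file(b"AAAABBBBCCCC")},
    "host-b": {"path": "/etc/motd", "mode": 0o600,
               "chunks": repo.store_file(b"AAAABBBBDDDD")},
}
```

Both files share their first two chunks, so the repository ends up holding four chunks instead of six, and each snapshot still reassembles to exactly the bytes that machine backed up.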

MichaelEischer · February 11, 2021, 10:11pm

The backup, prune and forget operations completely ignore the --no-lock flag. There’s no way to disable the lock mechanism for these operations.

restic unlock can be used to remove stale lock files. The other operations only remove their own lock file. Please note that restic before 0.10.0 has a bug in the lock file refresh which could allow unlock to delete lock files that are still in use.

Which restic version do you use?

I’m not aware that there’s an issue on GitHub for that specific problem.

wilhelmy · February 12, 2021, 4:23pm

Thanks for the pointers. This is a restic 0.11.0 binary downloaded from GitHub, and since this is a brand-new setup, no other restic release ever wrote to any of my repositories.

I know I can manually remove stale lockfiles, but I don’t see why they should be there in the first place? Maybe having multiple prune jobs run in parallel… I’ve reengineered this part of the glue around restic so that prune for the whole repo only runs on one host, because I figured it’s just an expensive nop with race conditions the way I use it right now. The brief test I ran as mentioned makes me unsure about that — granted, those were backup jobs so those locks are shared ones.

Since I’m dealing with exclusive locks I cannot look at, I think that’s definitely worth its own issue as well. When I checked, restic was running on none of the machines writing to that b2 repository.

MichaelEischer · February 12, 2021, 9:36pm

Normally there are no left-over stale lock files, especially not when using a local repository. I’ve tried running restic backup . --json --verbose=0 | jq . a hundred times in a row, but that didn’t cause left-over lock files for me. Normally stale lock files are caused by interrupting restic, or by piping its output into some other command in such a way that restic receives a SIGPIPE signal.
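The SIGPIPE case can be reproduced with a toy program. The Python sketch below does not involve restic at all: a child process creates a file standing in for a lock, registers cleanup for normal exit, and keeps writing to stdout while the parent closes the pipe early. The names `CHILD` and `run_and_cut_pipe` are invented for the example, and it assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys

# Source of the child process: take a "lock", clean it up on normal exit,
# then write progress output forever.
CHILD = r"""
import atexit, os, signal, sys
lock = sys.argv[1]
open(lock, "w").close()                        # take the "lock"
atexit.register(os.remove, lock)               # cleanup that runs on normal exit only
signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # default action: die on SIGPIPE
while True:
    sys.stdout.write("progress\n")
    sys.stdout.flush()
"""


def run_and_cut_pipe(lock_path):
    """Start the child, read one line, then close the pipe like an early-exiting reader."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, lock_path],
        stdout=subprocess.PIPE,
    )
    child.stdout.readline()  # the lock exists once the first line has arrived
    child.stdout.close()     # the reader goes away; the child's next write fails
    returncode = child.wait()
    return returncode, os.path.exists(lock_path)
```

Calling `run_and_cut_pipe` with a path in a temporary directory should return a negative return code (the child was killed by a signal) and report that the lock file is still present, because exit handlers do not run when a process is terminated by an unhandled signal.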

Only a single prune run per repository should be active at a given time (and no other operation which modifies the repository in any way). The other prune jobs should fail on an already locked repository.
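As a rough model of that rule, the Python sketch below treats an exclusive lock as incompatible with every other lock, while non-exclusive locks can coexist. It is an illustration of the behaviour described in this thread, not restic's implementation; `Lock`, `RepositoryLocked`, and `acquire` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lock:
    owner: str
    exclusive: bool


class RepositoryLocked(Exception):
    """Raised when a requested lock conflicts with one already present."""


def acquire(existing, owner, exclusive):
    """Add a lock to `existing` (a list of Lock) or raise RepositoryLocked.

    An exclusive request conflicts with any existing lock; a non-exclusive
    request conflicts only with an existing exclusive lock.
    """
    for lock in existing:
        if exclusive or lock.exclusive:
            raise RepositoryLocked(
                "repository is already locked by " + lock.owner
            )
    new_lock = Lock(owner, exclusive)
    existing.append(new_lock)
    return new_lock
```

In this model two backups from different machines can hold locks at the same time, a prune started while either is running fails, and a second prune fails while the first still holds its exclusive lock.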

restic cat --no-lock lock <lock-id> works when you use a beta version of restic. Either build the master branch yourself or get a build from the restic beta releases if you want to give it a try.