Restic locking not possible on some clients

Dear all,

we are using restic now for about half a year in production and are very satisfied so far. It’s convenient, easy to use and fast.

We have about 1,500 restic clients which all use the same S3 storage to save restic snapshots. Each client has its own restic repository on the S3. We wrote small wrappers for backup/restore/etc are sourcing a configuration file located on the client itself which defines some variables like these:

export RESTIC_PASSWORD=*********************************************
export RESTIC_REPOSITORY_BASE=s3:https://<FQDN_of_S3_server>/<S3_BUCKET>/rr
export RESTIC_REPOSITORY=${RESTIC_REPOSITORY_BASE}-<client_hostname>
export RESTIC_REPOSITORY_SHORTNAME=rr-<client_hostname>

export AWS_ACCESS_KEY_ID=***********************************************
export AWS_SECRET_ACCESS_KEY=***********************************************
export AWS_ENDPOINT_URL=https://<FQDN_of_S3_server>
export AWS_BUCKET=s3://<S3_BUCKET>/rr-<client>

export TENANT=<tenant_name>
export SERVICE=<service_name>
export BACKUP_DIR=/var/lib/jenkins/master
export EXCLUDE_FILES="--exclude /var/lib/jenkins/master/caches --exclude /var/lib/jenkins/master/workspace"
export RETENTION_DAYS=10

do_backup(){
   logger "backing up tenant ${TENANT} / service ${SERVICE}"
   restic backup \
      ${EXCLUDE_FILES} \
      --verbose \
      --host ${host} \
      --tag ${TENANT} ${BACKUP_DIR} 2>&1 >> $logFile
   rcRestic=$?
}

I have now the strange situation where the backup fails for some reason which was not clear from the log what happened:

2024-01-30 04:51:39] dot-restic-backup.sh: backup completed: there was a fatal error with backup (no snapshot created)

I then did a manual backup like this:

restic backup --verbose --host <client_hostname> --tag <client_tag> /var/lib/jenkins/master

which producted the following output:

open repository
repository bfd50002 opened (version 2, compression level auto)
lock repository
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 264.548424ms: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 504.084248ms: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 789.516851ms: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 1.208177748s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 3.187347333s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 5.085535343s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 4.82923451s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 4.53371898s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 15.720188005s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 17.830241804s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 544.303901ms: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 1.096046483s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 1.268239083s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 2.193189615s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 1.617410165s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 3.797922313s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 6.918690894s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 6.321951436s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 6.654076796s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 13.952480804s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 593.510552ms: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 402.192182ms: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 1.607434279s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 1.18157902s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 3.740422932s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 3.789983103s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 7.432039314s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 5.650148688s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 7.436425516s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 22.812015608s: The specified key does not exist.
unable to create lock in backend: The specified key does not exist.

That sounds like restic is having problems creating the lock on the repo.

I then tried other machines and other commands. All errors occuring have one in common: They fail when they try to create a lock on the repository:

root@<client_hostname>:~# restic cat config
repository bfd50002 opened (version 2, compression level auto)
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 369.025955ms: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 524.253019ms: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 1.633300767s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 2.004477789s: The specified key does not exist.
[...]
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 10.490272322s: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 22.586281609s: The specified key does not exist.
unable to create lock in backend: The specified key does not exist.

or like this:

root@<client_hostname>:~# restic list locks
repository bfd50002 opened (version 2, compression level auto)
1ffe828352b54929462ccd788d7cd3279f56496b252bd48ce11b016020c47e65
root@<client_hostname>:~# restic unlock
repository bfd50002 opened (version 2, compression level auto)
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 651.162656ms: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 648.236312ms: The specified key does not exist.
Load(<lock/1ffe828352>, 0, 0) returned error, retrying after 802.909126ms: The specified key does not exist.

It seems that the restic unlock command does not work anymore on some of my backup clients. I tried to find the error message

The specified key does not exist

in the restic code, but it’s not contained in the code.

My questions here are now the following

  • has anybody an idea why the locking mechanism stopped working (which prevents a lot of my wrappers from working because some of them are using locking mechanism)
  • where does the error message come from?

It seems to help to use restic unlock --remove-all. Example from another machine:

root@<client_hostname>:~# restic unlock --remove-all
repository 6a87315d opened (version 2, compression level auto)
successfully removed 1 locks

Afterwards the backup was running fine again. And also the ability to get information about snapshots was running fine again:

root@<client_hostname>:~# restic snapshots
repository 6a87315d opened (version 2, compression level auto)
ID        Time                 Host                       Tags        Paths
-----------------------------------------------------------------------------------------------------------
9bf3bb82  2023-12-13 05:11:03  <client_hostname>  dhcp1       /<client_hostname>.sql
5d38f5b1  2023-12-14 05:11:02  <client_hostname>  dhcp1       /<client_hostname>.sql
2b7d16a9  2023-12-15 05:11:02  <client_hostname>  dhcp1       /<client_hostname>.sql
36fe2fff  2023-12-16 05:11:02  <client_hostname>  dhcp1       /<client_hostname>.sql
819da5f0  2023-12-17 05:11:02  <client_hostname>  dhcp1       /<client_hostname>.sql
a272f54e  2023-12-18 05:11:03  <client_hostname>  dhcp1       /<client_hostname>.sql
e6f214ae  2023-12-19 05:11:02  <client_hostname>  dhcp1       /<client_hostname>.sql
de05329c  2023-12-20 05:11:02  <client_hostname>  dhcp1       /<client_hostname>.sql
61d3d2f2  2023-12-21 05:11:02  <client_hostname>  dhcp1       /<client_hostname>.sql
6d73ee43  2023-12-22 05:11:03  <client_hostname>  dhcp1       /<client_hostname>.sql
-----------------------------------------------------------------------------------------------------------
10 snapshots

Many thanks in advance and best greetings,
Joachim.

PS: But there are also backup clients where the trick with the manual restic unlock --remove-all is not working:

root@<client_hostname>:~# restic unlock --remove-all
repository f18c3436 opened (version 2, compression level auto)
root@<client_hostname>:~# restic backup    ${EXCLUDE_FILES}    --verbose    --host ${host}    --tag ${TENANT} ${BACKUP_DIR} 2>&1
open repository
repository f18c3436 opened (version 2, compression level auto)
lock repository
no parent snapshot found, will read all files
load index files
Load(<index/4f0486958d>, 0, 0) returned error, retrying after 647.381076ms: The specified key does not exist.
Load(<index/4f0486958d>, 0, 0) returned error, retrying after 530.309941ms: The specified key does not exist.
Load(<index/4f0486958d>, 0, 0) returned error, retrying after 758.979986ms: The specified key does not exist.
Load(<index/4f0486958d>, 0, 0) returned error, retrying after 1.874951947s: The specified key does not exist.
Load(<index/4f0486958d>, 0, 0) returned error, retrying after 2.530114264s: The specified key does not exist.
Load(<index/4f0486958d>, 0, 0) returned error, retrying after 2.622807819s: The specified key does not exist.
Load(<index/4f0486958d>, 0, 0) returned error, retrying after 3.638754979s: The specified key does not exist.
Load(<index/4f0486958d>, 0, 0) returned error, retrying after 9.060368354s: The specified key does not exist.
Load(<index/4f0486958d>, 0, 0) returned error, retrying after 13.388024452s: The specified key does not exist.
Load(<index/4f0486958d>, 0, 0) returned error, retrying after 22.847250941s: The specified key does not exist.
The specified key does not exist.

What on earth is meant with The specified key does not exist.?

Thanks again!
Joachim.

And again a PS: I found now that a repair approach of the index files in the repositories helps sometimes:

root@<client_hostname>:~# restic repair index
repository 23d3db81 opened (version 2, compression level auto)
loading indexes...
Load(<index/12a74d6ac2>, 0, 0) returned error, retrying after 406.376345ms: The specified key does not exist.
Load(<index/806c291469>, 0, 0) returned error, retrying after 296.163674ms: The specified key does not exist.
Load(<index/806c291469>, 0, 0) returned error, retrying after 778.400035ms: The specified key does not exist.
Load(<index/12a74d6ac2>, 0, 0) returned error, retrying after 748.075325ms: The specified key does not exist.
Load(<index/806c291469>, 0, 0) returned error, retrying after 617.996066ms: The specified key does not exist.
Load(<index/12a74d6ac2>, 0, 0) returned error, retrying after 1.281816865s: The specified key does not exist.
Load(<index/806c291469>, 0, 0) returned error, retrying after 901.249107ms: The specified key does not exist.
Load(<index/12a74d6ac2>, 0, 0) returned error, retrying after 975.225767ms: The specified key does not exist.
Load(<index/806c291469>, 0, 0) returned error, retrying after 3.053775574s: The specified key does not exist.
Load(<index/12a74d6ac2>, 0, 0) returned error, retrying after 3.026802398s: The specified key does not exist.
Load(<index/806c291469>, 0, 0) returned error, retrying after 4.577012343s: The specified key does not exist.
Load(<index/12a74d6ac2>, 0, 0) returned error, retrying after 4.14878225s: The specified key does not exist.
Load(<index/806c291469>, 0, 0) returned error, retrying after 7.246450042s: The specified key does not exist.
Load(<index/12a74d6ac2>, 0, 0) returned error, retrying after 4.367107444s: The specified key does not exist.
Load(<index/12a74d6ac2>, 0, 0) returned error, retrying after 11.455751805s: The specified key does not exist.
Load(<index/806c291469>, 0, 0) returned error, retrying after 9.929463598s: The specified key does not exist.
Load(<index/12a74d6ac2>, 0, 0) returned error, retrying after 11.413377258s: The specified key does not exist.
Load(<index/806c291469>, 0, 0) returned error, retrying after 19.132774987s: The specified key does not exist.
Load(<index/12a74d6ac2>, 0, 0) returned error, retrying after 11.367906165s: The specified key does not exist.
Load(<index/806c291469>, 0, 0) returned error, retrying after 15.690784001s: The specified key does not exist.
removing invalid index 12a74d6ac2228825556ed1a875f3fc3ae457cdb27a727d4bd5937d9077928153: The specified key does not exist.
removing invalid index 806c2914692eca4a37b6d0550ea0620356287be9cadb929051daf8cd07200197: The specified key does not exist.
getting pack files to read...
rebuilding index
[0:00] 100.00%  34 / 34 packs processed
deleting obsolete index files
[0:00] 100.00%  3 / 3 files deleted
done

Afterwards the repository works again as expected. What could have caused this index corruption? I have about 250 clients where the backup started failing with these error messages on exactly the same day. Any ideas?

Again, best greetings and thanks again,
Joachim.

The The specified key does not exist message comes from the Minio S3 library, the source of it can be seen here: https://github.com/minio/minio-go/blob/master/s3-error.go#L39

Which restic version are you using when seeing this problem?

What software or service is the S3 you are using?

Hey @rawtaz,

thanks for the answer!

I’m using restic_0.16.0_linux_amd64 at the moment. Unfortunately I can’t say which software is used on the S3 side as it’s a service where I just ordered a bucket of 50 TB for our backups. All I know is that it’s completely Amazon S3 compatible. It’s hosted on-prem in our own datacentre.

S3 is actually a key-value store, that is you can read/write data identified by a key (probably better known as path). Folders don’t really exist in S3, clients just synthesize those for convenience.

In case of locking a restic repository the The specified key does not exist. error can have two causes:

  • multiple clients that concurrently access a repository can result in a situation where restic first lists all lock files and then tries to read each one of then. In the meantime another client removes its lock, such that the first client fails to retrieve the lock file (as it has been deleted).
  • The S3 storage returns a file list that includes files that are not actually there. This is what seems to happen here. Restic made three attempts to access Load(<lock/1ffe828352>, 0, 0) (the timeouts start three times from a few hundred milliseconds). Inbetween these attempts restic updates the lock file listing. Thus, based on the log output we can rule out the first cause.

This means that the S3 storage claims that certain files exist, but is unable to return them upon request. Restic absolutely requires consistent file listings (i.e., all listed files must exist). This must be fixed on the S3 storage side.

1 Like