Persistent repository corruption and data loss

amuckart · May 17, 2021, 1:45am

I am having major problems with persistent corruption/data loss of my restic repo.

Restic version: restic 0.12.0 compiled with go1.15.8 on linux/amd64. I am backing up to a local cloud provider’s S3-compatible object storage. The server I’m backing up is running Debian 9.13, Linux kernel 4.9.0-14-amd64.

Basically every time I run a restic check I end up with fatal errors due to missing data, sometimes for snapshots that were taken weeks ago.

I’m pretty much at my wit’s end here. I can’t see what’s causing the loss of data in the repository and whatever it is, nothing I do will bring the repo into a consistent state where I can get restic check to pass so I have zero confidence that the backups I have are usable or will stay usable.

It particularly concerns me that I’m losing data from snapshots that were taken weeks ago and shouldn’t have been touched since then.

Any suggestions about what is causing this and what I can do to A) fix it and B) have some surety that the backups I’m making are going to stay complete and consistent would be greatly appreciated.

Details below.

Thanks.

This isn’t a full log, but an example - I get the following output from restic check saying there’s missing data:

using temporary cache in /var/backups/restic/cache/restic-check-cache-458589517
repository 920b7fbf opened successfully, password is correct
created new cache in /var/backups/restic/cache/restic-check-cache-458589517
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
error for tree 6978d196:3 snapshots
  id 6978d1968ea81f413c53070606dc5747f86026b84c5e02380a51259b1509540f not found in repository
error for tree 4d539e4e:3 snapshots
  id 4d539e4e925d212b6e137826c0093738753650630f5c00bb5090af0298f12cbd not found in repository
[38:18] 100.00%  363 / 363 snapshots
Fatal: repository contains errors

Forgetting and pruning those two snapshots took over 20 hours, but re-running the check afterwards just found another instance of exactly the same error, but with a different, older, snapshot. The tree 7205bfed error below is for a snapshot that was taken 2021-03-21, and I first saw that ID pop up in restic check on 2021-05-14.

pack 63fac959: does not exist
check snapshots, trees and blobs
error for tree 7205bfed:2 snapshots
  id 7205bfed6bbbf20e9c9e527251f50576fdce4920f89e8993c5876957e6fad6e8 not found in repository
[40:54] 100.00%  362 / 362 snapshots
Fatal: repository contains errors

Things I’ve done that haven’t helped:

prune
rebuild-index
rebuild-index --read-all-packs
forget & prune the offending snapshots
check --read-data (after reading “Need suggestions on to recover my corrupted repository”)

The check with --read-data took 26 hours to run and found 3776 “contained in several indexes” errors and 2831 “pack does not exist” errors.

Running a rebuild-index after the check removed 8210 “not found pack files” and added 382 pack files to indexes. I can’t tally those numbers up with the output of restic check so I have no idea what’s going on there.
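One way to make those numbers easier to compare is to tally the saved check output by error type and by distinct pack ID, in case the same pack is reported more than once. A minimal sketch, assuming the output was saved to a log file; the function name is hypothetical, and the "contained in several indexes" line format is an assumption based on the phrase quoted above:

```shell
# Hypothetical helper: tally error lines in saved `restic check` output.
# Assumes lines such as "pack 63fac959: does not exist" and lines that
# contain the phrase "contained in several indexes".
tally_check_errors() {
    log="$1"
    # grep -c prints the number of matching lines (0 if there are none).
    printf 'does not exist: %s\n' "$(grep -c 'does not exist' "$log")"
    printf 'contained in several indexes: %s\n' "$(grep -c 'contained in several indexes' "$log")"
    # Distinct pack IDs reported as missing ($2 is "63fac959:" in the sample line).
    printf 'distinct missing packs: %s\n' \
        "$(awk '/^pack .*does not exist/ { sub(":", "", $2); print $2 }' "$log" | sort -u | wc -l | tr -d ' ')"
}
```

Called as `tally_check_errors check.log` (file name hypothetical), it prints three counts that can be set against the "removing not found pack file" count from rebuild-index.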

The tail end of the rebuild-index:

root@prod-backup1:/var/backups/restic/log# restic rebuild-index
repository 920b7fbf opened successfully, password is correct
loading indexes...
getting pack files to read...
adding pack file to index 006f362f03f73320d8d44ec22da97ccbd703d47645fb066a163b25845fef6fbb
[...381 lines of 'adding pack file to index' snipped...]
removing not found pack file 5658a11047a1e00fb9194281522271406fe00ef08ce2b160cf0696d1b5fc876b
[...8209 lines of 'removing not found pack file' snipped...]
reading pack files
[0:18] 100.00%  382 / 382 packs
rebuilding index
Save(<index/7633c0d3f2>) returned error, retrying after 552.330144ms: wrote 0 bytes instead of the expected 6824468 bytes
Save(<index/7633c0d3f2>) returned error, retrying after 1.080381816s: wrote 0 bytes instead of the expected 6824468 bytes
Save(<index/abd54bcc9a>) returned error, retrying after 582.280027ms: wrote 0 bytes instead of the expected 6359274 bytes
Save(<index/7633c0d3f2>) returned error, retrying after 1.054928461s: wrote 0 bytes instead of the expected 6824468 bytes
Save(<index/abd54bcc9a>) returned error, retrying after 693.478123ms: wrote 0 bytes instead of the expected 6359274 bytes
Save(<index/8467add92c>) returned error, retrying after 593.411537ms: wrote 0 bytes instead of the expected 6378679 bytes
Save(<index/8467add92c>) returned error, retrying after 424.227764ms: wrote 0 bytes instead of the expected 6378679 bytes
Save(<index/c35e209e15>) returned error, retrying after 328.259627ms: wrote 0 bytes instead of the expected 6174453 bytes
Save(<index/f66bf73faa>) returned error, retrying after 298.484759ms: wrote 0 bytes instead of the expected 5894855 bytes
[1:46] 100.00%  189156 / 189156 packs processed
deleting obsolete index files
[0:20] 100.00%  306 / 306 files deleted
done

Because it did eventually finish without erroring out, I’m assuming the write errors are transient.
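A per-file count of the retries does not prove that every write eventually succeeded, but it does show which index files were affected and how often. A rough sketch, assuming the rebuild-index output was saved to a file and uses the "Save(<index/ID>) returned error, retrying after ..." format shown above (the function name is hypothetical):

```shell
# Hypothetical helper: count "retrying" lines per file in saved restic output.
count_save_retries() {
    # Split on "<" and ">" so that $2 is e.g. "index/7633c0d3f2".
    awk -F'[<>]' '/returned error, retrying/ { n[$2]++ }
        END { for (f in n) print n[f], f }' "$1" | sort -k1,1nr -k2,2
}
```

For the nine retry lines above, this would list index/7633c0d3f2 first with three retries. Whether those index files actually exist in the bucket afterwards would still need to be checked separately.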

Running restic check immediately after that shows yet more missing data:

root@prod-backup1:/var/backups/restic/log# restic check
using temporary cache in /var/backups/restic/cache/restic-check-cache-634228203
repository 920b7fbf opened successfully, password is correct
created new cache in /var/backups/restic/cache/restic-check-cache-634228203
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
error for tree 05561760: snapshots
  id 055617608090d3f371743b89b72bdcfa3bdac7ea452bbd2a1bc2e6f5773e6741 not found in repository
error for tree 7205bfed:5 snapshots
  id 7205bfed6bbbf20e9c9e527251f50576fdce4920f89e8993c5876957e6fad6e8 not found in repository
error for tree db142900:5 snapshots
  id db1429003dbeecec54c220cac5246fff32888101c2216cc161ff9452a53a0f21 not found in repository
error for tree 2298ea2e:5 snapshots
  id 2298ea2e7f3fb07f44c57e760952bc27b9ff47490ae8b0b152218fab15adf759 not found in repository
[37:19] 100.00%  365 / 365 snapshots
Fatal: repository contains errors

Find the offending snapshots with missing trees:

root@prod-backup1:~# cat <<EOF | awk '/for tree/{sub(":",""); print $4}' | xargs /usr/local/sbin/restic_0.12.0_linux_amd64 find --tree
error for tree 05561760: snapshots
  id 055617608090d3f371743b89b72bdcfa3bdac7ea452bbd2a1bc2e6f5773e6741 not found in repository
error for tree 7205bfed:5 snapshots
  id 7205bfed6bbbf20e9c9e527251f50576fdce4920f89e8993c5876957e6fad6e8 not found in repository
error for tree db142900:5 snapshots
  id db1429003dbeecec54c220cac5246fff32888101c2216cc161ff9452a53a0f21 not found in repository
error for tree 2298ea2e:5 snapshots
EOF

repository 920b7fbf opened successfully, password is correct
Unable to load tree 2298ea2e7f3fb07f44c57e760952bc27b9ff47490ae8b0b152218fab15adf759
 ... which belongs to snapshot 203fd8e5c7b69a78221abcbace91f57836d1d95d39fada174bcb2e17fb19344a.
Unable to load tree db1429003dbeecec54c220cac5246fff32888101c2216cc161ff9452a53a0f21
 ... which belongs to snapshot 475f66a4d34307551db46168277fc17cc258a39982ad32efac231c50a4013944.
Unable to load tree 7205bfed6bbbf20e9c9e527251f50576fdce4920f89e8993c5876957e6fad6e8
 ... which belongs to snapshot 886ae8cdf6c83fc0a5a78c6d3dbb759afe6d101900a288b0517a7aa9eeab5e6a.
Unable to load tree 055617608090d3f371743b89b72bdcfa3bdac7ea452bbd2a1bc2e6f5773e6741
 ... which belongs to snapshot d892474b87daad3fe7078fc6ee987fed07483524eadb869fcf4443ac3f274695.
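The short IDs in the "error for tree" lines can be awkward to parse because of the text that follows the colon. A sketch of an alternative, assuming the check output format shown above was saved to a file: take the full IDs from the "id ... not found in repository" lines instead. The function name is hypothetical.

```shell
# Hypothetical helper: print the full IDs that `restic check` reported as
# missing, one per line, without duplicates.
missing_ids() {
    # awk ignores the leading indentation, so $1 is "id" and $2 is the full ID.
    awk '$1 == "id" && /not found in repository/ { print $2 }' "$1" | sort -u
}
```

The result could then be fed to find in the same way as before, e.g. `missing_ids check.log | xargs restic find --tree` (log file name hypothetical).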

Look at the snapshots:

root@prod-backup1:~# cat <<EOF | awk '/which belongs to/{sub(".$",""); print $NF}' |  xargs /usr/local/sbin/restic_0.12.0_linux_amd64 snapshots 
 ... which belongs to snapshot 203fd8e5c7b69a78221abcbace91f57836d1d95d39fada174bcb2e17fb19344a.
Unable to load tree db1429003dbeecec54c220cac5246fff32888101c2216cc161ff9452a53a0f21
 ... which belongs to snapshot 475f66a4d34307551db46168277fc17cc258a39982ad32efac231c50a4013944.
Unable to load tree 7205bfed6bbbf20e9c9e527251f50576fdce4920f89e8993c5876957e6fad6e8
 ... which belongs to snapshot 886ae8cdf6c83fc0a5a78c6d3dbb759afe6d101900a288b0517a7aa9eeab5e6a.
Unable to load tree 055617608090d3f371743b89b72bdcfa3bdac7ea452bbd2a1bc2e6f5773e6741
 ... which belongs to snapshot d892474b87daad3fe7078fc6ee987fed07483524eadb869fcf4443ac3f274695.
EOF

repository 920b7fbf opened successfully, password is correct
ID        Time                 Host                         Tags                                    Paths
----------------------------------------------------------------------------------------------------------------------------------------------------------
203fd8e5  2021-03-15 20:49:35  prod-image1   prod-backup1,2021-03-15  /var/backups/dirvish/spool/prod-image1
886ae8cd  2021-03-21 20:28:56  prod-proxy1   prod-backup1,2021-03-21  /var/backups/dirvish/spool/prod-proxy1
475f66a4  2021-04-10 20:37:35  prod-proxy1   prod-backup1,2021-04-10  /var/backups/dirvish/spool/prod-proxy1
d892474b  2021-04-23 20:34:59  prod-pubapp1  prod-backup1,2021-04-23  /var/backups/dirvish/spool/prod-pubapp1
----------------------------------------------------------------------------------------------------------------------------------------------------------
4 snapshots

And it’s a new list of snapshots, different from the last time I ran restic check, and none of them are recent!

fd0 · May 17, 2021, 7:54am

Huh, that sounds odd indeed! It looks like somehow files uploaded to the repo vanish (or were never uploaded correctly in the first place). Random thought: Is it maybe possible that the account for the provider is configured in a way to automatically remove older files?

Do you also experience the same issues when using a different provider? Or even a repository stored in a local directory? We need to find out if the issue is caused by restic, the storage backend, or an interaction of both.

Apart from that, it may be helpful if you could run check again, then find the file the chunk of data should be stored in via restic find --show-pack-id --tree <id> (also try with --blob if the blob is not found), e.g.:

$ restic find --show-pack-id --tree 56091245379f5f2c817ce1497dc80703fbe4b40e2a1907c6b184565fa2883014

Then check manually that the file data/56/56091245379f5f2c817ce1497dc80703fbe4b40e2a1907c6b184565fa2883014 really is in the repo.
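To avoid typos when doing this for several IDs, the path can be derived mechanically. A minimal sketch, assuming the data/<first two characters>/<full ID> pattern from the example above and that the ID passed in is the pack ID reported by find; the function name is hypothetical:

```shell
# Hypothetical helper: map a 64-character ID to data/<first two chars>/<ID>.
pack_path() {
    id="$1"
    # Reject empty input and anything containing non-hex characters.
    case "$id" in
        ''|*[!0-9a-f]*) echo "not a hex ID: $id" >&2; return 1 ;;
    esac
    [ "${#id}" -eq 64 ] || { echo "expected 64 characters: $id" >&2; return 1; }
    prefix=$(printf '%s' "$id" | cut -c1-2)
    printf 'data/%s/%s\n' "$prefix" "$id"
}
```

The printed path can then be looked up in the bucket with whatever S3 client the provider supports; that step is left out here because it depends on the client.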

You can also try to configure the rclone backend and access your cloud provider’s service via that. The implementation of the s3 backend is different in rclone and restic, so if rclone works, we likely have a bug in the s3 backend.

Maybe others here also have an idea what may be the cause for your issues.

amuckart · May 21, 2021, 5:59am

Thank you for your reply. I’ve been fighting with some other things the last few days but I will try those suggestions.

dpfeilsticker · May 25, 2021, 7:24am

Do you run backup, check, forget and prune as different users?!

All files written by restic are owned by, and readable only by, the user running the command.
So with different users for backup and check, this is exactly what you get.
For example, rest-server writes the backup as www-data, or as root.
Check runs as “you” (and can’t read files owned by www-data or root).
Prune runs as root, writing new files only root can read…

a) chown -R
b) run all restic tasks as a single user
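For a repository stored in a local directory (this does not apply to an S3 bucket), mixed ownership can be spotted before resorting to chown. A small sketch; the function name and the example path are hypothetical:

```shell
# Hypothetical helper: list files under a local repository directory that are
# not owned by the given user. Prints nothing when ownership is consistent.
files_not_owned_by() {
    repo_dir="$1"
    user="$2"
    find "$repo_dir" ! -user "$user" -print
}
```

For example, `files_not_owned_by /srv/restic-repo backup` would print every file that the user backup does not own.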

PS: Or do you write your backup to a ZFS filesystem without ECC RAM?!
In this case any single failure in your RAM will corrupt your filesystem while you don’t see any error in your disk SMART values…

amuckart · June 27, 2021, 2:22am

Nope, same user every time. Underlying storage is not ZFS.

amuckart · July 13, 2021, 9:57pm

I ended up creating a whole new bucket and repeating the backups. I couldn’t resolve the issues but they haven’t happened again so far.

I have wrapped checks around restic in the backup script I’m using to ensure there’s sufficient disk space and that checks pass before the backup runs.
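The original script was not posted, so the following is only a sketch of what such a wrapper might look like. The function names, the choice of which directory to check for free space, and the threshold are all assumptions; repository and password are expected to come from the environment as usual.

```shell
# Hypothetical pre-flight helpers for a backup script. Paths, thresholds and
# function names are illustrative, not taken from the original script.

# Succeed if the filesystem holding $1 has at least $2 KiB available.
has_free_space() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$2" ]
}

# Run the backup only if the space check and `restic check` both pass.
# RESTIC can be set to point at a specific binary; it defaults to "restic".
guarded_backup() {
    src="$1"; space_dir="$2"; min_kb="$3"
    has_free_space "$space_dir" "$min_kb" || {
        echo "not enough free space in $space_dir" >&2; return 1; }
    "${RESTIC:-restic}" check || {
        echo "restic check failed, skipping backup" >&2; return 1; }
    "${RESTIC:-restic}" backup "$src"
}
```

A call might look like `guarded_backup /var/backups/dirvish/spool/prod-proxy1 /var/backups/restic/cache 1048576`, which requires about 1 GiB free in the cache directory. Given how long check took on this repository, running it before every backup may be too slow in practice, and a less frequent schedule could be a reasonable trade-off.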