I am having major problems with persistent corruption/data loss of my restic repo.
Restic version: restic 0.12.0 compiled with go1.15.8 on linux/amd64
I am backing up to a local cloud provider’s S3-compatible object storage. The server I’m backing up is running Debian 9.13, Linux kernel 4.9.0-14-amd64.
Basically every time I run restic check I end up with fatal errors due to missing data, sometimes for snapshots that were taken weeks ago.
I’m pretty much at my wit’s end here. I can’t see what’s causing the loss of data in the repository, and whatever it is, nothing I do brings the repo into a consistent state where restic check passes, so I have zero confidence that the backups I have are usable or will stay usable.
It particularly concerns me that I’m losing data from snapshots that were taken weeks ago and shouldn’t have been touched since then.
Any suggestions about what is causing this and what I can do to A) fix it and B) have some surety that the backups I’m making are going to stay complete and consistent would be greatly appreciated.
Details below.
Thanks.
This isn’t a full log, just an example. I get the following output from restic check saying there’s missing data:
using temporary cache in /var/backups/restic/cache/restic-check-cache-458589517
repository 920b7fbf opened successfully, password is correct
created new cache in /var/backups/restic/cache/restic-check-cache-458589517
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
error for tree 6978d196:3 snapshots
id 6978d1968ea81f413c53070606dc5747f86026b84c5e02380a51259b1509540f not found in repository
error for tree 4d539e4e:3 snapshots
id 4d539e4e925d212b6e137826c0093738753650630f5c00bb5090af0298f12cbd not found in repository
[38:18] 100.00% 363 / 363 snapshots
Fatal: repository contains errors
Forgetting and pruning those two snapshots took over 20 hours, but re-running the check afterwards found exactly the same kind of error for a different, older snapshot. The tree 7205bfed error below is for a snapshot that was taken on 2021-03-21, and I first saw that ID appear in restic check on 2021-05-14.
pack 63fac959: does not exist
check snapshots, trees and blobs
error for tree 7205bfed:2 snapshots
id 7205bfed6bbbf20e9c9e527251f50576fdce4920f89e8993c5876957e6fad6e8 not found in repository
[40:54] 100.00% 362 / 362 snapshots
Fatal: repository contains errors
Things I’ve done that haven’t helped:
- prune
- rebuild-index
- rebuild-index --read-all-packs
- forget & prune the offending snapshots
- check --read-data (after reading the thread “Need suggestions on to recover my corrupted repository”)
The check with --read-data took 26 hours to run and found 3776 “contained in several indexes” errors and 2831 “pack does not exist” errors.
Running a rebuild-index after the check removed 8210 “not found pack files” and added 382 pack files to indexes. I can’t tally those numbers up with the output of restic check, so I have no idea what’s going on there.
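One way to make the numbers easier to compare is to count unique IDs in a saved log rather than raw lines. The following is only a sketch: it assumes the check output was redirected to a file (check.log is a hypothetical name), that the message wording matches the lines quoted in this post, and the function name tally_check_log is made up for illustration.

```shell
#!/bin/sh
# Summarise a saved restic check log read from stdin.
# Assumption: the log was captured with something like
#   restic check --read-data > check.log 2>&1
# (check.log is a hypothetical file name).
tally_check_log() {
    awk '
        # Message text as quoted in the post; exact wording is an assumption.
        /contained in several indexes/ { dup++ }

        # Lines such as: pack 63fac959: does not exist
        $1 == "pack" && /does not exist/ {
            if (!($2 in packs)) { packs[$2] = 1; npacks++ }
        }

        # Lines such as: id 7205bfed... not found in repository
        $1 == "id" && /not found in repository/ {
            if (!($2 in ids)) { ids[$2] = 1; nids++ }
        }

        END {
            printf "duplicate-index messages: %d\n", dup
            printf "missing packs (unique): %d\n", npacks
            printf "IDs not found (unique): %d\n", nids
        }
    '
}
```

Usage would be along the lines of `tally_check_log < check.log`. Counting unique IDs matters because the same pack or tree can be reported more than once in a single run.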
The tail end of the rebuild-index:
root@prod-backup1:/var/backups/restic/log# restic rebuild-index
repository 920b7fbf opened successfully, password is correct
loading indexes...
getting pack files to read...
adding pack file to index 006f362f03f73320d8d44ec22da97ccbd703d47645fb066a163b25845fef6fbb
[...381 lines of 'adding pack file to index' snipped...]
removing not found pack file 5658a11047a1e00fb9194281522271406fe00ef08ce2b160cf0696d1b5fc876b
[...8209 lines of 'removing not found pack file' snipped...]
reading pack files
[0:18] 100.00% 382 / 382 packs
rebuilding index
Save(<index/7633c0d3f2>) returned error, retrying after 552.330144ms: wrote 0 bytes instead of the expected 6824468 bytes
Save(<index/7633c0d3f2>) returned error, retrying after 1.080381816s: wrote 0 bytes instead of the expected 6824468 bytes
Save(<index/abd54bcc9a>) returned error, retrying after 582.280027ms: wrote 0 bytes instead of the expected 6359274 bytes
Save(<index/7633c0d3f2>) returned error, retrying after 1.054928461s: wrote 0 bytes instead of the expected 6824468 bytes
Save(<index/abd54bcc9a>) returned error, retrying after 693.478123ms: wrote 0 bytes instead of the expected 6359274 bytes
Save(<index/8467add92c>) returned error, retrying after 593.411537ms: wrote 0 bytes instead of the expected 6378679 bytes
Save(<index/8467add92c>) returned error, retrying after 424.227764ms: wrote 0 bytes instead of the expected 6378679 bytes
Save(<index/c35e209e15>) returned error, retrying after 328.259627ms: wrote 0 bytes instead of the expected 6174453 bytes
Save(<index/f66bf73faa>) returned error, retrying after 298.484759ms: wrote 0 bytes instead of the expected 5894855 bytes
[1:46] 100.00% 189156 / 189156 packs processed
deleting obsolete index files
[0:20] 100.00% 306 / 306 files deleted
done
Because it did eventually finish without erroring out, I’m assuming the write errors are transient.
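To see how often each index upload had to be retried, the Save(...) lines can be grouped by index ID. This is a rough sketch that assumes the rebuild-index output was saved to a file and that the retry lines look exactly like the ones above; count_save_retries is a hypothetical helper name.

```shell
#!/bin/sh
# Count Save() retries per index file in a saved rebuild-index log (stdin).
# Prints "<retry count> <short index id>", most-retried first.
count_save_retries() {
    awk '
        /^Save\(<index\// && /returned error, retrying/ {
            id = $1                          # e.g. Save(<index/7633c0d3f2>)
            sub(/^Save\(<index\//, "", id)   # strip the leading Save(<index/
            sub(/>\)$/, "", id)              # strip the trailing >)
            n[id]++
        }
        END { for (id in n) print n[id], id }
    ' | sort -k1,1nr -k2,2
}
```

A retry count alone doesn't prove the final attempt succeeded. Comparing the short IDs against the output of `restic list index` would be one way to confirm that each index file actually ended up in the repository.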
Running restic check immediately after that shows yet more missing data:
root@prod-backup1:/var/backups/restic/log# restic check
using temporary cache in /var/backups/restic/cache/restic-check-cache-634228203
repository 920b7fbf opened successfully, password is correct
created new cache in /var/backups/restic/cache/restic-check-cache-634228203
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
error for tree 05561760: snapshots
id 055617608090d3f371743b89b72bdcfa3bdac7ea452bbd2a1bc2e6f5773e6741 not found in repository
error for tree 7205bfed:5 snapshots
id 7205bfed6bbbf20e9c9e527251f50576fdce4920f89e8993c5876957e6fad6e8 not found in repository
error for tree db142900:5 snapshots
id db1429003dbeecec54c220cac5246fff32888101c2216cc161ff9452a53a0f21 not found in repository
error for tree 2298ea2e:5 snapshots
id 2298ea2e7f3fb07f44c57e760952bc27b9ff47490ae8b0b152218fab15adf759 not found in repository
[37:19] 100.00% 365 / 365 snapshots
Fatal: repository contains errors
Find the offending snapshots with missing trees:
root@prod-backup1:~# cat <<EOF | awk '/for tree/{sub(":",""); print $4}' | xargs /usr/local/sbin/restic_0.12.0_linux_amd64 find --tree
error for tree 05561760: snapshots
id 055617608090d3f371743b89b72bdcfa3bdac7ea452bbd2a1bc2e6f5773e6741 not found in repository
error for tree 7205bfed:5 snapshots
id 7205bfed6bbbf20e9c9e527251f50576fdce4920f89e8993c5876957e6fad6e8 not found in repository
error for tree db142900:5 snapshots
id db1429003dbeecec54c220cac5246fff32888101c2216cc161ff9452a53a0f21 not found in repository
error for tree 2298ea2e:5 snapshots
EOF
repository 920b7fbf opened successfully, password is correct
Unable to load tree 2298ea2e7f3fb07f44c57e760952bc27b9ff47490ae8b0b152218fab15adf759
... which belongs to snapshot 203fd8e5c7b69a78221abcbace91f57836d1d95d39fada174bcb2e17fb19344a.
Unable to load tree db1429003dbeecec54c220cac5246fff32888101c2216cc161ff9452a53a0f21
... which belongs to snapshot 475f66a4d34307551db46168277fc17cc258a39982ad32efac231c50a4013944.
Unable to load tree 7205bfed6bbbf20e9c9e527251f50576fdce4920f89e8993c5876957e6fad6e8
... which belongs to snapshot 886ae8cdf6c83fc0a5a78c6d3dbb759afe6d101900a288b0517a7aa9eeab5e6a.
Unable to load tree 055617608090d3f371743b89b72bdcfa3bdac7ea452bbd2a1bc2e6f5773e6741
... which belongs to snapshot d892474b87daad3fe7078fc6ee987fed07483524eadb869fcf4443ac3f274695.
Look at the snapshots:
root@prod-backup1:~# cat <<EOF | awk '/which belongs to/{sub(".$",""); print $NF}' | xargs /usr/local/sbin/restic_0.12.0_linux_amd64 snapshots
... which belongs to snapshot 203fd8e5c7b69a78221abcbace91f57836d1d95d39fada174bcb2e17fb19344a.
Unable to load tree db1429003dbeecec54c220cac5246fff32888101c2216cc161ff9452a53a0f21
... which belongs to snapshot 475f66a4d34307551db46168277fc17cc258a39982ad32efac231c50a4013944.
Unable to load tree 7205bfed6bbbf20e9c9e527251f50576fdce4920f89e8993c5876957e6fad6e8
... which belongs to snapshot 886ae8cdf6c83fc0a5a78c6d3dbb759afe6d101900a288b0517a7aa9eeab5e6a.
Unable to load tree 055617608090d3f371743b89b72bdcfa3bdac7ea452bbd2a1bc2e6f5773e6741
... which belongs to snapshot d892474b87daad3fe7078fc6ee987fed07483524eadb869fcf4443ac3f274695.
EOF
repository 920b7fbf opened successfully, password is correct
ID Time Host Tags Paths
----------------------------------------------------------------------------------------------------------------------------------------------------------
203fd8e5 2021-03-15 20:49:35 prod-image1 prod-backup1,2021-03-15 /var/backups/dirvish/spool/prod-image1
886ae8cd 2021-03-21 20:28:56 prod-proxy1 prod-backup1,2021-03-21 /var/backups/dirvish/spool/prod-proxy1
475f66a4 2021-04-10 20:37:35 prod-proxy1 prod-backup1,2021-04-10 /var/backups/dirvish/spool/prod-proxy1
d892474b 2021-04-23 20:34:59 prod-pubapp1 prod-backup1,2021-04-23 /var/backups/dirvish/spool/prod-pubapp1
----------------------------------------------------------------------------------------------------------------------------------------------------------
4 snapshots
And it’s a new list of snapshots, different from the last time I ran restic check, and none of them are recent!
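To track how the set of broken trees changes from one run to the next, the missing IDs from two saved check logs can be compared directly. This is a minimal sketch, assuming each run's output was saved to its own file. The helper names missing_trees and compare_checks are hypothetical, and the "id ... not found in repository" line format is taken from the output above.

```shell
#!/bin/sh
# Print the unique full IDs reported as "not found in repository"
# in a saved restic check log (file path given as $1).
missing_trees() {
    awk '$1 == "id" && /not found in repository/ { print $2 }' "$1" |
        LC_ALL=C sort -u
}

# Compare two saved check logs: $1 is the older run, $2 the newer run.
# Each ID is labelled gone (only in old), new (only in new) or still (in both).
compare_checks() {
    old=$(mktemp) || return 1
    new=$(mktemp) || return 1
    missing_trees "$1" > "$old"
    missing_trees "$2" > "$new"
    LC_ALL=C comm -23 "$old" "$new" | sed 's/^/gone  /'
    LC_ALL=C comm -13 "$old" "$new" | sed 's/^/new   /'
    LC_ALL=C comm -12 "$old" "$new" | sed 's/^/still /'
    rm -f "$old" "$new"
}
```

The IDs labelled new could then be passed to restic find --tree, as in the commands above, to see which snapshots they belong to.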