Hello! I’ve gotten a few unreproducible consistency errors and am hoping to verify the potential points of failure and get suggestions for diagnosing them. If others think it’s a good idea, I may be able to write a chapter in the manual about this topic.
I have gotten `Hash does not match id` during `prune` (details at Hash does not match id during prune · Issue #1999 · restic/restic · GitHub), `Pack ID does not match` during `check`, and `Blob ID does not match` during `check`.
@fd0 wrote the following about the last two of these errors (Dropped packets while backing up causes errors in archive - #3 by fd0):
Pack ID does not match
This is an error on the outermost level: restic requested a file from B2 for which it knows the SHA256 hash of the contents, but got something with a different hash back. The data might have been modified at rest, during transit, in memory of the machine, or even at backup time before it could be saved to B2.
Blob ID does not match
restic requested the pack 82b0562f and the hash of the contents matched the file name, so the data was not modified in transit or at rest. But: a part of the (encrypted) data has been modified, which could only have happened between restic encrypting the data (so called blob, pack files contain one or more of these) and saving the data to a temp file before uploading to B2. In theory, it could also have happened during check, but you can easily test that for yourself:
- Find the complete filename for the pack: `restic list packs | grep '^82b0562f'`
- Download the pack and check its hash: `restic cat pack | sha256sum`
In my case, running each of the problematic commands again results in the error not occurring, so I think the “in theory” clause above is what’s happening.
$ restic version
restic 0.11.0 compiled with go1.15.3 on linux/amd64
$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name -o b2.connections=15 check --read-data-subset=11/100
...
pack 6e320160 contains 1 errors: [Blob ID does not match, want 090726a0, got f97598b3]
# Ran another 3 times with no errors
# The desired blob ID in the check commands is correct:
$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name list blobs | grep ' 090726a0'
...
data 090726a0...
$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name list blobs | grep ' f97598b3'
enter password for repository:
# no output, blob not found
$
$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name check --read-data-subset=2/10
...
Pack ID does not match, want 51aa2bf4, got a7a09216
$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name cat pack 51aa2bf4... | sha256sum
51aa2bf4...
# Success, pack does not appear corrupt on second download
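The second-download check above can also be run against a local copy of the pack, which avoids fetching it again for every attempt. This is only a minimal sketch: `pack_matches_id` is a hypothetical helper name, and it assumes the pack has already been saved to a local file (for example by redirecting the output of `restic cat pack` with the full pack ID) and that the full 64-character ID is passed in, not the abbreviated one.

```shell
# Hypothetical helper (not part of restic): succeed only if the SHA-256 of a
# local file equals the expected full pack ID.
# Usage: pack_matches_id FILE FULL_PACK_ID
pack_matches_id() {
    file=$1
    want=$2
    # Reading from stdin makes sha256sum print "<digest>  -"; keep the digest.
    got=$(sha256sum < "$file" | cut -d' ' -f1)
    if [ "$got" = "$want" ]; then
        return 0
    fi
    # Mirror the wording of the check error so the two are easy to compare.
    echo "Pack ID does not match, want $want, got $got" >&2
    return 1
}
```

If the same file matches on some runs and not on others, the file on disk has not changed between runs, so the fault would have to be in reading or hashing it locally.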
Of the issues I’m seeing, `Blob ID does not match` during `check` appears to be the most specific. As I understand it, this means `restic` pulled the pack from the repository, checked its hash and found it to be intact. Then, after decrypting the pack header and one of the blobs, the hash of the decrypted blob did not match the one written to the header. The above `list blobs` indicates the hash in the header appears to be the correct one as the computed one does not appear in any pack headers.
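To make the two levels of hashing concrete, here is a toy model of the persistent case @fd0 describes, where a blob is damaged after its ID is recorded but before the pack is written. It is a sketch under heavy assumptions: real packs are encrypted and carry a header, none of which is modelled, and the file names and blob contents are made up for illustration.

```shell
# Toy model only: no encryption and no pack header, just two levels of SHA-256.
workdir=$(mktemp -d)
printf 'first blob'  > "$workdir/blob1"    # 10 bytes
printf 'second blob' > "$workdir/blob2"    # 11 bytes

# Blob IDs are recorded from the original plaintext.
id1=$(sha256sum < "$workdir/blob1" | cut -d' ' -f1)
id2=$(sha256sum < "$workdir/blob2" | cut -d' ' -f1)

# Simulate a fault after the blob was hashed but before the pack is written.
printf 'second blOb' > "$workdir/blob2"

# The pack is assembled, and its ID computed, from the already-damaged data.
cat "$workdir/blob1" "$workdir/blob2" > "$workdir/pack"
pack_id=$(sha256sum < "$workdir/pack" | cut -d' ' -f1)

# Later "check": the outer hash still agrees with the pack ID ...
if [ "$(sha256sum < "$workdir/pack" | cut -d' ' -f1)" = "$pack_id" ]; then
    pack_result="pack ok"
else
    pack_result="pack mismatch"
fi

# ... while only the second blob fails its own hash.
got1=$(head -c 10 "$workdir/pack" | sha256sum | cut -d' ' -f1)
got2=$(tail -c +11 "$workdir/pack" | sha256sum | cut -d' ' -f1)

echo "$pack_result"
[ "$got1" = "$id1" ] && echo "blob 1 ok"
[ "$got2" = "$id2" ] || echo "Blob ID does not match, want $id2, got $got2"
rm -rf "$workdir"
```

In this model the mismatch is baked into the pack and would show up on every run. An error that disappears on the next run does not fit that pattern.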
So what do we know about where this issue is occurring? First, because `check` should not be fixing anything and subsequent invocations run OK, the issue appears to be transient. The data at rest is probably OK. Second, because the pack is being hashed after being pulled from the repository, it seems to be intact with no modification at rest or in transit.
Everything points to an issue on my local machine, either in software or hardware, with decrypting or hashing packs/blobs.
I understand hardware fails and that may well be the case here. The system is a Supermicro motherboard, Xeon processor, and ECC RAM. It runs ext4 and swap on an Intel SSD, and ZFS raidz2 on an attached array. I have tried a number of things to confirm a hardware issue with online stress testing software:
- `mprime` blend torture test for 1 hour
- `linpack` using 50% RAM for 30 mins
- `stressapptest -s 300 -M 20000` (20/32GB test for 5 mins)
- `stressapptest -s 300 -M 20000 -C` (20/32GB CPU-stressful test for 5 mins)
- `stressapptest -s 300 -M 20000 -f sat1 -f sat2` (stresses disk IO path used for swap)
- Monthly ZFS scrubs have never found an error
- SMART data on all disks looks reasonable
- The kernel log does not show any machine check errors
- `rasdaemon` and `ras-mc-ctl` show no memory errors, but it does show possible disk errors on the SSD. Because the disk error stuff in `rasdaemon` is experimental and the errors don’t appear to be decoded properly, I’m reluctant to believe them. At least for the last `Blob ID does not match` error, no disk error was recorded at the time of the error.
A few questions I could use help with:
- Are there any errors in my understanding above?
- With the exception of swap, could anything in the disk IO path be involved? `/tmp` is mounted `tmpfs`. I guess the binary is loaded from disk, but I think that’s far more likely to cause a segfault than corrupt hashing at exactly the same spot.
- How can I stress this part of `restic` without using internet bandwidth? I can try getting the index and pack locally and looping `debug examine --extract-pack`. If I’m lucky that returns a non-zero error code when the equivalent of `Blob ID does not match` occurs.
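The looping idea in the last question could look roughly like the sketch below. `count_failures` is a hypothetical helper, not anything restic provides, and it only counts non-zero exit codes. Whether `debug examine --extract-pack` actually exits non-zero on a blob mismatch is exactly the open question, so this is an assumption until someone confirms it.

```shell
# Hypothetical helper: run a command N times and print how many runs
# exited with a non-zero status.
# Usage: count_failures N command [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # The command's output is discarded; only its exit status is inspected.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Intended use (placeholders left unfilled; assumes a local copy of the repo):
#   count_failures 100 restic -r <local-repo> debug examine --extract-pack <pack-id>
```

Because the output is thrown away, a variant that greps for `Blob ID does not match` in the output would be needed if the exit code turns out to be zero in that case.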
Thanks for any suggestions!