Troubleshooting unreproducible consistency errors

Hello! I’ve gotten a few unreproducible consistency errors and am hoping to verify my understanding of the potential points of failure and get suggestions for diagnosing them. If others think it’s a good idea I may be able to write a chapter in the manual about this topic.

I have gotten Hash does not match id during prune (details at Hash does not match id during prune · Issue #1999 · restic/restic · GitHub), Pack ID does not match during check, and Blob ID does not match during check.

@fd0 wrote the following about the last two of these errors (Dropped packets while backing up causes errors in archive - #3 by fd0):

Pack ID does not match
This is an error on the outermost level: restic requested a file from B2 for which it knows the SHA256 hash of the contents, but got something with a different hash back. The data might have been modified at rest, during transit, in the memory of the machine, or even at backup time before it could be saved to B2.

Blob ID does not match
restic requested the pack 82b0562f and the hash of the contents matched the file name, so the data was not modified in transit or at rest. But: a part of the (encrypted) data has been modified, which could only have happened between restic encrypting the data (a so-called blob; pack files contain one or more of these) and saving the data to a temp file before uploading to B2. In theory, it could also have happened during check, but you can easily test that for yourself:
Find the complete filename for the pack: restic list packs | grep '^82b0562f'
Download the pack and check its hash: restic cat pack | sha256sum

In my case, running each of the problematic commands again results in the error not occurring, so I think the “in theory” clause above is what’s happening.

$ restic version
restic 0.11.0 compiled with go1.15.3 on linux/amd64

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name -o b2.connections=15 check --read-data-subset=11/100
...
pack 6e320160 contains 1 errors: [Blob ID does not match, want 090726a0, got f97598b3]
# Ran another 3 times with no errors

# The desired blob ID in the check commands is correct:
$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name list blobs | grep ' 090726a0'
...
data 090726a0...

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name list blobs | grep ' f97598b3'
enter password for repository:
# no output, blob not found
$

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name check --read-data-subset=2/10
...
Pack ID does not match, want 51aa2bf4, got a7a09216

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name cat pack 51aa2bf4... | sha256sum
51aa2bf4...

# Success, pack does not appear corrupt on second download

Of the issues I’m seeing, Blob ID does not match during check appears to be the most specific. As I understand it, this means restic pulled the pack from the repository, checked its hash, and found it to be intact. Then, after decrypting the pack header and one of the blobs, the hash of the decrypted blob did not match the one recorded in the header. The list blobs output above indicates that the hash in the header is the correct one, since the computed hash does not appear in any pack header.

So what do we know about where this issue is occurring? First, because check should not be fixing anything and subsequent invocations run OK, the issue appears to be transient. The data at rest is probably OK. Second, because the pack is being hashed after being pulled from the repository, it seems to be intact with no modification at rest or in transit.

Everything points to an issue on my local machine, either in software or hardware, with decrypting or hashing packs/blobs.

I understand hardware fails and that may well be the case here. The system is a Supermicro motherboard, Xeon processor, and ECC RAM. It runs ext4 and swap on an Intel SSD, and ZFS raidz2 on an attached array. I have tried a number of things to check for a hardware issue using stress-testing software that can run while the system stays online:

  • mprime blend torture test for 1 hour
  • linpack using 50% RAM for 30 mins
  • stressapptest -s 300 -M 20000 (20/32GB test for 5 mins)
  • stressapptest -s 300 -M 20000 -C (20/32GB CPU-stressful test for 5 mins)
  • stressapptest -s 300 -M 20000 -f sat1 -f sat2 (stresses disk IO path used for swap)
  • Monthly ZFS scrubs have never found an error
  • SMART data on all disks looks reasonable
  • The kernel log does not show any machine check errors
  • rasdaemon and ras-mc-ctl show no memory errors, but it does show possible disk errors on the SSD. Because the disk error stuff in rasdaemon is experimental and the errors don’t appear to be decoded properly, I’m reluctant to believe them. At least for the last Blob ID does not match error, no disk error was recorded at the time of the error.

A few questions I could use help with:

  1. Are there any errors in my understanding above?
  2. With the exception of swap, could anything in the disk IO path be involved? /tmp is mounted tmpfs. I guess the binary is loaded from disk, but I think that’s far more likely to cause a segfault than corrupt hashing at exactly the same spot.
  3. How can I stress this part of restic without using internet bandwidth? I can try getting the index and pack locally and looping debug examine --extract-pack. If I’m lucky that returns a non-zero error code when the equivalent of Blob ID does not match occurs.

Thanks for any suggestions!

2 Likes

This isn’t directly responsive to any of your specific questions, but one suggestion I’d make is to try running your stress test(s) for much longer periods.

I had similar issues some time ago which I eventually traced to an intermittent memory error. I could produce the error reliably only when running a stress test (mprime/memtest86+/linpack in my case) overnight or longer (many passes). Running a stress test for only 5 minutes or 30 minutes was (speaking again to my case) unlikely to produce an error even though I commonly saw a restic consistency error over comparable periods of running time.

I was (and am) unable to defend this observation rationally without serious hand-waving…nevertheless, it was my observation. So extended stress testing might be worth a try…at the very least it could improve confidence in your hardware.

2 Likes

That’s a pretty comprehensive report, thanks!

I can see none; I think you understood it pretty well. In order for the Blob ID does not match error to occur, the following needs to happen:

  1. Restic downloads the file from the backend to a tempfile
  2. The (outer) SHA256 hash matches the filename (otherwise you’d get the Pack ID does not match error)
  3. Restic then decrypts the file header without an error
  4. Restic decrypts the blob without an error
  5. It then hashes the plaintext, and only then sees that the hash does not match the expected one
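
To make the order of operations concrete, here is a rough sketch of that sequence in Go. This is not restic’s actual code; decryptHeader and decryptBlob are hypothetical stand-ins for the routines in internal/crypto and internal/pack:

package main

import (
	"crypto/sha256"
	"fmt"
)

// Hypothetical stand-ins for restic's decryption; the real code lives in
// internal/crypto and internal/pack.
func decryptHeader(packFile []byte) ([]byte, error)                  { return nil, nil }
func decryptBlob(packFile []byte, offset, length int) ([]byte, error) { return nil, nil }

// checkBlob mirrors the sequence above for a pack file that has already been
// downloaded to a tempfile (step 1).
func checkBlob(packFile []byte, packID, blobID [sha256.Size]byte, offset, length int) error {
	// Step 2: the outer SHA256 of the whole file must match the file name,
	// otherwise the error would be "Pack ID does not match".
	if sha256.Sum256(packFile) != packID {
		return fmt.Errorf("Pack ID does not match")
	}
	// Step 3: decrypt the pack header.
	if _, err := decryptHeader(packFile); err != nil {
		return err
	}
	// Step 4: decrypt the blob.
	plaintext, err := decryptBlob(packFile, offset, length)
	if err != nil {
		return err
	}
	// Step 5: only now is the plaintext hashed and compared with the ID from
	// the header; a mismatch is reported as "Blob ID does not match".
	if got := sha256.Sum256(plaintext); got != blobID {
		return fmt.Errorf("Blob ID does not match, want %x, got %x", blobID[:4], got[:4])
	}
	return nil
}

func main() {
	// usage would be: err := checkBlob(packBytes, packID, blobID, offset, length)
	_ = checkBlob
}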

That the error you saw is transient is very odd indeed. We had reports of such behaviour before (in the thread you mentioned), but there the error wasn’t transient.

I noticed, though, that you used the --read-data-subset flag to only test a subset of the files in the repo, and the values are different in both invocations, so it can happen that the file in question is not checked in the second run. Can you retry with exactly the same flags as before (--read-data-subset=11/100)?

I don’t think so. Since restic got the correct data first (the outer SHA256 hash matches), I don’t think it’s a hardware issue :confused:

It should be the same for a local repository, apart from the code path which fetches data from B2. So you could try with a local repository…

Thanks! I forgot to mention that I was doing just that :slight_smile: I’d like to offline the machine to run memtest86 for 24 hours but that is difficult. I am however running mprime blend torture test (been running for 24 hours now). I concurrently ran two instances of dledford-memtest as memtest.sh -t $PWD/data -s linux-5.10.tar.xz -n 9999, one for each disk system. That took about 2 hours and completed the 10k iterations.

Yep, I ran that exact command another 3 times and the error did not occur. No changes (no backups or prune) wrote to the repository during that time, or since then. The subset selection n/k in 0.11.0 is deterministic, right?

I should have mentioned that the example in the original post is just one of currently 3 Blob ID does not match errors I have seen during different subset selections on the same repository. All looked exactly the same as in the example, though I only ran the exact command again once or twice. Each appeared to have the correct desired blob ID. None of the blob IDs were the same.

:open_mouth: Not what I was expecting! We’ll track it down though, whether software or hardware.

I will work on partially cloning the repository such that I can try to loop --extract-pack locally and hopefully trigger it more reliably. Do you have suggestions on the easiest way to stop execution and dump the involved structures when I’m able to trigger it again?

I thought it was appropriate to put this in the forum since I thought it was likely a hardware issue, but if you want I can move it to the issue tracker.

Thanks!

The subset selection is based on the first characters of the pack filename, so it’s deterministic.
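
For illustration only, here is a toy version of how a deterministic bucket can be derived from the pack ID prefix. This is just the shape of the idea, not restic’s actual selection code:

package main

import (
	"fmt"
	"strconv"
)

// toyBucket derives a bucket in [1, total] from the first hex characters of a
// pack file name. Because it depends only on the name, rerunning it over the
// same repository always selects the same packs.
func toyBucket(packName string, total uint64) (uint64, error) {
	prefix, err := strconv.ParseUint(packName[:4], 16, 64)
	if err != nil {
		return 0, err
	}
	return prefix%total + 1, nil
}

func main() {
	for _, name := range []string{"6e320160...", "51aa2bf4...", "9aa3061f..."} {
		b, _ := toyBucket(name, 10)
		fmt.Printf("%s -> group %d of 10\n", name, b)
	}
}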

Cool, I’m up for it! It has been bugging me for a long time, and I’m eager to get to the bottom of it.

We can do it here for now, let’s see if something turns up.

I’d need to build something specifically to track this down, but that’s not a problem. My idea is to add code to the checker so that it dumps all relevant data to /tmp so we can then analyze it, then try to reproduce the issue. Would that work for you?

Can you build a custom version of restic somewhere from source (cross-compilation is easy) or shall I provide a binary for you to run?

How far can we go in finding out what’s going on? Would you be comfortable sharing the data restic dumps with us (or at least me)? Would it be okay for you to even share the low-level encryption master key? That may be helpful at some point, but with more time (and more round trips) it should also work without revealing such details.

Here’s a dump of ideas that you can check:

  • The wrong plaintext was written during backup. Can you check if you are able to read the blob and get the right plaintext? Find the full ID: restic list blobs | grep "^090726a0", then run restic cat blob <id> | sha256sum
  • Make sure the pack file is correctly named. Find the full pack ID: restic list packs | grep "^6e320160", then run restic cat pack <id> | sha256sum

We’re using an optimized version of the sha256 hash (which uses SIMD instructions), maybe that one is faulty?
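
If you want to poke at that theory directly, here is a small standalone loop comparing the two implementations on random data. It assumes the SIMD implementation is github.com/minio/sha256-simd (the one restic pulls in), so treat it as a sketch rather than a test from the restic code base:

package main

import (
	"crypto/rand"
	stdsha256 "crypto/sha256"
	"fmt"
	"log"

	simdsha256 "github.com/minio/sha256-simd"
)

func main() {
	buf := make([]byte, 8*1024*1024) // 8 MiB of fresh random data per round
	// Runs until a mismatch is found (or forever); stop it with Ctrl-C.
	for i := 0; ; i++ {
		if _, err := rand.Read(buf); err != nil {
			log.Fatal(err)
		}
		std := stdsha256.Sum256(buf)
		simd := simdsha256.Sum256(buf)
		if std != simd {
			log.Fatalf("mismatch after %d rounds: std %x, simd %x", i, std, simd)
		}
		if i%1000 == 0 {
			fmt.Printf("%d rounds OK\n", i)
		}
	}
}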

I’ll work on some code to dump all the data required for further analysis now.

I’ve pushed a commit to the branch debug-wrong-hash (commit). You can check it out and build it as follows:

$ git clone https://github.com/restic/restic
$ cd restic
$ git checkout debug-wrong-hash
$ go build -o restic ./cmd/restic

Then run restic check again with your settings. As soon as the most specific error (Blob ID does not match) appears, we’ll get the following information (I’ve injected a fault artificially):

$ ./restic check --read-data
using temporary cache in /tmp/zsh-1000/restic-check-cache-177529041
repository 15d74e04 opened successfully, password is correct
created new cache in /tmp/zsh-1000/restic-check-cache-177529041
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
pack d315341496ea08403abb4184450268218ead3730a206f59e1889c62b12db6d12: blob 23 (offset 96979, length 5930): ID does not match
  want           5a287919ca64847794bf605e89241673d27b511acaf7cdedfbcfe6b126a39c30
  got            aa6389495e1a88b75bb831fa2e1702037b952d997a180f66b15ce181f31c06d6
  crypto/sha256  aa6389495e1a88b75bb831fa2e1702037b952d997a180f66b15ce181f31c06d6
  SIMD sha256    aa6389495e1a88b75bb831fa2e1702037b952d997a180f66b15ce181f31c06d6
pack becca96c5bdb27956947b8420338e49f1f84db75dfd11ffa462c541e6deb806f: blob 23 (offset 168805, length 3200): ID does not match
  want           7e0a63deafe0367ca144c3ddc47290c3f5c84181d2499f2c8ad645e7358344e8
  got            c5aea5684f219b43dbfd0b9db20a6a512ba8fef6b621d70c7ae2c56fd1742500
  crypto/sha256  c5aea5684f219b43dbfd0b9db20a6a512ba8fef6b621d70c7ae2c56fd1742500
  SIMD sha256    c5aea5684f219b43dbfd0b9db20a6a512ba8fef6b621d70c7ae2c56fd1742500
pack d3153414 contains 1 errors: [Blob ID does not match, want 5a287919, got aa638949]
pack becca96c contains 1 errors: [Blob ID does not match, want 7e0a63de, got c5aea568]
[0:00] 100.00%  7 / 7 packs
Fatal: repository contains errors
[1]    39732 exit 1     ./restic check --read-data

It’ll write all kinds of different data to /tmp, so we can analyze it afterwards:

$ ls -al /tmp/*{.buf,.plaintext,.raw,.pack}
-rw------- 1 fd0 users    5930 Dez 20 14:50 /tmp/5a287919ca64847794bf605e89241673d27b511acaf7cdedfbcfe6b126a39c30.buf
-rw------- 1 fd0 users    5898 Dez 20 14:50 /tmp/5a287919ca64847794bf605e89241673d27b511acaf7cdedfbcfe6b126a39c30.plaintext
-rw------- 1 fd0 users    5930 Dez 20 14:50 /tmp/5a287919ca64847794bf605e89241673d27b511acaf7cdedfbcfe6b126a39c30.raw
-rw------- 1 fd0 users    3200 Dez 20 14:50 /tmp/7e0a63deafe0367ca144c3ddc47290c3f5c84181d2499f2c8ad645e7358344e8.buf
-rw------- 1 fd0 users    3168 Dez 20 14:50 /tmp/7e0a63deafe0367ca144c3ddc47290c3f5c84181d2499f2c8ad645e7358344e8.plaintext
-rw------- 1 fd0 users    3200 Dez 20 14:50 /tmp/7e0a63deafe0367ca144c3ddc47290c3f5c84181d2499f2c8ad645e7358344e8.raw
-rw------- 1 fd0 users 5351288 Dez 20 14:50 /tmp/becca96c5bdb27956947b8420338e49f1f84db75dfd11ffa462c541e6deb806f.pack
-rw------- 1 fd0 users 7005709 Dez 20 14:50 /tmp/d315341496ea08403abb4184450268218ead3730a206f59e1889c62b12db6d12.pack

The most interesting thing would be to run sha256sum on the *.plaintext files, everything else depends on the output of restic check.

I’d want to make sure the data wasn’t my plaintext list of missile launch codes, but I will most likely be fine sharing both with you or other developers so long as they aren’t posted publicly.

Confirmed OK.

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name cat blob 73b786ec7e6c1ea7d0d1ae078cccec07248349b45d000f11def2f5f9babee72b | sha256sum
enter password for repository:
73b786ec7e6c1ea7d0d1ae078cccec07248349b45d000f11def2f5f9babee72b  -

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name cat blob 090726a0a3fdec0aad46a727875b7ff3598f6a4dd74c4410b21085d02cf9378f | sha256sum
enter password for repository:
090726a0a3fdec0aad46a727875b7ff3598f6a4dd74c4410b21085d02cf9378f  -

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name cat blob 2b9f1aeceea423735f1d2a31153888c65ea619d27cf8b57478c56038cbbafa8a | sha256sum
enter password for repository:
2b9f1aeceea423735f1d2a31153888c65ea619d27cf8b57478c56038cbbafa8a  -

Also confirmed OK:

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name cat pack fe2e225c16ff29e0445b2953e08e6c3c65daacda27d89016469f9af5d25ae236 | sha256sum
enter password for repository:
fe2e225c16ff29e0445b2953e08e6c3c65daacda27d89016469f9af5d25ae236  -

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name cat pack 6e320160a550fe12222aaf2fecefa0e039181faeecac4000698340f4a6a062fc | sha256sum
enter password for repository:
6e320160a550fe12222aaf2fecefa0e039181faeecac4000698340f4a6a062fc  -

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name cat pack 7ff4089107bf42674cc417221cf051b6c9ab72acf02a73eb34dd16468056f036 | sha256sum
enter password for repository:
7ff4089107bf42674cc417221cf051b6c9ab72acf02a73eb34dd16468056f036  -

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY restic -r b2:bucket-name cat pack 51aa2bf43d75554d7f9bcfe8a19510d5da3b8c5be249410f076d76775b130e75 | sha256sum
enter password for repository:
51aa2bf43d75554d7f9bcfe8a19510d5da3b8c5be249410f076d76775b130e75  -

So, more verification that the data at rest is good and that the issue is transient.

I have considered if this could have anything to do with Intel security issues and the associated microcode updates. It’s an older CPU (Xeon E3-1220 v1). spectre-meltdown-checker reports it’s running the latest microcode, released in 2019, and that that microcode is not -known- to have stability issues.

For hardware stress testing, I ran linpack concurrently with mprime for 3 hours before switching to the below, no issues.

I set up 3 partial copies of the repository locally, did an add key, and ran 3 instances of the following fish shell loop with different directories and different pack IDs:

while test $status -eq 0
    env RESTIC_PASSWORD=not-my-password DEBUG_LOG=$PWD/restic-debug.log ~/restic-debug-1999/restic -r $PWD --no-cache debug examine --extract-pack fe2e225c16ff29e0445b2953e08e6c3c65daacda27d89016469f9af5d25ae236
end

Looking through the code in the 1999 branch, it doesn’t look like debug examine will exit with nonzero status in the case of my issue. But all the .bin files written out have the correct- prefix, and grep -i 'error' restic-local-repo*/restic-debug.log | grep -v 'repository/key.go:153' yields nothing. The exclusion there is because the primary key wasn’t changed and it throws an error when it attempts it.

3 instances of that loop (again, different local repos, different pack IDs) running for ~18 hours have failed to reproduce the issue. I’m not yet ready to say that this path won’t produce the issue, but I’m getting there. I note that it spends most of its time examining the indexes. Will deleting snapshots reduce the number of indexes?

I’m using --no-cache in the loop to avoid any potential issues with the multiple processes running with a shared cache. I don’t see that affecting this issue, but it’s conceivable. Are we sure a disk issue couldn’t be at fault here?

Thanks! I have started this running.

No, but restic rebuild-index will.

I’m confident, yes, because the issues presented for a data blob, and pack files which contain only data blobs are not cached at all.

So I ran the three processes looping as described above, and I ran the debug-wrong-hash branch exactly as I had originally run it to produce this error. They ran for ~48 hours and check got through about half my repository. No errors! Those were compiled as described above with go 1.13.

THEN I recompiled with go 1.15.3, using the flags and environment variables that make the build more like the version (obtained via self-update) that caused the issue:

$ ~/restic-debug-wrong-hash> env GOOS=linux GOARCH=amd64 CGO_ENABLED=0 ~/go_1_15_3/go/bin/go build -a -ldflags "-s -w" -tags "selfupdate" -o /tmp/restic_debug_wrong_hash_nocgo ./cmd/restic

Within 15 minutes this build erred:

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY /tmp/restic_debug_wrong_hash_nocgo -r b2:bucket-name check --read-data-subset=5/10
[sudo] password for craig:
using temporary cache in /tmp/restic-check-cache-055412826
enter password for repository:
repository 01618c22 opened successfully, password is correct
created new cache in /tmp/restic-check-cache-055412826
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read group #5 of 38708 data packs (out of total 380897 packs in 10 groups)
pack 9aa3061fff7d55d5ebffde3e0b5ded32c036b77e030368800bde3f47f052dc86: blob 2 (offset 106847, length 3904461): ID does not match
  want           4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114
  got            c22dea2410d7b241ba96fd84c7d0f011c5de32931ba5d295fa7d540cc3ff0b73
  crypto/sha256  c22dea2410d7b241ba96fd84c7d0f011c5de32931ba5d295fa7d540cc3ff0b73
  SIMD sha256    c22dea2410d7b241ba96fd84c7d0f011c5de32931ba5d295fa7d540cc3ff0b73
pack 9aa3061f contains 1 errors: [Blob ID does not match, want 4e6113bf, got c22dea24]
  signal interrupt received, cleaning up
checkPack: Load: context canceled
checkPack: Load: context canceled
checkPack: Load: context canceled
[13:11] 5.89%  2278 / 38708 packs
Fatal: repository contains errors
$ ls -al /tmp/*{.buf,.plaintext,.raw,.pack}
-rwxrwxrwx 1 restic restic 3904461 Dec 22 14:08 /tmp/4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114.buf*
-rwxrwxrwx 1 restic restic 3904429 Dec 22 14:08 /tmp/4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114.plaintext*
-rwxrwxrwx 1 restic restic 3904461 Dec 22 14:08 /tmp/4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114.raw*
-rwxrwxrwx 1 restic restic 4750691 Dec 22 14:08 /tmp/9aa3061fff7d55d5ebffde3e0b5ded32c036b77e030368800bde3f47f052dc86.pack*

$ sha256sum /tmp/*{.buf,.plaintext,.raw,.pack}
034bb77650028f352718183fa4e412c1d1f6d8de8edf8c67177ff4dc29b184ed  /tmp/4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114.buf
c22dea2410d7b241ba96fd84c7d0f011c5de32931ba5d295fa7d540cc3ff0b73  /tmp/4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114.plaintext
6ae1497be85e75e6f2ec171e98778ae58a3abb1a3f59d96da444863e9ec330d1  /tmp/4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114.raw
9aa3061fff7d55d5ebffde3e0b5ded32c036b77e030368800bde3f47f052dc86  /tmp/9aa3061fff7d55d5ebffde3e0b5ded32c036b77e030368800bde3f47f052dc86.pack

$ sudo -u restic --preserve-env=B2_ACCOUNT_ID,B2_ACCOUNT_KEY /tmp/restic_debug_wrong_hash_nocgo -r b2:bucket-name cat blob 4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114 > 4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114_good
[sudo] password for craig:
enter password for repository:

$ sha256sum 4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114_good
4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114  4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114_good

$ cmp /tmp/4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114.plaintext 4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114_good
/tmp/4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114.plaintext 4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114_good differ: byte 2665953, line 1698

# Note: the commands below start 32 bytes before the first differing byte
$ tail -c +2665921 /tmp/4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114.plaintext | head -c64
ent or inactive","0,":"PIR senso��3 �(�O�"�O��, inactive","BEDN⏎

$ tail -c +2665921 4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114_good | head -c64
ent or inactive","0,":"PIR sensor not present or inactive","BEDN⏎

$ tail -c +2665921 /tmp/4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114.plaintext | head -c64 | hexdump
0000000 6e65 2074 726f 6920 616e 7463 7669 2265
0000010 222c 2c30 3a22 5022 5249 7320 6e65 6f73
0000020 f5c7 2033 28f6 4fa6 22e0 4ff0 87fd 2cc0
0000030 6920 616e 7463 7669 2265 222c 4542 4e44
0000040

$ tail -c +2665921 4e6113bf2ffdf5ba9afb8c7a1f1cede8344b07fee70cef82c919b4e7c89f5114_good | head -c64 | hexdump
0000000 6e65 2074 726f 6920 616e 7463 7669 2265
0000010 222c 2c30 3a22 5022 5249 7320 6e65 6f73
0000020 2072 6f6e 2074 7270 7365 6e65 2074 726f
0000030 6920 616e 7463 7669 2265 222c 4542 4e44
0000040

Looks like 16 bytes are corrupt in the decryption!

The looping processes were also recompiled as

$ env GOOS=linux GOARCH=amd64 CGO_ENABLED=0 ~/go_1_15_3/go/bin/go build -a -ldflags "-s -w" -tags "debug selfupdate" -o /tmp/restic_1999_nocgo ./cmd/restic

and just now I got a different error: 2020/12/22 14:24:05 repository/key.go:153 repository.SearchKey.func1 1 key 94b1ff.... returned error ciphertext verification failed. Restarting the process using the same command works as expected - something with this build is definitely different.

I’ll test next with my system version of go but keeping CGO_ENABLED=0 and the linker flags.

2 Likes

I ran with system go version 1.13 for a few hours without issues, so I’m… 20% confident it isn’t affected.

I switched to the latest, 1.15.6, and one of my --extract-pack loops failed checksum, but not in a way consistent with the above:

$ sha256sum wrong-hash-5891a4075ad59495b904cb6bc9fb1802680ad3680a09e90cc6a96439313d7678.bin correct-090726a0a3fdec0aad46a727875b7ff3598f6a4dd74c4410b21085d02cf9378f.bin
090726a0a3fdec0aad46a727875b7ff3598f6a4dd74c4410b21085d02cf9378f  wrong-hash-5891a4075ad59495b904cb6bc9fb1802680ad3680a09e90cc6a96439313d7678.bin
090726a0a3fdec0aad46a727875b7ff3598f6a4dd74c4410b21085d02cf9378f  correct-090726a0a3fdec0aad46a727875b7ff3598f6a4dd74c4410b21085d02cf9378f.bin

restic thought the hash was wrong, but it’s the same as the correct one. Unfortunately I don’t have the stdout (now being saved) and it didn’t write anything different to the debug log as I expected it to.

This might be a different bug. In my post above, decryption produced corrupted plaintext while the hashing of that plaintext was consistent. Here the decryption seems to have been performed correctly, yet the hashing failed.

Thanks for the feedback! This becomes even more mysterious the longer I look at it. I’ve verified that nothing changed in the relevant code between Go 1.13 and 1.15.6. :confused:

You observed that exactly 16 bytes were corrupted, aligned on a block boundary (16 bytes is the block size for AES). We’re using AES-CTR, which works by generating a key stream of 16-byte blocks, which are then XORed byte-wise onto the ciphertext to produce the plaintext.
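
To illustrate the mechanics with the standard library only (not restic’s internal/crypto): if a single keystream block is computed incorrectly, exactly that 16-byte window of plaintext comes out garbled while everything around it stays intact, which is the pattern in the hexdump above.

package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"encoding/binary"
	"fmt"
)

func main() {
	key := make([]byte, 32) // AES-256; all-zero key, just for the demo
	iv := make([]byte, aes.BlockSize)
	plaintext := bytes.Repeat([]byte("0123456789abcdef"), 4) // 4 blocks

	// Encrypt normally with the standard library's CTR mode.
	block, _ := aes.NewCipher(key)
	ciphertext := make([]byte, len(plaintext))
	cipher.NewCTR(block, iv).XORKeyStream(ciphertext, plaintext)

	// Decrypt "by hand": the keystream is E_k(counter) for each 16-byte
	// block, XORed onto the ciphertext. Simulate a single miscomputed
	// keystream block by feeding a corrupted counter for block 2.
	decrypted := make([]byte, len(ciphertext))
	for blk := 0; blk*aes.BlockSize < len(ciphertext); blk++ {
		counter := make([]byte, aes.BlockSize)
		binary.BigEndian.PutUint64(counter[8:], uint64(blk)) // IV is zero here
		if blk == 2 {
			counter[0] ^= 0xff // the simulated CPU/memory fault
		}
		keystream := make([]byte, aes.BlockSize)
		block.Encrypt(keystream, counter)
		for i := 0; i < aes.BlockSize; i++ {
			decrypted[blk*aes.BlockSize+i] = ciphertext[blk*aes.BlockSize+i] ^ keystream[i]
		}
	}

	// Only block 2 (offsets 32..47) comes out garbled; the rest of the
	// plaintext round-trips correctly, matching the observed 16-byte corruption.
	for i := range plaintext {
		if plaintext[i] != decrypted[i] {
			fmt.Println("byte", i, "differs")
		}
	}
}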

I’ve verified that we’re not using the decryption routines from the Go standard library incorrectly (well, at least not in an obvious way).

The decryption involves AES extension in the CPU as well as SSE2 (for doing the XOR) and the system memory.

I’ve pushed another commit that modifies a test benchmark function so that the exact decryption procedure the checker uses is tested; can you please run the benchmark and see if it fails? If it fails, it writes data to /tmp again (commit is here):

git pull
go test -run xxx -bench Decrypt -benchtime 20m ./internal/crypto

You can also run that with different Go versions or for a longer time (then adjust -benchtime accordingly)… Let’s see if something turns up. If an error is detected, it fails instantly.
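
For anyone following along, the general shape of such a self-checking benchmark looks roughly like the sketch below. This is a generic illustration, not the actual benchmark in internal/crypto; xorDemo is a hypothetical stand-in for the encrypt/decrypt pair:

package sketch // save as e.g. decrypt_sketch_test.go

import (
	"bytes"
	"testing"
)

// xorDemo stands in for the real encrypt/decrypt pair; the point here is only
// the benchmark structure, not the cryptography.
func xorDemo(data []byte) []byte {
	out := make([]byte, len(data))
	for i, b := range data {
		out[i] = b ^ 0x5a
	}
	return out
}

// The pattern: perform the operation on every iteration and abort immediately
// on a wrong result, so a long -benchtime run doubles as a stress test.
func BenchmarkDecryptSketch(b *testing.B) {
	plaintext := bytes.Repeat([]byte{0x42}, 1<<20)
	ciphertext := xorDemo(plaintext)
	b.SetBytes(int64(len(plaintext)))
	for i := 0; i < b.N; i++ {
		if got := xorDemo(ciphertext); !bytes.Equal(got, plaintext) {
			b.Fatalf("result mismatch after %d iterations", i)
		}
	}
}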

I’m so confused. On Go 1.15.6 I see the same type of error as on 1.15.3:

pack 9cbacebe293291081dc359b103c0e8cb4372e5c6f8d30ffbe294d57cfad4dcfb: blob 2 (offset 2142834, length 6163824): ID does not match
  want           4a959ab43cfa1846c1dca4735c3cf666c07e493717bf22ad7241cd39229b270d
  got            0a21b589d4f88f7818a37c06b7f9ee6002fe97615c9ca25abf2cac98cbc6f635
  crypto/sha256  0a21b589d4f88f7818a37c06b7f9ee6002fe97615c9ca25abf2cac98cbc6f635
  SIMD sha256    0a21b589d4f88f7818a37c06b7f9ee6002fe97615c9ca25abf2cac98cbc6f635
pack 9cbacebe contains 1 errors: [Blob ID does not match, want 4a959ab4, got 0a21b589]

The first line (want) does not match the other 3, which all agree → suggests a crypto issue, not a hashing issue.

But on 1.14 I see a different issue, one where restic.Hash(plaintext) disagrees with both crypto/sha256 and SIMD sha256, which agree with each other (and with the expected ID). But I get the feeling that running it through restic.Hash again would have come up with the correct result.

pack ec845d3f85ee06fa0637d408de009101da275d92dbd9afa524d875e5441162a3: blob 0 (offset 0, length 1644856): ID does not match
  want           3122e1158b2cbbf39c047a80677f59973ae221c7c394c7b988c23c0b9681f0cb
  got            b48b74038e8e23c9a5c1b894c66f5c294d747e53aba296bab4d4d70af93fef88
  crypto/sha256  3122e1158b2cbbf39c047a80677f59973ae221c7c394c7b988c23c0b9681f0cb
  SIMD sha256    3122e1158b2cbbf39c047a80677f59973ae221c7c394c7b988c23c0b9681f0cb
pack ec845d3f contains 1 errors: [Blob ID does not match, want 3122e115, got b48b7403]
pack b0ea15231b37984347baa3bcb32522a0376a18a015a98fbb73421ac395214c33: blob 2 (offset 2170750, length 1915009): ID does not match
  want           4a49a89e834eba3ee23902ef81f84b798369c0acba477736a5dbeb349594b7d2
  got            3be95bc3a51438235929841a7195b0edeed2922758dfd9a19565e5447b865288
  crypto/sha256  4a49a89e834eba3ee23902ef81f84b798369c0acba477736a5dbeb349594b7d2
  SIMD sha256    4a49a89e834eba3ee23902ef81f84b798369c0acba477736a5dbeb349594b7d2
pack b0ea1523 contains 1 errors: [Blob ID does not match, want 4a49a89e, got 3be95bc3]

# ran cat blob to pull down the blobs, hashes match
$ sha256sum *_good
4a49a89e834eba3ee23902ef81f84b798369c0acba477736a5dbeb349594b7d2  4a49a89e834eba3ee23902ef81f84b798369c0acba477736a5dbeb349594b7d2_good
3122e1158b2cbbf39c047a80677f59973ae221c7c394c7b988c23c0b9681f0cb  3122e1158b2cbbf39c047a80677f59973ae221c7c394c7b988c23c0b9681f0cb_good

$ sha256sum /tmp/*.plaintext
4a49a89e834eba3ee23902ef81f84b798369c0acba477736a5dbeb349594b7d2  /tmp/4a49a89e834eba3ee23902ef81f84b798369c0acba477736a5dbeb349594b7d2.plaintext
3122e1158b2cbbf39c047a80677f59973ae221c7c394c7b988c23c0b9681f0cb  /tmp/3122e1158b2cbbf39c047a80677f59973ae221c7c394c7b988c23c0b9681f0cb.plaintext

It’s possible a bug exists that triggers both these symptoms consistently on a per-version basis and I just haven’t run them long enough to trigger both kinds. I’ll try to keep track of the different versions and issues I’ve seen.

Thanks! I’ll give this a shot.

1 Like

Another idea about what may be happening here is that somehow different goroutines within restic use the same memory. Can you please try to reproduce the issue again after building restic with the -race flag (go build -race [...])? If the race detector detects a race, we’ll get the stack trace for the two goroutines involved (which is really helpful).

When I first tried to do this, I got

~/restic-1999> env GOOS=linux GOARCH=amd64 CGO_ENABLED=0 ~/go_1_14/go/bin/go build -a -race -ldflags "-s -w" -tags "debug selfupdate" -o /tmp/restic_1999_nocgo ./cmd/restic
go build: -race requires cgo; enable cgo by setting CGO_ENABLED=1

~/restic-1999> env GOOS=linux GOARCH=amd64 CGO_ENABLED=1 ~/go_1_14/go/bin/go build -a -race -ldflags "-s -w" -tags "debug selfupdate" -o /tmp/restic_1999_nocgo ./cmd/restic
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed
<runtime stack>

That led me to these GitHub issues: runtime: mlock of signal stack failed: 12 · Issue #37436 · golang/go · GitHub, runtime: memory corruption on Linux 5.2+ · Issue #35777 · golang/go · GitHub, and this kernel bug: 205663 – AVX register corruption from signal delivery. I’m on kernel version 5.3.1, so it seems very plausible this is causing all the issues. The test program in the bugzilla triggers the corruption on my machine.

Unless you’d prefer to pursue other avenues, I’ll upgrade my kernel and re-test restic.

2 Likes

Oh wow, that matches pretty well! Btw, since the bug is in the kernel, it also affects other programs, so now you know…

I’m pretty amazed how well restic’s internal checking mechanisms work :slight_smile:

1 Like

Setting the environment variable GODEBUG=asyncpreemptoff=1 should then be able to work around the corruption. But upgrading the kernel is probably the better solution.

According to ianlancetaylor in the golang issue thread, that will reduce but not eliminate the probability of corruption.

Asynchronous preemption was introduced in Go 1.14, and that does the right dance to trigger the kernel issue. A mitigation in Go itself, in the form of mlocking the signal stack, was also introduced in Go 1.14. Due to issues with it, it was disabled for Ubuntu’s 5.4 kernels in Go 1.15 and removed entirely for Go 1.16.

Because I experienced the corruption with Go 1.15 and I’m not on Ubuntu, I can think of a few things:

  • The mitigation incorrectly detected my OS or kernel version and was not enabled
  • The mitigation is broken/ineffective on my system
  • The mitigation did not cover all cases, perhaps something specific to restic
  • This is not the bug we are looking for

For others that could be experiencing this issue, “This bug was introduced in Linux 5.2 and fixed in 5.3.15, 5.4.2 and all 5.5 and later kernels. The fix was also back-ported by major distros.” - Ref

After some pain I upgraded to kernel 5.4.80-2-lts and am running tests. Thus far things look good, but it’s too early to call it.

3 Likes

I ran tests using Go 1.15.3 with my kernel updated to 5.4.80-2-lts. Other packages on the system had to be updated as well, but I don’t know of any that would have affected system stability. I have encountered none of the previous issues after lengthy testing and am now fairly confident that the kernel bug was the root of the issue. Glad there’s no issue with my hardware and even more glad there’s no issue with restic.

Someone in one of the bug threads suggested that there would be a ‘vanishingly small’ number of users on affected kernels, and yet as one of those users I have to wonder how many others there are. Does it make any sense for me to write something for the docs about troubleshooting consistency errors and include checking the kernel version?

Thanks to everyone who helped track this down!

4 Likes