Hi @ifedorenko,
Replying to the last part of your message first:
You really mean, “once corrupted blobs were written to the repository and a restic snapshot was written”, correct? The distinction is important in my case because I had to go through about a dozen interrupted restic backup executions before one was able to finish and write its first restic snapshot containing those files…
OK, I think I see it now: either of these situations would necessarily have happened in the exact restic backup run that managed to complete, which is when the first restic snapshot got written, correct? I ask because if it had happened before this one, this last backup would have no way to check size+mtime, and would have recalculated and then just ignored the corrupted blobs… or am I still missing something?
I fully agree: ECC is mathematically guaranteed to detect and correct any 1-bit error per block, and to detect and alarm (usually causing a system panic) any 2-bit errors. Here are the differences I see on the messed up parts of these files (all offsets in decimal and all byte values in octal, as generated by cmp -l):
First File:
553 10 5
556 152 164
557 144 145
558 141 163
559 166 164
560 141 145
352877 140 40
352878 60 110
352879 41 176
352880 53 173
352881 252 57
352882 6 127
352883 325 324
Second File:
119405 20 0
119406 344 0
119407 71 0
119408 47 0
119409 252 0
119410 6 0
119411 325 0
119412 1 0
As we can see, the above would mean a lot of 1- and 2-bit errors, so even if I had bad memory failing all the time, errors like the above are all but guaranteed not to happen.
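To put rough numbers on that, here is a small Python sketch that takes cmp -l style lines (decimal offset, then the two byte values in octal) and counts how many bits differ in each byte. The names `SECOND_FILE`, `bit_flips` and `total_bit_flips` are just made up for this illustration, and the data is the second file's listing from above. How these file bytes would map onto ECC words in memory is not something I know, so treat it as a rough indication only.

```python
# Sketch: count differing bits per byte in `cmp -l` output.
# Each line is: <decimal offset> <octal byte in file 1> <octal byte in file 2>
# All names here are hypothetical helpers, not part of restic or cmp.

SECOND_FILE = """\
119405 20 0
119406 344 0
119407 71 0
119408 47 0
119409 252 0
119410 6 0
119411 325 0
119412 1 0
"""


def bit_flips(line):
    """Return (offset, number of differing bits) for one `cmp -l` line."""
    offset, a, b = line.split()
    diff = int(a, 8) ^ int(b, 8)  # XOR leaves a 1 wherever the two bytes differ
    return int(offset), bin(diff).count("1")


def total_bit_flips(cmp_output):
    """Sum the differing bits over all non-empty lines of `cmp -l` output."""
    return sum(bit_flips(l)[1] for l in cmp_output.splitlines() if l.strip())


for line in SECOND_FILE.splitlines():
    offset, flips = bit_flips(line)
    print(f"offset {offset}: {flips} bit(s) differ")
print("total differing bits:", total_bit_flips(SECOND_FILE))
```

Since those eight bytes are consecutive, the total can be compared directly against the 1-bit and 2-bit limits mentioned above.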
Of course, as I did not install that server’s memory myself (it came fully built from an Apple reseller), it could have happened that someone at the reseller swapped the ECC RAM that Apple ships it with for non-ECC RAM. But as far as I was able to find on Google, it seems a MacPro 2013 like ours won’t even boot with non-ECC RAM.
Anyway, let’s be paranoid: the next time I can schedule a shutdown on that server, I will open it up and visually check whether it really has ECC RAM.
And I think we can also discard the possibility that somehow the disk(s) where those files ended up being stored silently returned corrupted blocks when they were read by restic backup: everything here is on ZFS, which saves and verifies a strong checksum for each disk block. Such an event would have resulted in a CKSUM error being displayed by zpool status for that hypothetical device, and (as I have raidz2 redundancy) the corrupted block would then be re-read/regenerated from the pool redundancy. So, no chance of that happening here either (apart from a checksum collision, which I think is really, really unlikely, using the same rationale for SHA256 corruption in my previous email).
Would love to hear @fd0’s opinion on all this too.
But it seems beyond doubt that the blobs these files belong to are corrupted. And that would be on all snapshots since the first. What would be the best way to ‘repair’ them in the repo, as fast as possible, and losing as little as possible of those snapshots? Perhaps I can somehow identify the bad blobs and remove them from the repo, or even correct them, saving fixed ones in their place?
Cheers,
– Durval.