Possible backup bug? Or bad backend?

I do believe this was all due to #4523. I always use max compression, and this was a fairly substantial repo, so I'd wager my chances of hitting the bug were higher than the average user's. Could be wrong, but I'm very happy this bug was found, because this issue was making me paranoid lol

Only thing is, I never did find any errors in my repo… but I’ll do more testing with v0.16.1 and hope for the best.

Keeping my fingers crossed!

The data corruption issue in #4523 is 100% reproducible. That is, the problem won't randomly disappear, as seems to have happened here, if I remember correctly.

check --read-data reports any damage caused by #4523. If check does not complain about blobs with an unexpected hash, then you're not affected by this specific issue.
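
For anyone following along, that check looks something like this (the repository path is just a placeholder):

```
# Read back and verify every pack file in the repository
restic -r /path/to/repo check --read-data
```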

I took a look at the pictures and at their hex dumps. While I did not parse every line, I noticed that the pictures seem to be identical up to a certain byte; after that, something very odd is going on. The first two groups of 8 bits change from 0 to 8, then there are four matching bits. Far more interesting, though, are the following 8 bytes, which in the original form are simply 0. In the corrupted form they are:
"228a 228a a288 8a28", which looks to me like either an error code that was not caught by restic, or a bit of randomness (an out-of-bounds memory read, or corruption from external factors such as bit flips induced by whatever).

So let us take a look at 8a28, which occurs 10 times in the correct file and 18 times in the corrupted one. Now here is the interesting part: at byte 0x2ac in the correct file, 0x8a28 is preceded by 0xa288; every other occurrence is not preceded by it. In the corrupted file, however, these two bytes occur quite often, almost always preceded by at least 0xa288, and many times accompanied by the other 4 bytes mentioned above.

So, as my time is running out for now, I conclude that this is probably not true randomness, and I can imagine that corruption is happening somewhere while reading. To me it looks like either an out-of-bounds read or an error code not being caught.

DISCLAIMER: My analysis is based on grouping the data into two-byte pairs and not, as I would prefer, at the bit level, which might catch more symmetries, such as a bit drop somewhere. But with approximately 4 KB missing, I can't say. Also, the endings do not seem to match at the bit level, which would rule out a bit drop/insertion. I just wanted to note that my analysis is not free of potential errors.
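
For anyone who wants to repeat this kind of comparison, here is a rough sketch of what I did, with hypothetical filenames:

```
# Show the first differing byte offsets between the two files
cmp -l original.jpg corrupted.jpg | head

# Rough count of the 16-bit pattern 8a28 in each file
# (the hex stream is searched as text and is not pair-aligned,
#  so treat the numbers as estimates)
xxd -p original.jpg  | tr -d '\n' | grep -o '8a28' | wc -l
xxd -p corrupted.jpg | tr -d '\n' | grep -o '8a28' | wc -l
```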

:rofl: Same here! But I'm using Restic and Duplicati, just in case.

Shoot. I was hoping that was it. I haven’t been able to reproduce this issue. The drive I was backing up to appears healthy. No reallocated sectors, no I/O errors, no UDMA errors.

It's not my primary backup, so I've fully overwritten the backup several times (~15.5 TB), using rsync with --checksum on a second pass for verification. I can't get it to do anything out of the ordinary, so I don't know what it was… :man_shrugging:t2:
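
For reference, the verification pass I mean is roughly the following; source and destination paths are placeholders:

```
# First pass: normal copy
rsync -a /source/ /backup/

# Second pass: re-read both sides and compare full file contents,
# reporting any differences instead of transferring anything
rsync -a --checksum --itemize-changes --dry-run /source/ /backup/
```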

Ha! I bet this was the compression bug fixed in 0.16.4. I always use max compression for non-local backends. :+1:

EDIT: Oh wait, I just realized that’s probably the same thing as #4523 lol. Forgot all about that. Oh well. Knock on wood, haven’t had any other issues since upgrading. :man_shrugging:t2:

Only 0.16.0 and 0.16.3 are affected by compression bugs. The bug in 0.16.3 does not exist in 0.16.2.

Ah, I was on 0.16.0 - but you said earlier that my issue probably wasn't caused by that compression bug. I haven't had the issue since, either - knock on wood. Oh well, we may never know!

@MichaelEischer You’ll be happy to know the new bitrot detection is working well in the v0.17.0 beta. I held onto that corrupted snapshot. I now get:

M?   /Volumes/Media/Photos & Videos/_To Sort/!Images/Pictures/2010/2010-08-21 (SAHS Band)/IMG_3952 [CCD356CA].JPG
M?   /Volumes/Media/Photos & Videos/_To Sort/!Images/Pictures/2010/2010-08-21 (Selfies)/IMG_0001 [2CC5C403].JPG
M?   /Volumes/Media/Photos & Videos/_To Sort/!Images/Pictures/2010/2010-08-21 (Selfies)/IMG_0003 [AC7B8288].JPG
M?   /Volumes/Media/Photos & Videos/_To Sort/!Images/Pictures/2010/2010-08-21 (Selfies)/IMG_0004 [A1A2645F].JPG
M?   /Volumes/Media/Photos & Videos/_To Sort/!Images/Pictures/2010/2010-08-21 (Selfies)/IMG_0005 [901D19E0].JPG

But if I compare the “b4_cshatag” snapshot with the “bugcheck” snapshot I ran after I found the corrupted snapshot, there’s no “M?”.

I never figured out why it happened, and as far as I know, it's never happened again. But now I at least have a way to check: run backup --force and then pipe diff through grep, looking for "M?" lines!
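
In practice that check looks something like this; the repository path and snapshot IDs are placeholders:

```
# Re-read all file contents even if the metadata looks unchanged
restic -r /path/to/repo backup --force /Volumes/Media

# Compare against the previous snapshot and look for "M?" entries
# (content changed although the metadata did not)
restic -r /path/to/repo diff OLD_SNAPSHOT_ID NEW_SNAPSHOT_ID | grep 'M?'
```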

I had this happen again recently. On a different Mac, totally different hardware, backing up 2 TB with Restic. This time, though, there had been a kernel panic right after the first backup. The snapshot itself completed successfully. After a reboot, I started a second snapshot for good measure and did a diff. I noticed data had changed when I didn't expect it to, and restic diff showed "M?" beside those files. I did a third snapshot: no differences between snapshots 1 and 3, but the same files showed "M?" between 1 and 2, and between 2 and 3.

I think I might have figured out the culprit. The only thing in common between the two Macs is the SAT SMART Driver I installed to use DriveDx on USB volumes. After that kernel panic right after taking a huge snapshot, I fed the log file through ChatGPT… and that driver appeared to be the cause of the panic.

It only happened sporadically. I'd also notice occasional blips when doing SnapRAID scrubs: a file would appear corrupt, but I have the CRC32 in nearly every filename on that array, and it would always test out fine. Checking it with SnapRAID a second time would also show it as fine. I don't think I have two Macs with similarly bad hardware; I'm pretty sure it's this third-party kernel extension. It always happened on USB devices, too, never on the Thunderbolt drives, and lord knows I write terabytes to those as well.
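
For what it's worth, the filename-CRC check I keep mentioning boils down to something like this. The filename is just an example of the naming scheme, and it assumes a crc32 utility (e.g. the one shipped with libarchive-zip-perl) that prints the checksum in hex:

```
# Compare the CRC32 embedded in the filename, e.g. "IMG_3952 [CCD356CA].JPG",
# with a freshly computed one
f='IMG_3952 [CCD356CA].JPG'
expected=$(printf '%s' "$f" | grep -o '\[[0-9A-Fa-f]\{8\}\]' | tr -d '[]' | awk '{print tolower($0)}')
actual=$(crc32 "$f" | awk '{print tolower($1)}')
[ "$expected" = "$actual" ] && echo "OK: $f" || echo "MISMATCH: $f"
```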

Soooo once again, Restic is the canary in the coal mine. I can't be certain yet; I'll have to do a few more day-long SnapRAID scrubs… but I think I figured it out!

It's ironic that I posted a screenshot of DriveDx up above to show my drives were okay, while the driver I relied on to produce it was most likely the culprit haha

Also no wonder it’s buggy… this driver hasn’t been updated since 2014 lol


Update: After scrubbing 50% of my array (total size: 54.5 TiB), I found three errors. No kernel panics this time—possibly thanks to removing that driver, which might have been the culprit—but I still encountered some random errors. To confirm whether they were real, I ran snapraid -e check before fixing them, and… they turned out to be fine. I don’t have any third-party drivers in common between the two systems, but there is one other shared element: both are connected through CalDigit Thunderbolt docks.

Going to plug the array in directly and bypass the dock. I’ll let it scrub another 50%, and see what happens.
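
For reference, the scrub-and-recheck cycle I'm describing is roughly the following (exact invocations from memory):

```
# Scrub half of the array, verifying data blocks against parity
snapraid -p 50 scrub

# Re-check only the files the scrub flagged as bad,
# before deciding whether anything actually needs fixing
snapraid -e check
```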
