I recently ran restic check --read-data, which returned:
pack 9d6e0cd775ff0f962bf02e3a983cfcd4512967181b0cad40149c1660428c849a contains 1 errors: [blob 1920b1ede259c5c5f9d433e14efa1fb2f054e8c28bb7500c04c56af11c3a277b: read blob <data/1920b1ed> from 9d6e0cd7: wrong data returned, hash is c288f3a5dbff81c3a297646a4a1c21142f494499a79113fc2397bf8af7e6cd8e]
I’m not asking for help fixing this error; the output tells me which steps are needed. Rather, I’m trying to figure out where it came from.
In particular, I’m using a checksummed filesystem, BCacheFS. If it was due to on-disk corruption, I’d expect the kernel/FS to complain. If I stored the checksum incorrectly at the time of backup, then I’d expect Restic to have told me about it then. If it is caused by a miscalculated checksum, or uncorrected data corruption in transmission (from my NAS), I’d expect it to not report an error on a subsequent check of that same data. Currently I’m rerunning the check, but it seems unlikely that it will come back clean.
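For reference, the re-check I’m running looks roughly like this (the repository URL below is just a placeholder, not my actual one):

    restic -r sftp:me@nas:/srv/restic-repo check --read-data
    # a faster spot-check of e.g. a tenth of the data would be:
    # restic -r sftp:me@nas:/srv/restic-repo check --read-data-subset=10%

--read-data reads back and verifies every pack file, which is why it takes so long on a repo this size.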
No matter what the issue is, I need to know the source if I am to address it.
edit: As an aside, I’ve already checked SMART. Also, I realize BCacheFS is an experimental FS. That’s largely why I think it’s so important to identify the cause in this case, so that I can report the issue to the developer if need be.
edit2: I can’t say for sure, but the most likely possibility is that this is from that one bad version of Restic that was causing data corruption. I had the same issue on Arch, but Arch moves much faster than NixOS, which is what the systems involved this time are running.
When backing up, restic reads your files, encrypts them, creates hashes, and so on, and then creates files in your repository. Every file there is named after its checksum. Between your source files and the files in the repository, there are a lot of places where data could have been garbled in some way. Suspects:
Host1 RAM.
Host1 CPU.
Host2 RAM.
Host2 CPU.
In all of these components (and maybe some more), data passes through without a checksum, and corruption could happen without any piece of software noticing it.
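As a rough illustration of the "named after its checksum" part: if you can reach the repository as a plain directory, you can compare each pack file’s name against the SHA-256 of its contents (the mount point below is just an example):

    # sketch: report pack files whose content hash no longer matches their name
    cd /mnt/nas/restic-repo/data
    for f in */*; do
        name=$(basename "$f")
        sum=$(sha256sum "$f" | cut -d' ' -f1)
        [ "$name" = "$sum" ] || echo "mismatch: $f"
    done

Note that this only catches corruption that happened after the file was written to the repo. If the data was already garbled in RAM or in the CPU before restic hashed it, the name and the contents still agree, which is exactly why those components are on the suspect list.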
That’s correct. I run restic on my desktop with restic -r sftp://....
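Concretely, the invocation is something along these lines (host and paths here are made up):

    restic -r sftp:me@nas:/srv/restic-repo backup /home/me

So the data is read, chunked, and hashed on the desktop (Host1) and only then sent to the NAS (Host2) over SFTP.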
I’ve had CPU issues in the past (I’m on my fourth Raptor Lake CPU), so if I had to guess, I’d assume the issue is there.
I thought that in recent versions of Restic, after backing data up, it checks that data to make sure it was stored correctly? If that’s the case, and a checksum was stored incorrectly or the wrong data was stored, it would have been caught at the time of backup, yes?
Is it incorrect for me to assume that at some point the correct data and checksum existed in the repo?
I guess so, since that would be the entire purpose.
If 1) restic performs a read-back when backing up AND 2) that read-back was successful, then it is guaranteed that the correct data was on disk in the repo. But, as written above, I don’t know whether a read-back is performed.
Okay, that helps A LOT! I’m on NixOS, and they can be kind of slow to get things into the stable channel. All the errors are in snapshots from around the same time earlier this year, so I’m guessing this is from before they updated to the more recent version of Restic that has this feature. Thanks!
This particular error must have happened on the host creating the backup, or be a bit flip that occurred while checking the data. If it were data corruption on the storage side or during transfer, I’d expect restic to complain about a hash mismatch of the pack file itself. But there’s only an error regarding the blob. So either the error occurred during the check or while assembling the pack file.
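If you want to see which snapshots actually reference the damaged blob (to judge how far back the problem goes), restic can look that up directly, e.g.:

    restic find --blob 1920b1ede259c5c5f9d433e14efa1fb2f054e8c28bb7500c04c56af11c3a277b

That lists the snapshots and file paths that use this blob.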
Only starting from 0.16.4.
Make sure to fix the repository such that check no longer complains about an error.
Currently I’m using 0.17.3, but the affected snapshots are from April, prior to NixOS 24.05, so there’s a good chance I was on 0.16.3 or whichever version it was that was causing the corruption. At some point I changed my system config to track Restic from the unstable branch instead of stable, but I can’t remember when that was (and I don’t keep the Git repo for my .nix files as up to date as I should).
Working on that currently. I used repair packs and repair snapshots --forget. Now I’m just checking it, but it’s over 4 TB stored on spinning rust, so I probably won’t know the results until tomorrow.
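For the record, the sequence was roughly this (the pack ID is the one from the check output above):

    restic repair packs 9d6e0cd775ff0f962bf02e3a983cfcd4512967181b0cad40149c1660428c849a
    restic repair snapshots --forget
    restic check --read-data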
Starting from restic 0.16.0, the snapshot metadata contains the restic version that originally created it. You can run restic cat snapshot <snapshotID> to take a look.
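For example (jq is only there to pull out the one field, which should be called program_version):

    restic cat snapshot <snapshotID> | jq .program_version
    # e.g. "restic 0.16.2"

If the snapshots from April show a version older than the one with the upload verification, that would fit your theory.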
Thanks! I figured that was the case, but it really puts my mind at ease to know this is the result of an already resolved software bug that I already knew about, rather than something new, be it a hardware fault or a new software bug.