I recently ran restic check --read-data, which returned:
pack 9d6e0cd775ff0f962bf02e3a983cfcd4512967181b0cad40149c1660428c849a contains 1 errors: [blob 1920b1ede259c5c5f9d433e14efa1fb2f054e8c28bb7500c04c56af11c3a277b: read blob <data/1920b1ed> from 9d6e0cd7: wrong data returned, hash is c288f3a5dbff81c3a297646a4a1c21142f494499a79113fc2397bf8af7e6cd8e]
I’m not asking for help fixing this error; the output tells me which steps are needed. Rather, I’m trying to figure out where it came from.
In particular, I’m using a checksummed filesystem, BCacheFS. If it was due to on-disk corruption, I’d expect the kernel/FS to complain. If I stored the checksum incorrectly at the time of backup, then I’d expect Restic to have told me about it then. If it is caused by a miscalculated checksum, or uncorrected data corruption in transmission (from my NAS), I’d expect it to not report an error on a subsequent check of that same data. Currently I’m rerunning the check, but it seems unlikely that it will come back clean.
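For reference, the re-check I’m running looks roughly like this (the repository URL below is just a placeholder, not my actual one):

    restic -r sftp:me@nas:/srv/restic-repo check --read-data
    # a faster spot-check of e.g. a tenth of the data would be:
    # restic -r sftp:me@nas:/srv/restic-repo check --read-data-subset=10%

--read-data reads back and verifies every pack file, which is why it takes so long on a repo this size.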
No matter what the issue is, I need to know the source if I am to address it.
edit: As an aside, I’ve already checked SMART. Also, I realize BCacheFS is an experimental FS. That’s largely why I think it’s so important to identify the cause in this case, so that I can report the issue to the developer if need be.
edit2: I can’t say for sure, but the most likely possibility is that this is from that one bad version of Restic that was causing data corruption. I had the same issue on Arch, but Arch moves much faster than NixOS, which is what the systems involved this time are running.
When backing up, restic reads your files, encrypts them, creates hashes, and so on, and then creates files in your repository. Every file there is named after its checksum. Between your source files and the files in the repository, there are a lot of places where data could have been garbled in some way. Suspects:
Host1 RAM.
Host1 CPU.
Host2 RAM.
Host2 CPU.
In all of these components (and maybe some more), data passes through without a checksum, and corruption could happen without any piece of software noticing it.
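As a rough illustration of the "named after its checksum" part: if you can reach the repository as a plain directory, you can compare each pack file’s name against the SHA-256 of its contents (the mount point below is just an example):

    # sketch: report pack files whose content hash no longer matches their name
    cd /mnt/nas/restic-repo/data
    for f in */*; do
        name=$(basename "$f")
        sum=$(sha256sum "$f" | cut -d' ' -f1)
        [ "$name" = "$sum" ] || echo "mismatch: $f"
    done

Note that this only catches corruption that happened after the file was written to the repo. If the data was already garbled in RAM or in the CPU before restic hashed it, the name and the contents still agree, which is exactly why those components are on the suspect list.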
That’s correct. I run restic on my desktop with restic -r sftp://....
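Concretely, the invocation is something along these lines (host and paths here are made up):

    restic -r sftp:me@nas:/srv/restic-repo backup /home/me

So the data is read, chunked, and hashed on the desktop (Host1) and only then sent to the NAS (Host2) over SFTP.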
I’ve had CPU issues in the past (I’m on my fourth Raptor Lake CPU), so if I had to guess, I’d assume the issue is there.
I thought that in recent versions of Restic, after backing data up, it checks that data to make sure it was stored correctly? If that’s the case, and a checksum was stored incorrectly or the wrong data was stored, it would have been caught at the time of backup, yes?
Is it incorrect for me to assume that at some point the correct data and checksum existed in the repo?
I guess so, since that would be the entire purpose.
If 1) restic performs a read-back when backing up AND 2) that read-back was successful, then it is guaranteed that the correct data was on disk in the repo. But, as written above, I don’t know whether a read-back is performed.
Okay, that helps A LOT! I’m on NixOS, and they can be kind of slow to get things into the stable channel. All the errors are in snapshots from around the same time earlier this year, so I’m guessing this is from before they updated to the more recent version of Restic that has this feature. Thanks!
This particular error must have happened on the host creating the backup, or be a bit flip that occurred while checking the data. If it were data corruption on the storage side or during transfer, I’d expect restic to complain about a hash mismatch of the pack file itself. But there’s only an error regarding the blob. So either the error occurred during the check or while assembling the pack file.
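If you want to see which snapshots actually reference the damaged blob (to judge how far back the problem goes), restic can look that up directly, e.g.:

    restic find --blob 1920b1ede259c5c5f9d433e14efa1fb2f054e8c28bb7500c04c56af11c3a277b

That lists the snapshots and file paths that use this blob.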
Only starting from 0.16.4.
Make sure to fix the repository such that check no longer complains about an error.
Currently I’m using 0.17.3, but the affected snapshots are from April, prior to NixOS 24.05, so there’s a good chance I was on 0.16.3 or whichever version it was that was causing the corruption. At some point I changed my system config to track Restic from the unstable branch instead of stable, but I can’t remember when that was (and I don’t keep the Git repo for my .nix files as up to date as I should).
Working on that currently. I used repair packs and repair snapshots --forget. Now I’m just checking it, but it’s over 4 TB stored on spinning rust, so I probably won’t know the results until tomorrow.
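For the record, the sequence was roughly this (the pack ID is the one from the check output above):

    restic repair packs 9d6e0cd775ff0f962bf02e3a983cfcd4512967181b0cad40149c1660428c849a
    restic repair snapshots --forget
    restic check --read-data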
Starting from restic 0.16.0, the snapshot metadata contains the restic version that originally created it. You can run restic cat snapshot <snapshotID> to take a look.
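For example (jq is only there to pull out the one field, which should be called program_version):

    restic cat snapshot <snapshotID> | jq .program_version
    # e.g. "restic 0.16.2"

If the snapshots from April show a version older than the one with the upload verification, that would fit your theory.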
Thanks! I figured that was the case, but it really puts my mind at ease to know this is the result of an already resolved software bug that I already knew about, rather than something new, be it a hardware fault or a new software bug.