Invalid data returned ... but healthy repo

OS: Windows 10
restic: version 0.14.0

This is from a Python script that runs “restic backup” against two local repos, one on an internal HD and the other on an external HD. I’ve just discovered that the internal HD repo has been failing for the past 7 hours; no snapshots have been created over this period. The repo on the external HD is working fine.
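
For context, a stripped-down sketch of what the script does (the external-HD path, password file and source folder below are placeholders, not the real ones, and the actual script does more logging):

import subprocess
import sys

# The two repositories the script backs up to.
REPOS = [
    r"E:\Backups\restic\My documents",    # internal HD (the repo that is now failing)
    r"F:\Backups\restic\My documents",    # external HD (placeholder path, working fine)
]
PASSWORD_FILE = r"D:\restic-password.txt"  # placeholder for the real password file
SOURCE = r"C:\Users\me\Documents"          # placeholder for the folder being backed up

for repo in REPOS:
    result = subprocess.run(
        ["restic", "-r", repo, "-p", PASSWORD_FILE, "backup", SOURCE],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # a failing run reports the captured stderr, framed as "stderr |...|"
        print("stderr |" + result.stderr + "|", file=sys.stderr)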

This is the stderr output from the process:
stderr |Error loading snapshot e717c141: load <snapshot/e717c1417c>: invalid data returned
github.com/restic/restic/internal/restic.FindLatestSnapshot.func1
    /restic/internal/restic/snapshot_find.go:40
github.com/restic/restic/internal/restic.ForAllSnapshots.func2
    /restic/internal/restic/snapshot.go:122
golang.org/x/sync/errgroup.(*Group).Go.func1
    /home/build/go/pkg/mod/golang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594
|

Every time the script runs it fails in the same way, complaining about the same offending snapshot, e717c1417c.

Running “restic check” on the offending repo reports “no errors were found”.

I tried “restic forget”:

>restic -r "E:\Backups\restic\My documents" --verbose -p "D:\..." forget e717c1417c
Ignoring "e717c1417c4298792fa21938accb913e80c3a30ad631c95429ba2963d5901284", could not load snapshot: load <snapshot/e717c1417c>: invalid data returned

What’s the best thing to do here?

NB: It seems slightly strange that this backup process involves “loading” a snapshot… or could this be a reference to the new snapshot? That seems unlikely, given the repeated complaints about e717c1417c each time. So why should this prevent executing a new backup on what is reported as a healthy repo…?

Obviously a new snapshot needs to reference the repo’s existing blobs, but I’d have thought that if one snapshot turned out to be invalid while the overall repo was healthy, restic would simply move on to another “reference snapshot” and merely issue a warning about the dud one. Not that I’m claiming to understand anything about the nuts and bolts of this magnificent creation!

restic has to inspect all snapshots while looking for a parent snapshot for the new backup.

It would be possible to handle this error case more gracefully, but I’m not sure how much that’s worth. If an HDD breaks snapshots, then it might also break the whole backup. Another downside of adding such a special case is that it complicates the code, which has to be weighed against the potential benefit.
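
In the meantime, a possible way to unblock the backups (untested, and it is an assumption on my part that this sidesteps the failing lookup) would be to skip the automatic parent-snapshot search, either by pinning a known-good parent or by forcing a full re-read. In the Python script that could look roughly like:

import subprocess

REPO = r"E:\Backups\restic\My documents"   # the affected repository
PASSWORD_FILE = r"D:\restic-password.txt"  # placeholder
SOURCE = r"C:\Users\me\Documents"          # placeholder
KNOWN_GOOD_PARENT = ""                     # e.g. the ID of an older, readable snapshot

# "--parent" and "--force" are standard "restic backup" options; whether they
# avoid loading the damaged snapshot during the parent search is an assumption.
cmd = ["restic", "-r", REPO, "-p", PASSWORD_FILE, "backup", SOURCE]
if KNOWN_GOOD_PARENT:
    cmd += ["--parent", KNOWN_GOOD_PARENT]  # pin the parent explicitly
else:
    cmd += ["--force"]                      # re-read everything, no parent lookup
subprocess.run(cmd, check=True)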

That’s rather unexpected. What size is the snapshot file “e717c1417c4298792fa21938accb913e80c3a30ad631c95429ba2963d5901284” on the filesystem?
In a quick test with a damaged snapshot, check behaves as expected for me.

You can manually delete the file named “e717c1417c4298792fa21938accb913e80c3a30ad631c95429ba2963d5901284” from the repository’s snapshots folder. But please keep a copy of that file first, as it might still be useful for debugging why check didn’t report an error.
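
Scripted, the copy-then-delete could look roughly like this (paths other than the repo are placeholders; in the default repository layout each snapshot is stored as <repo>\snapshots\<full snapshot ID>):

import os
import shutil

REPO = r"E:\Backups\restic\My documents"  # the affected repository
DEBUG_DIR = r"D:\restic-debug"            # placeholder: wherever you want to keep the copy
SNAP_ID = "e717c1417c4298792fa21938accb913e80c3a30ad631c95429ba2963d5901284"

snap_file = os.path.join(REPO, "snapshots", SNAP_ID)
os.makedirs(DEBUG_DIR, exist_ok=True)
shutil.copy2(snap_file, os.path.join(DEBUG_DIR, SNAP_ID))  # keep the copy first
os.remove(snap_file)                                       # then remove it from the repo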

Thanks. I deleted that file and normal operation resumed: no errors.

I also checked with “restic snapshots” and then did a “restic restore” of one of the snapshots created: all good.

The offending file is 319 bytes. It is binary, not text. Sent as attachment to Alex.

I cannot offer support via email. The file is encrypted with a key that nobody besides you has. The only thing I can see is that the SHA256 hash matches the file name.
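
(That check is nothing more than hashing the file contents and comparing the result with the file name, e.g.:)

import hashlib

SNAP_ID = "e717c1417c4298792fa21938accb913e80c3a30ad631c95429ba2963d5901284"

with open(SNAP_ID, "rb") as f:  # the copy of the snapshot file
    digest = hashlib.sha256(f.read()).hexdigest()

# True means the file on disk was intact, so the "invalid data" must have
# appeared somewhere between the disk and restic (cache, RAM, ...).
print(digest == SNAP_ID)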

Sorry, misunderstanding: I thought you wanted to see the file.

Now I’m confused. That looks like there’s something very wrong with the host on which the backups failed. Either some filesystem cache returns garbage or the hardware is causing bitflips.

Interesting to get your view on the matter.

This is an internal hard drive, as I say, so the “host” is the Windows 10 machine. This repo has been operational for over a year, and to my knowledge such an error has never occurred before: I have arranged things so that any restic error results in an error file (icon) appearing on my desktop. I also check these repos each week by doing partial restores.