Unexpected repo corruption

So I have my grandpa’s machine backing up to my repository every Sunday night. He pays it no mind whatsoever, and will shut the lid and put the machine to sleep in the middle of a backup. This has never been a problem before - I’ll just see that there’s a lock and wait for him to use his computer again so the backup can finish.

Today was different, however. I went to do a diff to make sure his pictures were backing up from his phone (they sync via Dropbox, and then a script moves the older ones out of Dropbox and backs them up with restic). Instead, I got this:

repository 24344e18 opened (version 2, compression level max)
comparing snapshot 1925d32e to 960bbec0:
id 88fc2b342254438ec7e91325ed6f1f330087f625628f6778d19c7ef38930e2c0 not found in repository
github.com/restic/restic/internal/repository.(*Repository).LoadBlob
github.com/restic/restic/internal/repository/repository.go:274
github.com/restic/restic/internal/restic.LoadTree
github.com/restic/restic/internal/restic/tree.go:113
main.(*Comparer).diffTree
github.com/restic/restic/cmd/restic/cmd_diff.go:240
main.runDiff
github.com/restic/restic/cmd/restic/cmd_diff.go:408
main.glob..func8
github.com/restic/restic/cmd/restic/cmd_diff.go:39
github.com/spf13/cobra.(*Command).execute
github.com/spf13/cobra@v1.6.1/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/cobra@v1.6.1/command.go:1044
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/cobra@v1.6.1/command.go:968
github.com/spf13/cobra.(*Command).ExecuteContext
github.com/spf13/cobra@v1.6.1/command.go:961
main.main
github.com/restic/restic/cmd/restic/main.go:96
runtime.main
runtime/proc.go:250
runtime.goexit
runtime/asm_amd64.s:1594

I did a “check” and it said:

Fatal: unable to create lock in backend: repository is already locked by PID 25808 on Herb-PC by HERB-PC\herbp (UID 0, GID 0)
lock was created at 2023-02-20 00:35:24 (90h55m4.8375569s ago)

I did a diff between the two snapshots BEFORE the latest one, and it worked fine. It seems as if only the last snapshot is corrupted. I have not unlocked the repo and run a check, because the backup may well finish the next time he uses his computer. He only uses it a couple of times a week.

It was my understanding that restic only writes the snapshot file after everything else is complete, so… shouldn’t this be impossible?

I’m running restic 0.15.1 compiled with go1.19.5 on darwin/amd64 and the repo is on B2.
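For anyone following along, the lock state can be inspected without removing anything - these are standard restic commands, with the repository URL being a placeholder for my actual B2 bucket:

```shell
# List the lock files currently in the repository (read-only; repo URL is a placeholder).
restic -r b2:my-bucket:restic list locks

# Only once you're sure no backup is actually running, stale locks
# can be removed and an integrity check performed:
restic -r b2:my-bucket:restic unlock
restic -r b2:my-bucket:restic check
```

In my case I deliberately left the lock in place, since unlocking mid-backup is exactly what I wanted to avoid.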

Yes, restic only writes the snapshot at the end, after it has settled all the data into the repository.

I’m not sure what you are asking. Are you saying that one of the snapshots you are diffing should not exist when there is a lock? If so, why?

I think you need to clarify the details a bit. E.g. show the list of snapshots involved, etc.

So I waited a few days and his machine did try to finish the snapshot, but the log showed this:

using parent snapshot 960bbec0
creating VSS snapshot for [c:\]
error: tree 88fc2b342254438ec7e91325ed6f1f330087f625628f6778d19c7ef38930e2c0 is not known; the repository could be damaged, run `rebuild-index` to try to repair it
successfully created snapshot for [c:\]

Files:       13853 new,     0 changed,     0 unmodified
Dirs:         1177 new,     0 changed,     0 unmodified
Added to the repository: 103.106 MiB (73.218 MiB stored)

processed 13853 files, 46.070 GiB in 16:03
snapshot 964d528f saved
Warning: at least one source file could not be read

I ran rebuild-index, then ran a build of the “repair” PR. It recommended deleting 960bbec0, so I did. Everything checks out now, including the latest snapshot 964d528f.
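For the record, the recovery sequence was roughly the following - the repo URL is a placeholder, and the snapshot deletion is shown with the standard `forget` command rather than the unmerged repair PR I actually used:

```shell
# Rebuild the repository index from the pack files actually present in the backend.
restic -r b2:my-bucket:restic rebuild-index

# Remove the damaged snapshot (its tree blob was missing from the repository).
restic -r b2:my-bucket:restic forget 960bbec0

# Verify repository consistency afterwards.
restic -r b2:my-bucket:restic check
```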

So I don’t think 960bbec0’s snapshot file existed while that backup was still running - the later backup actually used it as the parent, which suggests it was only written at the end.

I guess the only question that remains is… why did 960bbec0 get corrupted in the first place? But that would be a hard thing to track down at this point, so we can probably let it drop.

My guess is that his machine went to sleep in the middle of saving the snapshot, at the very end, and that’s what corrupted it. But I would have thought B2 would treat that as a partial upload, and restic would never have seen it. So I really have no idea. :man_shrugging:

Also here’s the backup log from the corrupted (now deleted) snapshot:

using parent snapshot 1925d32e
creating VSS snapshot for [c:\]
successfully created snapshot for [c:\]
b2_download_file_by_name: 404: : b2.b2err

Files:          21 new,    34 changed, 13801 unmodified
Dirs:            4 new,    62 changed,  1113 unmodified
Added to the repository: 67.566 MiB (54.110 MiB stored)

processed 13856 files, 46.045 GiB in 441:11:20
snapshot 960bbec0 saved

I probably need to switch him to the S3 protocol instead, but aside from this one snapshot it has been working fine. He lives 3,000 miles away from me, and the remote connection I had set up has failed, so it’s a little difficult to troubleshoot. I’ll see how it does next week.
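If I do switch to the S3 protocol, restic supports B2’s S3-compatible endpoint directly - something like the following, where the bucket name, regional endpoint, and key values are placeholders for my setup:

```shell
# B2 application key credentials, exported under the standard S3 variable names.
export AWS_ACCESS_KEY_ID=<b2-key-id>
export AWS_SECRET_ACCESS_KEY=<b2-application-key>

# The endpoint hostname depends on the bucket's region; us-west-002 is just an example.
restic -r s3:s3.us-west-002.backblazeb2.com/my-bucket snapshots
```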