I am using restic to back up several directories daily to remote Hetzner Storage Boxes. The backups are scheduled nightly using resticprofile. I also run daily check --read-data-subset
operations, so that over time most of the repository gets visited for consistency checks. I rarely do full --read-data
operations because the repositories are relatively large: a full check takes about 15 hours on the smaller repo and about 60 hours on the large one, and the sftp backend cannot handle reconnections, so a full run won't succeed at all over such long durations.
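For reference, a minimal sketch of what such a resticprofile setup can look like in TOML. The profile name, repository URL, paths, and schedule values are placeholders, and the read-data-subset key relies on resticprofile's generic pass-through of config keys as restic flags:

[default]
repository = "sftp:uXXXXX@uXXXXX.your-storagebox.de:/backups/repo"
password-file = "/etc/resticprofile/key"

[default.backup]
source = ["/home", "/etc"]
schedule = "daily"

[default.check]
schedule = "daily"
# passed through as --read-data-subset; with a percentage, restic picks a
# random subset each run, so repeated runs cover most of the repo over time
read-data-subset = "10%"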
So far I am really happy with my setup.
A few days ago I came home from a trip and noticed that my server had apparently been powered off by a power outage while I was away and had suffered some data corruption on the file system. Some services could not read their databases, inconsistent or null-byte files were present on my home partition, and there was other damage.
So I decided to restore a backup. I accidentally selected the wrong snapshot to restore first (one taken after the power outage), because I had copied the wrong hash. I quickly realized the mistake because the operation produced a lot of cannot decrypt block xyz - nonce is invalid
errors.
I then ran the restore on the last snapshot before the outage, which successfully gave me a consistent database back for my partition.
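In case it helps others, the restore steps looked roughly like this (the repository URL, snapshot ID, and target path are placeholders):

# list snapshots to find the last one before the outage
restic -r sftp:uXXXXX@uXXXXX.your-storagebox.de:/backups/repo snapshots

# restore that snapshot to a scratch location
restic -r sftp:uXXXXX@uXXXXX.your-storagebox.de:/backups/repo restore 1a2b3c4d --target /mnt/restore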
I concluded that the snapshots taken after the outage, before I came back, were somehow inconsistent for whatever reason. Is it even possible for inconsistent data on the backed-up system to make the snapshot itself inconsistent?
To get rid of those apparently untrustworthy snapshots (three since the one I restored), I ran restic forget
on them.
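Concretely, something like this (the snapshot IDs are placeholders); note that forget only drops the snapshot references, while the underlying data is only deleted by a later prune:

restic -r sftp:uXXXXX@uXXXXX.your-storagebox.de:/backups/repo forget 4d5e6f7a 8b9c0d1e 2f3a4b5c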
Now when I try to prune them, I get this:
repository 58d6205a opened (version 2, compression level auto)
loading indexes...
[0:02] 100.00% 70 / 70 index files loaded
loading all snapshots...
finding data that is still in use for 435 snapshots
[0:02] 7.59% 33 / 435 snapshots
decrypting blob aec55db63d2758aca645b1d8596a69a26bd0ed3c62ffb63742e63dd48cdaa14f failed: nonce is invalid
github.com/restic/restic/internal/repository.(*Repository).LoadBlob
/restic/internal/repository/repository.go:315
github.com/restic/restic/internal/restic.LoadTree
/restic/internal/restic/tree.go:115
github.com/restic/restic/internal/restic.loadTreeWorker
/restic/internal/restic/tree_stream.go:36
github.com/restic/restic/internal/restic.StreamTrees.func1
/restic/internal/restic/tree_stream.go:176
golang.org/x/sync/errgroup.(*Group).Go.func1
/home/build/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1695
I did a full restic check --read-data-subset n/10
pass for every n from 1 to 10 on the repo before I even started restoring, and I did the same again right after I ran into that prune
error. Both times restic reported no errors in the repository.
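For reference, running all ten subsets can be scripted with a simple shell loop like this (the repository URL is a placeholder):

for n in $(seq 1 10); do
    restic -r sftp:uXXXXX@uXXXXX.your-storagebox.de:/backups/repo check --read-data-subset "$n/10"
done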
How do I successfully proceed with pruning from here on?
I found restic failed- nonce is invalid error | it's like awesome you know, which describes the same issue and also presents a possible solution.
However, I struggle to understand what is actually going on with my repo and whether the solution from that link is the right approach, especially since I cannot find similar reports anywhere else.
Thanks in advance.
P.S.: Regarding the aforementioned problem of full --read-data
checks not being feasible on large repos, I want to share the following:
The SSH connection to Hetzner is always cut at some point and cannot recover. I found that this can be solved by switching to restic's rclone backend with rclone over SSH, instead of restic's native sftp backend. The former handles reconnections well. Given that, I will do full --read-data
checks in the future by switching my backend to rclone for all operations.
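For anyone who wants to try the same, a rough sketch (the remote name, host, and user are placeholders; add port or key options as your setup requires):

# one-time: define an rclone remote of type sftp for the storage box
rclone config create hetzner sftp host uXXXXX.your-storagebox.de user uXXXXX

# then point restic at it via the rclone backend
restic -r rclone:hetzner:backups/repo check --read-data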