I am using restic to back up several directories daily to remote Hetzner Storage Boxes. The backups are scheduled nightly using resticprofile. I also run daily check --read-data-subset
operations, so that over time most of the repository gets visited for consistency checks. I rarely do full --read-data
operations because the repositories are relatively large: a full check takes about 15 hours on the smaller repo and about 60 hours on the large one, and the sftp backend cannot handle reconnections, so a full run won't succeed at all over such long durations.
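For reference, a minimal sketch of what such a resticprofile setup can look like in TOML. The profile name, repository URL, paths, and schedule values are placeholders, and the read-data-subset key relies on resticprofile's generic pass-through of config keys as restic flags:

[default]
repository = "sftp:uXXXXX@uXXXXX.your-storagebox.de:/backups/repo"
password-file = "/etc/resticprofile/key"

[default.backup]
source = ["/home", "/etc"]
schedule = "daily"

[default.check]
schedule = "daily"
# passed through as --read-data-subset; with a percentage, restic picks a
# random subset each run, so repeated runs cover most of the repo over time
read-data-subset = "10%"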
So far I am really happy with my setup.
A few days ago I came home from a trip and noticed that my server had apparently been powered off by a power outage while I was away and had suffered some data corruption on the file system. Some services could not read their databases, inconsistent or null-byte files were present on my home partition, and there was other damage.
So I decided to restore a backup. I accidentally selected the wrong snapshot to restore first (one taken after the power outage), because I had copied the wrong hash. I quickly realized the mistake because the operation produced a lot of cannot decrypt block xyz - nonce is invalid
errors.
I then ran the restore on the last snapshot before the outage, which successfully gave me a consistent database back for my partition.
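In case it helps others, the restore steps looked roughly like this (the repository URL, snapshot ID, and target path are placeholders):

# list snapshots to find the last one before the outage
restic -r sftp:uXXXXX@uXXXXX.your-storagebox.de:/backups/repo snapshots

# restore that snapshot to a scratch location
restic -r sftp:uXXXXX@uXXXXX.your-storagebox.de:/backups/repo restore 1a2b3c4d --target /mnt/restore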
I concluded that the snapshots taken after the outage, before I came back, were somehow inconsistent for whatever reason. Is it even possible for inconsistent data on the backed-up system to make the snapshot itself inconsistent?
To get rid of those apparently untrustworthy snapshots (three since the one I restored), I ran restic forget
on them.
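Concretely, something like this (the snapshot IDs are placeholders); note that forget only drops the snapshot references, while the underlying data is only deleted by a later prune:

restic -r sftp:uXXXXX@uXXXXX.your-storagebox.de:/backups/repo forget 4d5e6f7a 8b9c0d1e 2f3a4b5c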
Now when I try to prune them, I get this:
repository 58d6205a opened (version 2, compression level auto)
loading indexes...
[0:02] 100.00% 70 / 70 index files loaded
loading all snapshots...
finding data that is still in use for 435 snapshots
[0:02] 7.59% 33 / 435 snapshots
decrypting blob aec55db63d2758aca645b1d8596a69a26bd0ed3c62ffb63742e63dd48cdaa14f failed: nonce is invalid
github.com/restic/restic/internal/repository.(*Repository).LoadBlob
/restic/internal/repository/repository.go:315
github.com/restic/restic/internal/restic.LoadTree
/restic/internal/restic/tree.go:115
github.com/restic/restic/internal/restic.loadTreeWorker
/restic/internal/restic/tree_stream.go:36
github.com/restic/restic/internal/restic.StreamTrees.func1
/restic/internal/restic/tree_stream.go:176
golang.org/x/sync/errgroup.(*Group).Go.func1
/home/build/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1695
I did a full restic check --read-data-subset n/10
pass for every n from 1 to 10 on the repo before I even started restoring, and I did the same again right after I ran into that prune
error. Both times restic reported no errors in the repository.
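For reference, running all ten subsets can be scripted with a simple shell loop like this (the repository URL is a placeholder):

for n in $(seq 1 10); do
    restic -r sftp:uXXXXX@uXXXXX.your-storagebox.de:/backups/repo check --read-data-subset "$n/10"
done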
How do I successfully proceed with pruning from here on?
I found restic failed- nonce is invalid error | it's like awesome you know, which describes the same issue and also presents a possible solution.
However, I struggle to understand what is actually going on with my repo and whether the solution from that link is the right approach, especially since I cannot find similar reports anywhere else.
Thanks in advance.
P.S.: Regarding the aforementioned problem of full --read-data
checks not being feasible on large repos, I want to share the following:
The SSH connection to Hetzner is always cut at some point and cannot recover. I found that this can be solved by switching to restic's rclone backend with rclone over SSH, instead of restic's native sftp backend. The former handles reconnections well. Given that, I will do full --read-data
checks in the future by switching my backend to rclone for all operations.
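For anyone who wants to try the same, a rough sketch (the remote name, host, and user are placeholders; add port or key options as your setup requires):

# one-time: define an rclone remote of type sftp for the storage box
rclone config create hetzner sftp host uXXXXX.your-storagebox.de user uXXXXX

# then point restic at it via the rclone backend
restic -r rclone:hetzner:backups/repo check --read-data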