Requesting strategy check for recovery from bad repository

I run restic backups approximately twice daily across 6 repositories, which together encompass most of my /home directory. Weekly I run a prune and a “check --read-data” on all 6. When that succeeds, I rsync the repositories to a second HD stored apart from my regular backup drive.
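
For concreteness, the weekly part boils down to roughly this per repository (the rsync destination below is a placeholder, not my actual mount point; the repo and password paths are the real ones shown further down):

restic -p /home/tomc/Dropbox/ssap -r /media/tomc/seagate-4-b/sys76-backup/bk-hid3a prune
restic -p /home/tomc/Dropbox/ssap -r /media/tomc/seagate-4-b/sys76-backup/bk-hid3a check --read-data
# only when the check succeeds:
rsync -a --delete /media/tomc/seagate-4-b/sys76-backup/bk-hid3a/ /media/tomc/backup-hd2/bk-hid3a/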

It is not uncommon for some kind of error to turn up on one particular repository of approximately 60 GiB, but when I rerun the “check” the error is usually gone. This week, the identical error remained when I ran the check again:

tomc@tomc-Galago-Pro:~/programs/restic-backup/restic-backup-scripts$ restic -p /home/tomc/Dropbox/ssap -r /media/tomc/seagate-4-b/sys76-backup/bk-hid3a check --read-data
using temporary cache in /tmp/restic-check-cache-048582957
repository 02047da4 opened successfully, password is correct
created new cache in /tmp/restic-check-cache-048582957
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
error for tree 9467f2a2:
  decrypting blob 9467f2a2715355d728bcb799b8aa6e1a4160f3b14640cbd69fb198b2f3226aea failed: ciphertext verification failed
[2:36] 100.00%  12 / 12 snapshots
read all data
Pack ID does not match, want 9f49600a, got de4e4f39
[27:57] 100.00%  11448 / 11448 packs
Fatal: repository contains errors
tomc@tomc-Galago-Pro:

My only thought about recovery from this is to do the following (sketched in commands after the list):

  1. manually delete the repository.
  2. copy the off-site week-old backup copy of this repository back to my main backup HD
  3. run my usual backup of the repository to update it
  4. run a prune and “check --read-data” on the result and hope that this time all is well.
  5. if so, then make my weekly copy back to backup HD#2
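
In commands, roughly (the off-site copy location and the backup source are placeholders; the repo and password paths are the ones from the check output above):

rm -rf /media/tomc/seagate-4-b/sys76-backup/bk-hid3a
rsync -a /media/tomc/backup-hd2/bk-hid3a/ /media/tomc/seagate-4-b/sys76-backup/bk-hid3a/
restic -p /home/tomc/Dropbox/ssap -r /media/tomc/seagate-4-b/sys76-backup/bk-hid3a backup <my usual /home paths>
restic -p /home/tomc/Dropbox/ssap -r /media/tomc/seagate-4-b/sys76-backup/bk-hid3a prune
restic -p /home/tomc/Dropbox/ssap -r /media/tomc/seagate-4-b/sys76-backup/bk-hid3a check --read-data
# and, if all is well, rsync back out to HD#2 as usual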

Does this make sense? Is there a better solution?

My environment particulars:
restic 0.12.1 compiled with go1.16.6 on linux/amd64
System76 Galago Pro (galp5) laptop
RAM: 16 GB Dual Channel DDR4 at 3200 MHz
Storage: 500 GiB SSD
Operating System: Kubuntu 20.04.3 LTS
Kernel Version: 5.16.11-76051611-generic

If I understand correctly, you sometimes get different results when reading 60 GiB of data several times.
That’s about 5e11 bits and one single bit error would be enough to get a bad result.
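(60 GiB = 60 × 2^30 bytes ≈ 6.4e10 bytes, and at 8 bits per byte that is ≈ 5.2e11 bits.)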

I don’t know what bit error rates current notebook hardware is supposed to have, but I wouldn’t be surprised if this was still inside the specs. I assume you are not using ECC RAM?

Since you can get valid results when reading again, I’d assume that the data on the SSD is correct.

The problem I see is that even if you have a successful check run, there might still be bit errors during the read while you copy the repo to HD#2.

I’m not sure how this could be improved with better restic magic, but one thing you could try is to make sure your laptop gets enough air to cool itself during these runs (don’t put it on the couch), and to check whether it gets quite warm during the operation - electronic noise generally increases with temperature, and so would the probability of bit errors.
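
Coming back to the copy on HD#2: one way to at least catch read errors there would be to run a check against the copy itself once it exists, e.g. (the path is a placeholder for wherever your copy lives):

restic -p /home/tomc/Dropbox/ssap -r /media/tomc/backup-hd2/bk-hid3a check --read-data

Of course that read goes through the same RAM again, so it only tells you the copy was readable correctly at least once.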

Thanks for your response!

I don’t know if my laptop has ECC RAM - I suspect not, since it is a consumer laptop and not set up as a server. I gave the specs that I know in my original post - “16 GB Dual Channel DDR4 at 3200 MHz”, but that may not provide enough information to answer the question.

As for laptop overheating - it rests on an elevated laptop support with substantial ventilation holes - I doubt that I could improve that setup.

It’s a fascinating realization for me too (as I also said before): I think as a standard computer user one rarely has a chance to notice the limitations of hardware. Either systems auto-correct (like some file systems, and RAM on pro systems) or you simply never notice single bit errors in the heaps of data you have flying around. And then restic comes along and takes every bit flip very seriously :sweat_smile:

I am laughing. Innocence is bliss…and restic shoves us rudely into the real world. Ouch!

An update on my situation: I have just completed the procedure proposed in my initial post. Having a verified, pruned backup copy from last weekend, I deleted my apparently corrupted current repository, copied the verified backup copy into its place, ran a backup update, a prune, and a “check --read-data”, and all seems well. This current repository will now be rsynced to my backup copy, and life goes on. All I’ve lost is a week of incremental backups, and the only major item involved is that huge hidden “.local” dir in my /home partition. This seems to me like a decent quick fix.

For future reference, the current instructions to repair a repository are at Recover from broken pack file · Issue #828 · restic/restic · GitHub.

This is brilliant and I’ve never seen it before :crazy_face: Thanks for writing and posting this here!

For myself, and others, I thank you for this link. This appears to be a deeply thoughtful analysis of response options. Very much appreciated!

It sounds as if this problem was some kind of normal behaviour we should expect when dealing with large amounts of data.

As long as it’s just bit errors while reading data that has been saved to disk correctly, it’s not a real problem; restic check just makes it look like one.

Maybe this could be improved by adding an option to restic check --read-data that re-reads blocks with errors for a given number of tries and accepts a block as “ok” if it can read it correctly once.
I’d think this is justified, because the odds of reading an erroneous block and accidentally correcting it to the correct data are very close to zero.

One difficulty of the implementation is that we’d have to make sure that the data is really read anew from wherever it’s saved and we don’t just read the same bytes again from a cache (have to flush disk cache before re-reading).
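
Outside of restic, the kind of re-read I mean would look roughly like this on Linux (the pack name is abbreviated, and dropping the page cache needs root):

# a pack failed verification; flush the page cache so the retry really hits the disk
sync
echo 1 | sudo tee /proc/sys/vm/drop_caches > /dev/null
# re-hash the pack and compare against its file name (restic names pack files after the SHA-256 of their contents)
sha256sum /media/tomc/seagate-4-b/sys76-backup/bk-hid3a/data/9f/9f49600a…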

Any opinions on this idea?

Bit flips while reading data often also indicate bit flips happening at less convenient times. If read errors occur somewhat frequently, then there’s also a certain chance that the stored data itself can get damaged.

Have a look at Redownload files with wrong hash by MichaelEischer · Pull Request #3521 · restic/restic · GitHub. That implements a single retry if the read file does not match its hash.
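
Conceptually the retry amounts to something like this (not the PR’s actual code, which is Go inside restic; and for a local file you’d also want to drop the cache first, as discussed above):

verify_with_retry() {
    local file="$1" want="$2" got
    got=$(sha256sum "$file" | cut -d' ' -f1)
    if [ "$got" != "$want" ]; then
        # one retry: read the file again and re-hash it
        got=$(sha256sum "$file" | cut -d' ' -f1)
    fi
    [ "$got" = "$want" ]
}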
