Restic fails after 2 days

I can’t fathom this.
I created a snapshot yesterday, knowing I was going to re-install the OS.
I did the reinstall, and now I’d like my data back.
The snapshot was saved to an attached external hard drive, so I ran the command # restic -r /media/path/to/drive/repository restore snapshotnumber --target /home/saywot/Desktop

I entered the password (it was correct), and then literally hundreds of errors ending in “ciphertext verification failed” were thrown up.

All the hardware is new(ish) and unchanged, the restic version wouldn’t have changed, and the OS (Fedora Linux 30) is the same.

So my question is: how do I get the data back if restic fails on every file in the snapshot?

But to be more precise, the errors look like this:
ignoring error for /home/saywot/Videos/Video s2/A_Videos02e01.mp4: decrypting blob 59ceb49bcf6dda535b12f23816b05e25a718db8b47f303897d06a79ae7c0d157 failed: ciphertext verification failed
I don’t think this is going to work for me at all

Hey, welcome to the forum!

I’m sorry that restic does not work for you, but it looks like there’s a hardware fault. What restic tells you is that the data written to the external hard drive was modified, so the signature does not match. Restic encrypts and signs all data it writes, and the signature is checked before it uses any data read from the repository.

You can try to find out what’s going on by running restic check --read-data, which will read and verify all the data stored on the external hard drive. I suspect you will see the same (or similar) errors, but it may help us to pin down the issue.

The issue could have happened on different levels:

  • The data was modified in memory while restic was running during backup
  • The data was modified in memory before the file system wrote it to the disk
  • The hard drive is faulty: the data read back is not the same as the data written to it
  • The data was modified in memory after reading it from the hard disk

I think it’s likely that either the memory in your machine or the hard drive used to store the data is faulty. I’m sorry if that’s not what you wanted to hear, but restic has surfaced these kinds of problems several times in the past.

I’d do three things:

  • Run memtest on the machine for several hours
  • Run sha256sum on the files in the restic repo; each file’s name must match the sha256sum output. If a name does not match, it’s likely an issue with the hard disk
  • Run restic check --read-data so we can get an idea of which layer the corruption happened at.
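The filename check in the second step can be scripted. Here’s a minimal sketch of the idea, run against a scratch directory with one intact and one deliberately corrupted file; to use it for real, point `DATA_DIR` at your repository’s `data/` directory (restic names those files after the SHA-256 of their contents):

```shell
# Scratch directory stands in for the repo's data/ directory.
DATA_DIR=$(mktemp -d)

# A healthy file: its name is the SHA-256 of its contents.
printf 'restic' > "$DATA_DIR/tmpfile"
good=$(sha256sum "$DATA_DIR/tmpfile" | cut -d' ' -f1)
mv "$DATA_DIR/tmpfile" "$DATA_DIR/$good"

# A "corrupted" file: its name (64 zeros) does not match its contents.
bad_name=$(printf '%064d' 0)
printf 'corrupted' > "$DATA_DIR/$bad_name"

# Report every file whose name differs from its content hash.
bad_count=0
for f in "$DATA_DIR"/*; do
  name=$(basename "$f")
  hash=$(sha256sum "$f" | cut -d' ' -f1)
  if [ "$name" != "$hash" ]; then
    echo "MISMATCH: $name"
    bad_count=$((bad_count + 1))
  fi
done
echo "mismatched files: $bad_count"
```

On a healthy repository this should report zero mismatches (apart from the `config` file, which is the one file not named after its hash).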

Good luck! And please report back if you have any results :slight_smile:


Before I shut the machine down for a few hours: did you note that the last snapshot (of 12) was created yesterday? It seems unlikely that the HDD is faulty (S.M.A.R.T. says it’s good). And do you think restic needs all of the 32 GB of RAM?

In the past restic has surfaced many system faults, so I would not dismiss the information @fd0 has provided.
There are many ways this could have failed, as he laid out above.

Unlikely, but still possible.

This isn’t relevant; if the memory is faulty, the question is whether restic used any of the faulty regions.

I’d just like to note that S.M.A.R.T. can be very unreliable, in my experience and in reports I’ve read.

SMART is very good at showing that there are problems, but it cannot reliably show that there aren’t problems. (In other words, the only time you can really trust it is when it says that something is wrong.)


So I’ve run restic check --read-data since my last log-in.
There are hundreds of ciphertext mismatches:
wanted ####### got ######## by the bucketload.
So whether my computer’s HDD is jiggered, the storage device has suddenly gone bad, or all the memory is faulty, that still doesn’t answer my question:
I need to recover my data.
Restic never presented itself to me as “a back-up solution (but only if your hardware is 100% new and faultless)”.
If I’d known it was risky, I’d have just copied the directories onto the HDD.

Any backup tool is going to depend on your storage media and RAM not being faulty. “100% new” is unhelpful hyperbole; nobody claims that hardware has to be new.

But it does have to be free of uncorrectable faults. This applies no matter what software you are running.

And you’d likely have the same problem with this approach, or even with any other backup tool.

You can’t blame software for malfunctioning in the presence of hardware errors. The software can only run as well as the hardware allows it.

There are two important steps you missed in a proper backup procedure, and this applies to all backup systems, not just restic:

  • You need to regularly test your backups by restoring them on a different system. restic check --read-data is also good, but it only tells you that what restic has stored is intact; it doesn’t tell you whether restic stored the right data to begin with.
  • The standard 3-2-1 backup process has you keep two copies on-site and an additional copy off-site. You should periodically test all copies. Betting your backups on a single disk is risky, as you have found out.
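The “restore and compare” part of that first step can be as simple as a recursive diff. This sketch simulates it with scratch directories; in practice `SRC` would be your live data and `RESTORED` the output of a `restic restore ... --target "$RESTORED"` run (paths here are hypothetical):

```shell
# Scratch directories stand in for the live data and the restored copy.
SRC=$(mktemp -d)
RESTORED=$(mktemp -d)

mkdir -p "$SRC/docs" "$RESTORED/docs"
printf 'report v1' > "$SRC/docs/report.txt"
printf 'report v1' > "$RESTORED/docs/report.txt"

# Identical trees: diff -qr exits 0 and prints nothing.
if diff -qr "$SRC" "$RESTORED" > /dev/null; then
  result1=match
else
  result1=mismatch
fi

# Corrupt one restored file and check again: diff -qr now exits non-zero.
printf 'report v1 (bit flip)' > "$RESTORED/docs/report.txt"
if diff -qr "$SRC" "$RESTORED" > /dev/null; then
  result2=match
else
  result2=mismatch
fi

echo "before corruption: $result1, after corruption: $result2"
```

Anything other than a clean exit on the real restored tree means the backup or restore path mangled something.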

I know that none of this is what you want to hear and I’m not trying to “shame the victim.” However, putting the blame on restic when either your RAM or storage hardware appears to be faulty is neither productive nor accurate. It gives you a target for your anger and frustration right now, but it won’t help you avoid this problem in the future.


I get that you’re probably frustrated right now, but this has nothing to do with restic.
If you have hardware that is faulty this could’ve happened as well if you “just” copied the data to an external drive.
I 100% agree with @cdhowie’s post.

Well said @cdhowie.

I’m new to restic, but I still check my backup once in a while.

And when I know I’m going to wipe my system I’m extra careful. I do a regular backup, I make a copy with rsync to an external drive and I check that both are valid. Having both never failed me so far.

Phew !

I think 6 hours of running Memtest with no errors is enough, it’s at pass #Gazillion or something like that.

So now I’m going to have a crack at moving a few large files off the install HDD and see whether they maintain integrity upon return, then do the same for the backup drive.

That’s a good plan – I would also suggest moving the restic repository (and everything else) off of the backup drive, then totally filling it up with some data that you can compare with e.g. diff -qr or a checksum utility. It’s possible that any damage on that HDD could exist only where the restic repository now lives.
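One way to do the fill-and-verify test is with a checksum manifest rather than `diff -qr`, so you don’t need to keep a second copy around. A minimal sketch, with a temp directory standing in for the drive’s mount point (`TARGET` is hypothetical; in practice you’d write enough data to fill the drive):

```shell
# TARGET would be the external drive's mount point; a temp dir stands in here.
TARGET=$(mktemp -d)

# Write some known test data.
mkdir -p "$TARGET/testdata"
for i in 1 2 3; do
  printf 'block %s\n' "$i" > "$TARGET/testdata/file$i"
done

# Record checksums of everything written.
( cd "$TARGET" && find testdata -type f -exec sha256sum {} + > manifest.sha256 )

# Later (ideally days later), re-verify: sha256sum -c exits non-zero on any mismatch.
if ( cd "$TARGET" && sha256sum -c --quiet manifest.sha256 ); then
  verify_result=intact
else
  verify_result=corrupt
fi
echo "verification: $verify_result"
```

Leaving a gap between writing the manifest and re-checking is what catches the delayed degradation mentioned above.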

Note that it’s also possible that the degradation happens over time and not immediately. You might want to leave the data on there for a few days and then check again, to rule that out as well.

Can you please try running sha256sum on a few of the files on the disk containing the repository? It would be helpful to know if the hash is equal to the file name for almost all or just a few files so we can assess the damage and make an educated guess what may have happened…

I’ve written a small program to do that in parallel here: https://github.com/fd0/psha

You can build it with Go >= 1.11 by checking out the repo and then running go build. You can call it like this: ./psha /path/to/repo. For a working, healthy repository it should only complain about the file config.

If I were you (but I’m not), I would use Btrfs or ZFS. Even with no redundancy on a single drive. The benefit is that all data and metadata extents/blocks are checksummed as they are written. If the checksums differ when reading back, you’ll get an error. An error is generally preferred to silent corruption.

I’m up to hostname “b15”, which means I’ve built (or bought in the case of laptops) 15 computers. And at least that many work laptops/desktops.

I can tell you from experience, the newness and/or relative horsepower of hardware has nothing to do with whether or not it can or will corrupt data.

There are several things that you as a user can do to improve the reliability of most desktop setups, and even laptops:

  • Use ECC RAM. If your system doesn’t support it, make a note to use it next time. As RAM capacities grow exponentially, the odds of cosmic rays flipping bits and corrupting data grow too.
  • Store your working data and backups on Btrfs or ZFS redundant arrays. Even if working on a laptop.
    • Both Btrfs and ZFS work great on mirrored SSDs and NVMe. (Or even single SSD/NVMe, to give you silent corruption detection.) Neither is quite as fast as ext4 or NTFS, but still significantly faster than HDDs.
    • Since both are complex software and occasionally have bugs, it’s not a bad idea to use both filesystems on different machines: one for working data on a laptop, the other for local backups on a server. I also stagger my Linux updates to reduce the odds of a bad update killing all my data at once.
  • For each desktop/server, split the drives (even if just two) across more than one storage controller. Storage controllers are common sources of silent corruption. (It doesn’t happen often, but when it does, it’s often a tossup between memory, storage controller, or drive.)
    • For example, two SAS controllers with SATA fanout cables - or for a cheaper solution, one on-board SATA controller and one cheap PCIe SATA card.
    • I even do this with external chassis. For example, one server has three 5-slot external chassis connected, and each 3-way mirror spans all three. One chassis is plugged into a dedicated USB 3.0 card, the second into a second dedicated USB 3.0 card, and the third into a dedicated eSATA card. That way I can lose two whole controllers, or two whole chassis, or even have a catastrophic driver problem, and not lose data.
  • Make sure the drives are different brands or models, or, less ideally, batches of the same model widely separated in time. This goes for SSDs, and especially for HDDs.

Good luck.