Restic fails after 2 days

I can’t fathom this.
I created a snapshot yesterday, knowing I was going to re-install the OS.
I did the reinstall, and now I’d like my data back.
The snapshot was saved to an attached external hard drive, so I ran the command # restic -r /media/path/to/drive/repository restore snapshotnumber --target /home/saywot/Desktop

I entered the password (it was correct), and then literally hundreds of errors ending in “ciphertext verification failed” were thrown up.

All the hardware is new(ish) and unchanged, the restic version wouldn’t have changed, and the OS (Fedora Linux 30) is the same.

So my question is: how do I get the data back if restic fails on every file in the snapshot?

But to be more precise, the errors look like this:
ignoring error for /home/saywot/Videos/Video s2/A_Videos02e01.mp4: decrypting blob 59ceb49bcf6dda535b12f23816b05e25a718db8b47f303897d06a79ae7c0d157 failed: ciphertext verification failed
I don’t think this is going to work for me at all

Hey, welcome to the forum!

I’m sorry that restic does not work for you, but it looks like there’s a hardware fault. What restic tells you is that the data written to the external hard drive was modified, so the signature does not match. Restic encrypts and signs all data it writes, and the signature is checked before it uses any data read from the repository.

You can try to find out what’s going on by running restic check --read-data, which will read and verify all the data stored on the external hard drive. I suspect you will see the same (or similar) errors, but it may help us to pin down the issue.

The issue could have happened on different levels:

  • The data was modified in memory while restic was running during backup
  • The data was modified in memory before the file system wrote it to the disk
  • The hard drive is faulty: the data read back is not the same as the data written to it
  • The data was modified in memory after reading it from the hard disk

I think it’s likely that either the memory in your machine or the hard drive used to store the data is faulty. I’m sorry if that’s not what you wanted to hear, but restic has surfaced these kinds of problems several times in the past.

I’d do three things:

  • Run memtest on the machine for several hours
  • Run sha256sum on the files in the restic repo; each file’s name must match the sha256sum output. If a name does not match, it’s likely an issue with the hard disk
  • Run restic check --read-data so we can get an idea of which layer the corruption happened at.
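The filename check in the second step can be scripted. Here’s a minimal sketch of the idea, run against a scratch directory with one intact and one deliberately corrupted file; to use it for real, point `DATA_DIR` at your repository’s `data/` directory (restic names those files after the SHA-256 of their contents):

```shell
# Scratch directory stands in for the repo's data/ directory.
DATA_DIR=$(mktemp -d)

# A healthy file: its name is the SHA-256 of its contents.
printf 'restic' > "$DATA_DIR/tmpfile"
good=$(sha256sum "$DATA_DIR/tmpfile" | cut -d' ' -f1)
mv "$DATA_DIR/tmpfile" "$DATA_DIR/$good"

# A "corrupted" file: its name (64 zeros) does not match its contents.
bad_name=$(printf '%064d' 0)
printf 'corrupted' > "$DATA_DIR/$bad_name"

# Report every file whose name differs from its content hash.
bad_count=0
for f in "$DATA_DIR"/*; do
  name=$(basename "$f")
  hash=$(sha256sum "$f" | cut -d' ' -f1)
  if [ "$name" != "$hash" ]; then
    echo "MISMATCH: $name"
    bad_count=$((bad_count + 1))
  fi
done
echo "mismatched files: $bad_count"
```

On a healthy repository this should report zero mismatches (apart from the `config` file, which is the one file not named after its hash).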

Good luck! And please report back if you have any results :slight_smile:


Before I shut the machine down for a few hours: did you note that the last snapshot (of 12) was created yesterday? It seems unlikely that the HDD is faulty (S.M.A.R.T. says it’s good). And do you think restic needs all of the 32 GB of RAM?

In the past restic has surfaced many system faults, so I would not dismiss the information @fd0 has provided.
There are many ways this could have failed, as he laid out above.

Unlikely, but still possible.

This isn’t relevant; if the memory is faulty, the question is whether restic used any of the faulty regions.

I’d just like to note that S.M.A.R.T. can be very unreliable, in my experience and in reports I’ve read.

SMART is very good at showing that there are problems, but it cannot reliably show that there aren’t problems. (In other words, the only time you can really trust it is when it says that something is wrong.)


So I’ve run restic check --read-data since my last log-in.
There are hundreds of ciphertext mismatches:
wanted ####### got ######## by the bucketload.
So whether my computer’s HDD is jiggered, the storage device has suddenly gone bad, or all the memory is faulty, that still doesn’t answer my question:
I need to recover my data.
Restic never presented itself to me as “a back-up solution (but only if your hardware is 100% new and faultless)”.
If I’d known it was risky, I’d have just copied the directories onto the HDD.

Any backup tool is going to depend on your storage media and RAM not being faulty. “100% new” is unhelpful hyperbole; nobody claims that hardware has to be new.

But it does have to be free of uncorrectable faults. This applies no matter what software you are running.

And you’d likely have the same problem with this approach, or even with any other backup tool.

You can’t blame software for malfunctioning in the presence of hardware errors. The software can only run as well as the hardware allows it.

There are two important steps you missed in a proper backup procedure, and this applies to all backup systems, not just restic:

  • You need to regularly test your backups by restoring them on a different system. restic check --read-data is also good, but it only tells you that what restic has stored is intact; it doesn’t tell you whether restic stored the right data to begin with.
  • The standard 3-2-1 backup process has you keep two copies on-site and an additional copy off-site. You should periodically test all copies. Betting your backups on a single disk is risky, as you have found out.
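The “restore and compare” part of that first step can be as simple as a recursive diff. This sketch simulates it with scratch directories; in practice `SRC` would be your live data and `RESTORED` the output of a `restic restore ... --target "$RESTORED"` run (paths here are hypothetical):

```shell
# Scratch directories stand in for the live data and the restored copy.
SRC=$(mktemp -d)
RESTORED=$(mktemp -d)

mkdir -p "$SRC/docs" "$RESTORED/docs"
printf 'report v1' > "$SRC/docs/report.txt"
printf 'report v1' > "$RESTORED/docs/report.txt"

# Identical trees: diff -qr exits 0 and prints nothing.
if diff -qr "$SRC" "$RESTORED" > /dev/null; then
  result1=match
else
  result1=mismatch
fi

# Corrupt one restored file and check again: diff -qr now exits non-zero.
printf 'report v1 (bit flip)' > "$RESTORED/docs/report.txt"
if diff -qr "$SRC" "$RESTORED" > /dev/null; then
  result2=match
else
  result2=mismatch
fi

echo "before corruption: $result1, after corruption: $result2"
```

Anything other than a clean exit on the real restored tree means the backup or restore path mangled something.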

I know that none of this is what you want to hear and I’m not trying to “shame the victim.” However, putting the blame on restic when either your RAM or storage hardware appears to be faulty is neither productive nor accurate. It gives you a target for your anger and frustration right now, but it won’t help you avoid this problem in the future.


I get that you’re probably frustrated right now, but this has nothing to do with restic.
If you have hardware that is faulty this could’ve happened as well if you “just” copied the data to an external drive.
I 100% agree with @cdhowie’s post.

Well said @cdhowie.

I’m new to restic, but I still check my backup once in a while.

And when I know I’m going to wipe my system I’m extra careful. I do a regular backup, I make a copy with rsync to an external drive and I check that both are valid. Having both never failed me so far.

Phew !

I think 6 hours of running Memtest with no errors is enough, it’s at pass #Gazillion or something like that.

So now I’m going to have a crack at moving a few large files off the install HDD and see whether they maintain integrity upon return, then do the same for the backup drive.

That’s a good plan – I would also suggest moving the restic repository (and everything else) off of the backup drive, then totally filling it up with some data that you can compare with e.g. diff -qr or a checksum utility. It’s possible that any damage on that HDD could exist only where the restic repository now lives.
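One way to do the fill-and-verify test is with a checksum manifest rather than `diff -qr`, so you don’t need to keep a second copy around. A minimal sketch, with a temp directory standing in for the drive’s mount point (`TARGET` is hypothetical; in practice you’d write enough data to fill the drive):

```shell
# TARGET would be the external drive's mount point; a temp dir stands in here.
TARGET=$(mktemp -d)

# Write some known test data.
mkdir -p "$TARGET/testdata"
for i in 1 2 3; do
  printf 'block %s\n' "$i" > "$TARGET/testdata/file$i"
done

# Record checksums of everything written.
( cd "$TARGET" && find testdata -type f -exec sha256sum {} + > manifest.sha256 )

# Later (ideally days later), re-verify: sha256sum -c exits non-zero on any mismatch.
if ( cd "$TARGET" && sha256sum -c --quiet manifest.sha256 ); then
  verify_result=intact
else
  verify_result=corrupt
fi
echo "verification: $verify_result"
```

Leaving a gap between writing the manifest and re-checking is what catches the delayed degradation mentioned above.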

Note that it’s also possible that the degradation happens over time and not immediately. You might want to leave the data on there for a few days and then check again, to rule that out as well.

Can you please try running sha256sum on a few of the files on the disk containing the repository? It would be helpful to know if the hash is equal to the file name for almost all or just a few files so we can assess the damage and make an educated guess what may have happened…

I’ve written a small program to do that in parallel here: https://github.com/fd0/psha

You can build it with Go >= 1.11 by checking out the repo and then running go build. You can call it like this: ./psha /path/to/repo. For a working, healthy repository it should only complain about the file config.

If I were you (but I’m not), I would use Btrfs or ZFS. Even with no redundancy on a single drive. The benefit is that all data and metadata extents/blocks are checksummed as they are written. If the checksums differ when reading back, you’ll get an error. An error is generally preferred to silent corruption.

I’m up to hostname “b15”, which means I’ve built (or bought in the case of laptops) 15 computers. And at least that many work laptops/desktops.

I can tell you from experience, the newness and/or relative horsepower of hardware has nothing to do with whether or not it can or will corrupt data.

There are several things that you as a user can do to improve the reliability of most desktop setups, and even laptops:

  • Use ECC RAM. If your system doesn’t support it, make a note to use it next time. As RAM capacities grow exponentially, the odds of cosmic rays flipping bits and corrupting data grow too.
  • Store your working data and backups on Btrfs or ZFS redundant arrays. Even if working on a laptop.
    • Both Btrfs and ZFS work great on mirrored SSDs and NVMe. (Or even single SSD/NVMe, to give you silent corruption detection.) Neither is quite as fast as ext4 or NTFS, but still significantly faster than HDDs.
    • Since both are complex software and occasionally have bugs, it’s not a bad idea to use both filesystems on different machines: one for working data on a laptop, the other for local backups on a server. I also stagger my Linux updates to reduce the odds of a bad update killing all my data at once.
  • For each desktop/server, split the drives (even if just two) across more than one storage controller. Storage controllers are common sources of silent corruption. (It doesn’t happen often, but when it does, it’s often a tossup between memory, storage controller, or drive.)
    • For example, two SAS controllers with SATA fanout cables - or for a cheaper solution, one on-board SATA controller and one cheap PCIe SATA card.
    • I even do this with external chassis. For example, one server has three 5-slot external chassis connected, and each 3-way mirror spans all three. One chassis is plugged into a dedicated USB 3.0 card, the second into a second dedicated USB 3.0 card, and the third into a dedicated eSATA card. That way I can lose two whole controllers, or two whole chassis, or even have a catastrophic driver problem, and not lose data.
  • Make sure the drives are different brands or models, or, less ideally, batches of the same model widely separated in time. This goes for SSDs, and especially for HDDs.

Good luck.