In the past restic has surfaced many system faults, so I would not dismiss the facts and the info @fd0 has provided.
There are many ways this could’ve failed, as he laid out above.
Unlikely, but still possible.
This isn’t relevant; if the memory is faulty, the question is whether restic used any of the faulty regions.
I’d just like to note that S.M.A.R.T. can be very unreliable, in my experience and in reports I’ve read.
SMART is very good at showing that there are problems, but it cannot reliably show that there aren’t problems. (In other words, the only time you can really trust it is when it says that something is wrong.)
So I’ve run "restic check --read-data" since my last log-in.
There are hundreds of ciphertext mismatches ("wanted ####### got ########") by the bucketload.
So whether my computer’s HDD is jiggered, the storage device has suddenly gone bad, or all the memory is faulty, that still doesn’t answer my question.
I need to recover my data.
Restic never presented itself to me as “a back-up solution (but only if your hardware is 100% new and faultless)”.
If I’d known that it was risky I’d have just copied the directories onto the HDD.
Any backup tool is going to depend on your storage media and RAM not being faulty. “100% new” is unhelpful hyperbole; nobody claims that hardware has to be new.
But it does have to be free of uncorrectable faults. This applies no matter what software you are running.
And you’d likely have the same problem with this approach, or even with any other backup tool.
You can’t blame software for malfunctioning in the presence of hardware errors. The software can only run as well as the hardware allows it.
There are two important steps you missed in a proper backup procedure, and this applies to all backup systems, not just restic:
- You need to regularly test your backups by restoring them using a different system.
restic check --read-data is also good, but this only tells you that what restic has stored is intact; it doesn’t tell you whether restic stored the right data to begin with.
- The standard 3-2-1 backup process has you keep two copies on-site and an additional copy off-site. You should periodically test all copies. Betting your backups on a single disk is risky, as you have found out.
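To make the first point concrete, here is a minimal sketch of a test restore. The repository and source paths are placeholders, and it assumes restic is on the PATH:

```shell
#!/bin/sh
# Hypothetical paths -- replace with your own repository and backed-up directory.
REPO=/path/to/repo
SRC=/home/user/documents

# Skip gracefully on machines without restic installed.
command -v restic >/dev/null 2>&1 || { echo "restic not installed; skipping"; exit 0; }

# Restore the latest snapshot into a scratch directory...
restic -r "$REPO" restore latest --target /tmp/restore-test

# ...then compare the restored tree against the live data.
# diff -qr prints one line per differing or missing file and
# exits non-zero if anything does not match.
diff -qr "$SRC" "/tmp/restore-test$SRC"
```

Ideally you would run this on a second machine, so that a fault on the original machine can’t mask itself.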
I know that none of this is what you want to hear and I’m not trying to “shame the victim.” However, putting the blame on restic when either your RAM or storage hardware appears to be faulty is not productive nor accurate. It gives you a target for your anger and frustration right now, but it won’t help you avoid this problem in the future.
I get that you’re probably frustrated right now, but this has nothing to do with restic.
If your hardware is faulty, this could’ve happened just as well had you “just” copied the data to an external drive.
I 100% agree with @cdhowie’s post.
Well said @cdhowie.
I’m new to restic, but I still check my backup once in a while.
And when I know I’m going to wipe my system I’m extra careful. I do a regular backup, I make a copy with rsync to an external drive and I check that both are valid. Having both never failed me so far.
I think 6 hours of running Memtest with no errors is enough, it’s at pass #Gazillion or something like that.
So now I’m going to have a crack at moving a few large files from the install HDD and see if they maintain integrity upon return, then the same for the backup drive.
That’s a good plan – I would also suggest moving the restic repository (and everything else) off of the backup drive, then totally filling it up with some data that you can compare with e.g. diff -qr or a checksum utility. It’s possible that any damage on that HDD exists only where the restic repository now lives.
Note that it’s also possible that the degradation happens over time rather than immediately. You might want to leave the data on there for a few days and then check again, to rule that out as well.
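A hedged sketch of that fill-and-verify procedure (the target directory is a placeholder defaulting to a temp dir; for a real test, point it at the suspect drive and raise the file count/size until the disk is nearly full):

```shell
#!/bin/sh
# TARGET should be a directory on the drive under test,
# e.g. /mnt/backupdrive/fill-test; mktemp is just a safe default.
TARGET="${TARGET:-$(mktemp -d)}"

# Write a few files of pseudo-random data. For a real test,
# increase count until the drive is mostly full.
for i in 1 2 3; do
    dd if=/dev/urandom of="$TARGET/file$i" bs=1M count=32 status=none
done

# Record a checksum manifest on a DIFFERENT disk...
( cd "$TARGET" && sha256sum file1 file2 file3 > /tmp/fill-test.sha256 )

# ...and verify it now, then again after the data has sat for a few days.
( cd "$TARGET" && sha256sum -c /tmp/fill-test.sha256 )
```

sha256sum -c re-reads every file and prints OK or FAILED per line, so a silent flip anywhere in the data will show up.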
Can you please try running sha256sum on a few of the files on the disk containing the repository? It would be helpful to know whether the hash equals the file name for almost all files or just a few, so we can assess the damage and make an educated guess about what may have happened…
I’ve written a small program to do that in parallel here: https://github.com/fd0/psha. You can build it with Go >= 1.11 by checking out the repo and then running go build. You can call it like this: ./psha /path/to/repo. For a working, healthy repository it should only complain about the file config, whose name is by design not its own checksum.
If I were you (but I’m not), I would use Btrfs or ZFS. Even with no redundancy on a single drive. The benefit is that all data and metadata extents/blocks are checksummed as they are written. If the checksums differ when reading back, you’ll get an error. An error is generally preferred to silent corruption.
I’m up to hostname “b15”, which means I’ve built (or bought in the case of laptops) 15 computers. And at least that many work laptops/desktops.
I can tell you from experience, the newness and/or relative horsepower of hardware has nothing to do with whether or not it can or will corrupt data.
There are several things that you as a user can do to improve the reliability of most desktop setups, and even laptops:
- Use ECC RAM. If your system doesn’t support it, make a note to use it next time. As RAM capacities grow exponentially, the odds of cosmic rays flipping bits and corrupting data grow too.
- Store your working data and backups on Btrfs or ZFS redundant arrays. Even if working on a laptop.
- Both Btrfs and ZFS work great on mirrored SSDs and NVMe. (Or even single SSD/NVMe, to give you silent corruption detection.) Neither is quite as fast as ext4 or NTFS, but still significantly faster than HDDs.
- Since both are complex software and occasionally have bugs, it’s not a bad idea to use both filesystems on different machines. One for working data via laptop, the other for local backup on a server. I also stagger my Linux updates to reduce the odds of a bad update killing all my data at once.
- For each desktop/server, split the drives (even if just two) across more than one storage controller. Storage controllers are common sources of silent corruption. (It doesn’t happen often, but when it does, it’s often a tossup between memory, storage controller, or drive.)
- For example, two SAS controllers with SATA fanout cables - or for a cheaper solution, one on-board SATA controller and one cheap PCIe SATA card.
- I even do this with external chassis. For example, one server has three 5-slot external chassis connected. Each 3-way mirror spans all three chassis. One chassis is plugged into a dedicated USB 3.0 card. The second is plugged into a second dedicated USB 3.0 card. The third is plugged into a dedicated eSATA card. That way, I can lose two whole controllers, or two whole chassis, or even have a catastrophic driver problem, and not lose data.
- Make sure the drives are different brands, models, or less ideally, batches of the same model widely separated in time. Whether SSD or especially HDD.
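To make the filesystem-level checksumming concrete, here is a sketch of what the Btrfs variant might look like. The device names and mount point are placeholders, and these commands are destructive and require root, so treat this as illustration rather than something to paste blindly:

```shell
# Create a two-disk Btrfs mirror with both data and metadata in RAID1:
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
mount /dev/sdb /mnt/backup

# Periodically scrub the filesystem: every block is read back and
# checked against its stored checksum; a bad copy is repaired from
# the healthy mirror, and the errors are logged.
btrfs scrub start -B /mnt/backup
btrfs scrub status /mnt/backup
```

Even on a single drive (no raid1), the scrub still detects corruption; it just can’t repair data blocks.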
I think this will be the last time I try and run restic as a data security programme - it just doesn’t work.
What have I had to do so far ?
buy new RAM and an SSD (because of the checksum mismatch) - it did nothing; a trial restore failed.
uninstalled then reinstalled the application.
buy a new local storage device
run a trial backup and restore - failed again.
I’m at my wits end, haven’t a clue what “split the drives across two controllers” even means.
I’ve wasted so much time and money on what was supposed to be quick and painless backing up and restoring that there’s no possible way I could recommend restic to anyone I know whom I don’t hate.
If using SATA drives, for example, put each drive on a different SATA controller so that a single controller problem can’t silently corrupt both backups. The same applies to USB drives. The simplest way to ensure the drives are on different controllers is to use one internal SATA drive and one external USB drive.
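On Linux you can see which controller each drive hangs off via sysfs; a small sketch (assumes a Linux system with /sys mounted):

```shell
#!/bin/sh
# For each disk, print the kernel's full device path. The PCI address
# near the start of the path identifies the controller; two disks whose
# paths share the same PCI address share a controller (and its faults).
for dev in /sys/block/sd* /sys/block/nvme*; do
    [ -e "$dev" ] || continue
    printf '%s -> %s\n' "${dev##*/}" "$(readlink -f "$dev")"
done
```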
This is a little harsh and unnecessarily antagonistic to both the restic author and the forum volunteers who have tried to help you. None of us get paid for this work.
It’s entirely possible that this is a restic bug that, for whatever reason, is only surfacing for you. In my opinion, based on the symptoms, it’s more likely there is a hardware problem that we haven’t isolated yet. Many people use restic daily without issue. We are using it across dozens of servers and have not had any instances of corruption or failed restores yet.
If you use another tool, make sure that it also stores checksums of your data, and validate them. If my suspicions are correct, sooner or later you’re going to run into the same issue with another tool.
On a more general note, I would be wary of the USB controller, even though I’m not saying it is the cause of the problems. Is it possible to (physically) mount the hard drive on a SATA controller/cable internally in a computer, and try running restic and the check @fd0 mentioned there?
PS: I understand that you are frustrated. It makes total sense.
I have no idea what @fd0 means
I don’t know what running sha256sum on a few files means
I have no clue what is meant by “you can build it with Go>=1.11”
and I’ve connected the 2.5" drive into the PC on a spare SATA channel with no change.
I did however run restic -r /path/to/backup forget and forgot a few of the 20 snapshots,
then I ran prune and it seemed to fail with a final line
“hash does not match id: want 4fbe166121638c2125260e964d55f8c10526468b4d77dd221f03d3283c206258, got 1cc3131f6cc1c5442d09e444c8723cf3a897da4d01b9710a8a9433b9dddeb30c”
And the reason why I’m having another go at Restic is that I’ve upgraded the computer and the last data restoration was successfully completed running deja-dup.
So my question was, and remains, what’s stopping Restic
- memtest didn’t throw any errors after 4 hours
- I don’t know how the check the (now) internal HDD for errors
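On the second point, one common approach (not suggested in this thread, so treat it as a hedged suggestion) is to run the drive’s own SMART self-test via smartmontools; /dev/sdX below is a placeholder for the actual device:

```shell
#!/bin/sh
# Replace /dev/sdX with the drive to test (see lsblk for names).
DEV=/dev/sdX

# Bail out cleanly if smartmontools is missing or the device is absent.
command -v smartctl >/dev/null 2>&1 || { echo "install smartmontools first"; exit 0; }
[ -b "$DEV" ] || { echo "no such device: $DEV"; exit 0; }

# Kick off the drive's built-in long self-test (it runs inside the
# drive itself and typically takes an hour or more):
smartctl -t long "$DEV"

# Later, review the self-test results and error counters:
smartctl -a "$DEV"
```

As noted earlier in the thread, a clean SMART result doesn’t prove the drive is healthy; only a bad result is conclusive.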
It means that you run a program that computes a checksum (hash) for file(s). Restic stores the files in such a way that the name of the file is the same as the checksum/hash of the contents of the file. E.g. if the checksum/hash for a given file’s contents is “abcd1234”, then restic would save that file with the name “abcd1234”.
The purpose here is that one can later on check if the file has become corrupt by computing the checksum/hash of the file’s contents again - if that checksum/hash matches the name of the file, then the file is fine, but if the checksum/hash of the file’s contents no longer matches the name of the file, then the file no longer has the same contents it had when it was saved the first time.
Linux, macOS, etc. have tools to check various types of checksums/hashes, and one of those types is named “SHA256”. On Linux the command for that tool is sha256sum, and you run it with the file you want to check as the argument, e.g. sha256sum thefilename. It will then output the SHA256 checksum/hash of that file’s contents, and you can compare it with the name of the file to see if they match, as per above.
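That check can be scripted. A minimal sketch, assuming the usual restic layout where pack files under data/ are named after the SHA-256 of their contents (the repository path is a placeholder):

```shell
#!/bin/sh
# Compare every file in the repository's data/ directory against
# its own name; mismatches indicate on-disk corruption.
REPO="${REPO:-/path/to/repo}"

find "$REPO/data" -type f 2>/dev/null | while read -r f; do
    name=$(basename "$f")
    sum=$(sha256sum "$f" | cut -d' ' -f1)
    if [ "$sum" != "$name" ]; then
        echo "MISMATCH: $f (got $sum)"
    fi
done
```

No output means every checked file still matches its name; each MISMATCH line is a corrupted file.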
He was giving you information about how to build the tool named psha. If you want, I can build it for you (so you get a single binary that you can run to check the hashes of your entire repository in one go). Do you want that? If yes, let me know what operating system you are on and whether or not it’s 64-bit.
So, to clarify, this is the disk that you backed up your system to before reinstalling it the “first” time (before you started this forum thread), correct?
Did you do this on the backup repository on that disk that you have had problems with since starting this forum thread? I’m not entirely sure why you’d start modifying the repository that we’re trying to debug, hence the question.
Can you elaborate - in what way did you upgrade it? The OS or the hardware? Do you mean that you were able to restore your files from another backup, or where/how does the deja-dup come into the picture?
“Did you do this on the backup repository on that disk that you have had problems with since starting this forum thread? I’m not entirely sure why you’d start modifying the repository that we’re trying to debug, hence the question.”
I still don’t know how I can check whether a new external HDD has “problems”. It is the drive that I purchased specifically to hold the restic repository. I agree that it’s possible for it to have problems even though it’s new and unused by me, but since it isn’t demonstrating any read/write inconsistencies, any wild fluctuations in write speed, or any corrupted files copied from it, I don’t know how the drive that’s holding the snapshots can be diagnosed as the source of any problems.
I have removed the screws from the case, removed the actual drive and connected to the motherboard via SATA and power cables.
The host PC RAM has run a memory test for a day and didn’t indicate any issues.
I have upgraded to Ubuntu 19.10, and Deja-Dup came into the picture because I am unable to have any confidence in Restic. I can make snapshots, but I have issues pruning them (I don’t need the data from when I asked here for a tip last September).
I’ll do some more research and try to run sha256sum on a file within the repository,
and it’s restic 0.9.6 from the Snap store.