Random ciphertext verification failed

restic 0.16.4 compiled with go1.22.5 on linux/amd64

I am getting random ciphertext verification failed errors during restore.

ignoring error for /home/marky/.local/share/Steam/steamapps/common/Helldivers 2/data/df732158f005b184.stream: ciphertext verification failed
ignoring error for /home/marky/.local/share/Steam/steamapps/common/Helldivers /data/de8eccc7aa419fc2.stream: ciphertext verification failed
ignoring error for /home/marky/.local/share/Steam/steamapps/common/Helldivers 2/data/09985dc611a3a8b6.stream: ciphertext verification failed

My situation: my last backup was about 4 days ago. I regularly back up to two restic repos, one on-site and one off-site.

I have been distro hopping on Linux trying to find a good option, as things are kind of in flux right now. So I haven’t been updating my backups, but I have restored from them multiple times, in part and in full, over the last 4-5 days.

Randomly, I will get “ciphertext verification failed” for some files. Thinking there might be something wrong with my repo (despite having restored from it multiple times), I ran a full check --read-data, which completed flawlessly.

using temporary cache in /tmp/restic-check-cache-857530271
repository 9f975546 opened (version 2, compression level auto)
created new cache in /tmp/restic-check-cache-857530271
create exclusive lock for repository
load indexes
[0:02] 100.00%  39 / 39 index files loaded
check all packs
check snapshots, trees and blobs
[0:17] 100.00%  152 / 152 snapshots
read all data
[1:37:35] 100.00%  64153 / 64153 packs
no errors were found

I reattempted one of the files that failed above, and it worked fine.

(base) marky@marksville:~$ restic restore --target / --include "/home/marky/.local/share/Steam/steamapps/common/Helldivers 2/data/df732158f005b184.stream" latest
repository 9f975546 opened (version 2, compression level auto)
[0:01] 100.00%  39 / 39 index files loaded
restoring <Snapshot b073ce48 of [/home/marky] at 2024-09-06 17:40:17.548963199 -0400 EDT by marky@marksville> to /
Summary: Restored 10 / 1 files/dirs (13.333 MiB / 13.333 MiB) in 0:00

Usually with hundreds of thousands of files, it is only a handful that get this message.

This repo is my local repo on a TrueNAS SMB share mounted via fstab.

These random errors are making me very uncomfortable depending on restic for my backups.

Hmm… I would say it is the other way around: thanks to restic, you now know with rather high certainty that one or more of your machines has bad hardware.

Depending on only one backup is not optimal. Having another backup is preferable.


How so?

The data is stored on ZFS, which is about as good as it gets for ensuring data integrity.

  pool: tank
 state: ONLINE

config:

        NAME                                      STATE     READ WRITE CKSUM
        tank                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            5748e94e  ONLINE       0     0     0
            c419a255  ONLINE       0     0     0
            fb1d244d  ONLINE       0     0     0
            62ea06fb  ONLINE       0     0     0
            f26c9c731 ONLINE       0     0     0
            b9727b5c  ONLINE       0     0     0

errors: No known data errors

Depending on only one backup is not optimal. Having another backup is preferable.

I have two unique backups as I stated, both using Restic. One on-site, one off-site.

You should probably start by running memory tests.
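From a running system, something like this gives a quick (if partial) check, assuming the memtester package is installed; the size is just an example and should be adjusted to what the machine can spare. A bootable Memtest86+ run overnight is more thorough, since it can cover nearly all of the RAM:

# lock and test ~4 GiB of RAM for two passes
sudo memtester 4G 2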

I have, and I have no hardware issues.
Everything in my workflow is full-disk encrypted: my workstation, my NAS, and my backups. Nothing has had a single error regarding data integrity outside of Restic.

Restic backups are stored on a ZFS volume with aggressive data integrity tools.
I regularly scrub my ZFS and BTRFS volumes.

I also do monthly test partial restores.
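By scrubbing I mean just the standard commands, roughly (the pool name is the one from my zpool status above; the btrfs mount point is only an example):

# on the NAS
zpool scrub tank
zpool status -v tank

# on the workstation, for each btrfs mount point
sudo btrfs scrub start /
sudo btrfs scrub status /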

Which involved hosts did you run the memory tests on?

Can you try to reproduce the problem with the latest 0.17.1 release of restic?

Please also provide all the relevant information such as the repository and other relevant environment variables for restic, so we get a more complete picture. Generally speaking, it’s best to always include the complete commands, including env vars, and all of their output. By the way, see if you can still edit your initial post to add that information, it might be possible.

I’d have to manually download 0.17.1 as my distro isn’t providing it. Will do that in a bit.

This gets put into the environment prior to backup/restore. When using the remote repo, I use S3 variables instead.

export RESTIC_HOST=""
export RESTIC_PATH="$HOME"
export RESTIC_REPOSITORY=""
export RESTIC_PASSWORD_FILE=

As for what is executed:

restic restore --target / --include /home/marky latest

Are you telling me that your RESTIC_REPOSITORY env var is empty when you run the restore command? If not, please tell us what its value is.

No, it isn’t, but of course I sanitized the output.
It’s a local path on my system, /mnt/backup.

He’s not wrong. Restic is almost a diagnostic tool, with the way it inadvertently stress-tests hardware. I’ve discovered bad RAM chips and failing drives using it several times now.

I tested my RAM and found no errors.

I have done a full read check of 1.2TB of files (1.1M total files) and not a single error. But when I restore, I always have 1-4 files that show an error. If I retry those files, they always work successfully.

The files are on the most data-integrity-safe file system there is (ZFS), with its own aggressive checksum-based error detection and correction.

I think what you described (random, non-repeatable failures) should rather boost your confidence in restic and lower it in other parts of your setup.

Indeed, I would not suspect ZFS to be responsible, but are you using ECC RAM? On the surface this looks exactly like RAM errors.

@marky Please be more specific. I wish you would provide the information in a clearer manner, as it’s otherwise hard to get an overview.

Please edit your first post if possible to reflect the full commands that you run, including env var settings before them, preferably put a hostname like “nas” and “client” or whatever before the prompt so that it is clear where you are running the command, and so on.

Is the repository stored on the TrueNAS and then mounted from/on another computer where you run the restic client, using SMB?

Please tell us what operating system and kernel it is that’s running on the client, and if you don’t mind also the NAS.

How did you test your RAM, and which of the two systems did you do it on (both, or just one of them)?

To answer what you say about ZFS and such: I agree, it’s superb, I use it myself in many places and love it, and I think it’s great for data integrity. But we cannot rule out anything at this point, and the symptoms so far sure do smell like hardware issues or something software related, e.g. with SMB. Either way, nothing is ruled out, and there’s nothing else to do than debug the issue systematically.
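One simple way to sanity-check the SMB transport, for example, is to push a large file over the mount, drop the page cache so the read-back actually goes over the network, and compare checksums. The paths below are just placeholders for a directory on the same SMB share (a stray test file next to the repo should be harmless as long as it’s removed afterwards):

# create a 2 GiB test file locally
dd if=/dev/urandom of=/tmp/smbtest.bin bs=1M count=2048
# copy it onto the SMB mount, then read it back
cp /tmp/smbtest.bin /mnt/backup/smbtest.bin
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'   # so the read-back is not served from the local cache
cp /mnt/backup/smbtest.bin /tmp/smbtest-back.bin
# the two checksums must match
sha256sum /tmp/smbtest.bin /tmp/smbtest-back.bin
# clean up
rm /tmp/smbtest.bin /tmp/smbtest-back.bin /mnt/backup/smbtest.bin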

check and restore stress the system in slightly different ways; in particular, restore also has to wait until the restored data has been written to disk. Ciphertext verification failures that change on each run mean there’s something nondeterministic going on. restic 0.16.4 is by now old enough that I’m pretty confident we’d have learned about bugs that only show up when processing petabytes of data. So the most likely explanation is some bitflip in your hardware, with CPU and memory being the most common culprits.

The ciphertext errors will likely just disappear with that version. Since version 0.17.0, restic silently retries loading blobs that failed to decode the first time. As the ciphertext errors seem to be very rare, a second try should be enough to solve the problem.

It won’t let me edit my first post; I only seem to be able to delete it.

Please be more specific. I wish you would provide the information in a clearer manner, as it’s otherwise hard to get an overview.

I provided most of the information requested, but I will summarize here. I feel I have been very detailed in my explanations. I will do my best to provide anything I have not.

Env

export RESTIC_HOST="mypc"
export RESTIC_PATH="$HOME"
export RESTIC_REPOSITORY="/mnt/backup"
export RESTIC_PASSWORD_FILE=.restic/passwd

Is the repository stored on the TrueNAS (ZFS) and then mounted from/on another computer where you run the restic client, using SMB?

Repo is stored on TrueNAS and exposed to my PC via an SMB share. I am mounting the SMB share via fstab on my PC so as to enable multichannel. To be clear, /mnt/backup is an SMB share from TrueNAS mounted on my machine via /etc/fstab.

Please tell us what operating system and kernel it is that’s running on the client, and if you don’t mind also the NAS.

NAS

Linux truenas01 6.6.32-production+truenas #1 SMP PREEMPT_DYNAMIC Mon Jul 8 16:11:58 UTC 2024 x86_64 GNU/Linux

Workstation

Linux mypc 6.10.7-pikaos #101pika4 SMP PREEMPT_DYNAMIC Sat Aug 31 05:34:32 EDT 2024 x86_64 GNU/Linux

fstab entry

 //192.168.1.21/backup /mnt/backup cifs credentials=/etc/cifs-credentials,multichannel,gid=1000,uid=1000 0 0

restic command

restic restore --target / --include /home/marky latest

I have run a full check --read-data on both repos (one stored on NAS, one stored remotely on S3, two completely separate backups, not “cloned”) and they have zero errors.

full check

restic check --read-data

I have a monthly task where I manually do a restic check --read-data-subset=5G on both local and remote repo. I do this without fail, on the first or second day of the month.
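Concretely it’s something like the check run against each repo in turn (the S3 URL here is a placeholder for the sanitized real one, and the S3 credential variables are exported the same way as for backups):

# local repo
RESTIC_REPOSITORY=/mnt/backup RESTIC_PASSWORD_FILE=.restic/passwd restic check --read-data-subset=5G
# off-site repo
RESTIC_REPOSITORY=s3:s3.example.com/bucket RESTIC_PASSWORD_FILE=.restic/passwd restic check --read-data-subset=5G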

The repo is 1.17TB, and about 1.1M files. Not small, so serious errors should be fairly apparent.

When I do a restore, 1-4 files will give a ciphertext error, but they will be fine if I try to restore each one individually after the restore completes. A full 100% check --read-data across 1.2TB is flawless on both repos.

I was using restic 0.16.4; I downloaded the 0.17.1 binary and tested with that, and had the same results. Out of 1.1M files restored, one file had a ciphertext error.

This one file had a ciphertext error; it was still restored, but with errors.

I compared what the file should be (from a restore of the same repo and same snapshot just 3 days ago) with what it is after this restore.

Re-running the restore just for that one file, it restored fine and is as expected.
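For anyone wanting to do the same comparison, it boils down to checksumming the two copies of the affected file against each other (both paths below are just placeholders):

# copy from the restore that reported the error vs. the copy from the clean re-restore
sha256sum /home/marky/path/to/affected-file.stream
sha256sum /tmp/restore-retry/home/marky/path/to/affected-file.stream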

For testing RAM, I used MemTest, and to be 100% sure I reran it (from an ISO, so only hardware involved) on both the NAS and the workstation, with no errors.

I’ve provided additional detail here.

Honestly… this seems familiar. But I never could reproduce it, and it never happened again. The snapshots before and after were completely intact, but one snapshot in particular had just a handful of CRC mismatches.

Mine was backing up an external SSD to the cloud… but I had also used rhash to embed CRC codes in the filename for occasional “scrubbing” purposes. That’s how I caught it.
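In case it helps anyone, the rhash part is essentially this (the path is just an example):

# rename each file to embed its CRC32, e.g. clip.mp4 -> clip [1A2B3C4D].mp4
rhash --embed-crc -r /mnt/external-ssd/
# later, verify every file against the CRC embedded in its name
rhash --check-embedded -r /mnt/external-ssd/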

Question. Have you done another snapshot after this? Do those files appear OK? What about the snapshot before the corrupted one, if there is one?

Just curious if we’ve run into the same bug, or if this is unrelated… I could never prove or disprove whether it was my hardware or Restic, but I have yet to have this issue again, and I’m still using the same drive, which has zero issues according to the SMART data. The original files on the drive check out, as do the snapshots immediately before and after. It’s just that one snapshot with randomly corrupt data. I should add that I don’t get any ciphertext errors, however. So maybe it’s unrelated. But this will forever be my white whale until I figure out what in the world happened haha

I haven’t tested other snapshots; I am using latest to restore.

But a full check with --read-data is 100% flawless: no errors on 1.1M+ files, on two repos that back up the same data but don’t clone each other, one on-site and one off-site.

I have a feeling other snapshots will have the same problem, as each time I do a restore it is a different file that complains, and re-attempting that file has always worked. Out of 1.1M files, it is usually 1-4 files that get this error.

I have no idea why it keeps doing it. People here say it is memory errors, but I did a full 4-pass test on my memory with zero errors.

Hmm. Man that really does sound like memory errors. What are you using to test the memory? I’d want to run Memtest86+ overnight, myself. Maybe replace the drive cables too?

Yeah, your issue is different from mine. Mine is always the same files, and only from that snapshot. The snapshot immediately before, and all the ones after, are perfectly fine.

Have you tried 0.17.1 as per Michael’s suggestion? Would be interesting to see if the retries will sidestep the problem or if it’s not just temporary.
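You could also add --verify to the restore command; if I remember correctly it makes restic re-read the restored files from disk and compare them against the repository, which should tell you whether the bad data actually made it to disk or was only a transient error on the read path:

restic restore --target / --include /home/marky --verify latest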