Random ciphertext verification failed

Yes, I posted that above.
I did two full (1.1M files) restores on 0.17.1; one had a single ciphertext failure on one file and the second had no failures.

I don’t know if it is of any diagnostic relevance, but all my restore attempts use two 10Gbit NICs on two subnets, which lets me use SMB multichannel for an aggregate of about 20Gbit. The restore that was successful was done by booting Arch in single user mode, manually assigning an IP to one NIC, and restoring /home before X was up. I doubt this was a factor, as the difference between 0 errors and 1-4 errors across 1.1 million files is a rounding error, and I know my SMB setup works without any errors under heavy use. I’ve also re-done full memory tests on the workstation and server, as well as a full scrub of the server’s ZFS pool.

I used the latest MemTest86 version 11 ISO, doing the full 4 passes.
As for drive cables, my machine is using NVMe, and the remote box is using ZFS, which, like restic, is very aggressive about data protection and won’t let hardware errors go unnoticed. I’ve also done full scrubs and tests on the memory and drives of the server. A full check with --read-data, which reads all data in the repository, was 100% successful on both the local and remote repository. The repository seems to be perfectly in order, especially since each restore complains about a different file having a ciphertext error, yet reattempting always seems to work and fixes any corruption.
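
For reference, the check I ran was essentially this (the repository path here is just an example, not my actual mount point):

    restic -r /mnt/nas/restic-repo check --read-data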

Hi Marky,

I don’t know if it is of any diagnostic relevance, but all my restore attempts use two 10Gbit NICs on two subnets, which lets me use SMB multichannel for an aggregate of about 20Gbit. The restore that was successful was done by booting Arch in single user mode, manually assigning an IP to one NIC, and restoring /home before X was up. I doubt this was a factor, as the difference between 0 errors and 1-4 errors across 1.1 million files is a rounding error, and I know my SMB setup works without any errors under heavy use. I’ve also re-done full memory tests on the workstation and server, as well as a full scrub of the server’s ZFS pool.

I think this is very relevant. Have you tested booting Arch in single user mode as you described and doing restores multiple times? Does it consistently produce 0 ciphertext errors? What if you configure each ethernet interface in turn to compare the results?

I know next to nothing about SMB multichannel. With that qualifier, is it possible that the issue is one of the ethernet cards? Or perhaps could it be kernel driver related? I know you said that SMB works without any errors under heavy use. But restic is quite unusual in how computationally intensive it is, and at the same time it cryptographically validates all data, in user space. 1-4 errors in upwards of a terabyte of data is small, but clearly significant. Is it possible there are, in fact, very infrequent issues with your SMB setup that just aren’t being detected?

Alternately, can you try testing your restores with different hardware?

I understand that the advice you are receiving may sound dismissive of issues with restic. However, people in the forums see these types of issues time and time again, and they almost always end up being hardware related. Bugs of this nature, while not impossible, are quite rare, given that we are talking about different versions of restic, some of which have been in the wild for some time. The likelihood of such a bug being revealed only now seems improbable.

Can you try some of these suggestions and let us know what you learn?

D.

It takes a few hours to do a restore and my system is down when I do, but I will try to do another when I can.

I wouldn’t think so, and I use a lot of encryption (full disk and other) without errors. What really bothers me is how doing a check --read-data on both the local (NAS) and remote (S3) repositories had zero errors.

Maybe; I’m not sure what I have available with that much space. I have to restore all the data, as I don’t think a partial restore would guarantee I see an error, since it is 1-4 errors in 1.1M files.

I will see what I can do about doing more testing, but it will take time to get to them all.

Yeah, I understand.

Is it the same host on which you have been performing backups and testing restores?

Agreed.

And the --dry-run option of restore will only download tree blobs from the pack files, rather than both tree and data blobs. So it’s not equivalent to doing an actual restore either.
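
In other words, roughly the following (repository path and target are just placeholders):

    restic -r /mnt/nas/restic-repo restore latest --target /tmp/restore-test --dry-run
    restic -r /mnt/nas/restic-repo restore latest --target /tmp/restore-test --verify

The first command only walks the snapshot’s trees, while the second actually downloads the data blobs and re-reads the restored files afterwards, so only the latter exercises the full data path.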

I have a TrueNAS server hosting an SMB share on two 10Gbit ports, and a Linux workstation mounting that share, also with two 10Gbit ports. The server hosts the repository on the SMB share (I also have a separate offsite repository on S3 which contains the same data, but as its own backup). The workstation mounts the SMB share and uses it as if it were a local file system.
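
So from restic’s point of view it is simply two repositories, roughly like this (paths and bucket name are made up for illustration):

    restic -r /mnt/truenas/backup/restic-repo snapshots
    restic -r s3:s3.amazonaws.com/my-offsite-bucket snapshots

The first one goes through the SMB mount as if it were local disk; the second talks to S3 directly.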

I have done some testing.

I did another full restore in single user mode. No errors. In single user mode, I only give one NIC an IP, and mount the SMB share via fstab. The share is still mounted with SMB multichannel, which will use multiple connections, but since only one NIC is up, it is limited to that one NIC.
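
Roughly what I do in single user mode, for reference (interface name and addresses here are just examples):

    ip addr add 192.168.10.2/24 dev enp5s0
    ip link set enp5s0 up
    mount /mnt/backup    # cifs entry from fstab, multichannel option still present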

While I don’t think SMB multichannel would be a factor, I did a restore in X using the same scenario: I unmounted the SMB share, disabled my 2nd 10Gbit NIC, and then remounted the share. I did a restore and got one ciphertext error.

Before doing these two tests, I had started a full restore in X with both NICs enabled as usual, and I got one ciphertext error; since it happened really early, I just cancelled it and did the single user mode test and then the single NIC test.

I’m going to do another test, unmounting the shares and removing the multichannel parameter just for kicks, and see if I get ciphertext errors.

I don’t think single user mode really makes a difference, but it is an odd coincidence that both times I restored this way I got no ciphertext errors, whereas I have yet to get an error-free restore while in KDE Plasma (X). I tried to mimic the single user mode environment: a single NIC, since in single user mode I have to bring networking up manually and usually only bring up the first NIC and assign it an IP by hand. The SMB share is still mounted with the multichannel flag when I do this; it just can’t spread connections across another NIC, as only one has an IP.

I can try mounting it via NFS later and see if I still get ciphertext errors. The restores take about an hour, so I will try to get to it when I can.

That’s rather interesting, as restic 0.17.1 tries to read and decrypt the file at least twice. That effectively means the data corruption likely resides in the copy kept in the page cache; otherwise, the retry would have been able to decrypt it. So either the data retrieved via SMB is corrupt, or it gets corrupted on the host running restic. That is also consistent with a later attempt being able to restore the file successfully: by then enough data has been read to push the corrupt copy out of the page cache.
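
If you want to poke at that hypothesis, one crude experiment (just an idea, requires root) would be to retry the failed file once right away and once after dropping the page cache:

    sync
    echo 3 > /proc/sys/vm/drop_caches

If only the retry after dropping the cache succeeds, that would fit a corrupt copy lingering in the cache rather than corruption in the repository itself.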

Are authentication and signing enabled for your SMB setup?

Yes, I provide fstab credentials so it can mount.

Which SMB version is in use? Does it use SMB signing (apparently yes, if SMB2 or newer is used)? On Linux this can be checked using smbstatus.
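
For example, running this on the server should show the protocol version and signing mode per session (exact output depends on the Samba version):

    sudo smbstatus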

I am using SMB3_11, and yes on signing.

Signing: partial(AES-128-GMAC)

I did some testing, and I am more puzzled than ever.

I wanted to eliminate SMB multichannel from the equation, so I edited my /etc/fstab and removed the multichannel parameter. I unmounted the share and remounted it, and confirmed multichannel was disabled.
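
For reference, the fstab entry is along these lines (server name, paths, and credentials file anonymized); for this test I simply dropped multichannel from the option list:

    //truenas/backup  /mnt/backup  cifs  credentials=/etc/smb-credentials,vers=3.11,multichannel,_netdev  0  0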

I did a full restore, and I had no errors. To be scientific, I did another full restore after deleting the files and clearing my snapshots. No errors.

To be even more scientific, I remounted the share with multichannel and both NICs enabled again, and got no errors.

To put this in perspective, I have done 15+ full restores of 1.1M files in the past week or so. This is largely due to distro hopping and running into problems with NVIDIA and Wayland, and not specific to restic. Out of all these restores, every single one has had 1-4 ciphertext errors except for:

2 times using single user mode
2 times yesterday disabling multichannel
1 time yesterday after re-enabling multichannel, i.e. back in the environment that had all the failures.

One significant change though: the last 6 restores were with 0.17.1, whereas previously I was using 0.16.4. With 0.17.1, the most I have seen is 1 ciphertext error, down from typically 4 over a full restore. My gut feeling is that 0.17 is in fact making these errors rarer, but I am still seeing them, just not often enough that it happens on every restore. So I am seeing multiple restores with no such errors, and a few with only one, regardless of the changes I am making to the environment.

In short, I am still seeing the issue, just not as frequently (my guess is due to going from 0.16.4 → 0.17.1), regardless of how many factors I eliminate from my environment.

I haven’t tried using different hardware. I will see what I can do about that, but I don’t have anything available right now. I have, however, tried 6+ fresh installs of my operating system (all Linux) due to distro hopping over the past two weeks and having to repeatedly restore my data.

So the retries to decrypt failed blobs seem to help a bit. That suggests the data corruption likely happens on the host that runs restic, but beyond that we don’t know much. I unfortunately don’t have a good idea how to debug this issue further. It’s just too rare and weird for normal debugging approaches :frowning:

So nothing else I can do?

I don’t feel comfortable depending on Restic if I am going to keep getting errors on restore. So far it hasn’t caused data loss as I can retry and it works, and the repo seems to be intact without errors.

It would be interesting to isolate the issue further by trying to reproduce the problem on another system than the one where you are currently running restic. But I know you said you don’t have another system at this point.

It would also be interesting to take SMB out of the equation to see if you can still reproduce the problem when that isn’t involved at all. Is it possible for you to run rest-server on the NAS to serve the repository (the restic client would then use the REST backend for the repository instead of the locally mounted path of the SMB share)? The repository files are the same, so it’s just a matter of serving the repository directory this way instead. Disclaimer: I’m not the technical genius here, so there’s no guarantee that this will make a difference. It would just be nice to rule the SMB stuff out.
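
Roughly something like this, just to illustrate the idea (paths, port, and hostname are placeholders, and you would probably want proper authentication instead of --no-auth):

    rest-server --path /mnt/tank/backup/restic-repo --listen :8000 --no-auth    # on the NAS
    restic -r rest:http://truenas:8000/ snapshots                               # on the workstation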

As @MichaelEischer mentioned, your particular case is hard to debug for various reasons. More systematic work and time is needed to isolate it further.

It’s not restic you should be concerned about here, but the rest of the infrastructure. Restic is helping you find problems. Also, it tells you when e.g. the restore could not be 100% completed for whatever reason, and verifies the restored files.

This is not the first time that restic has surfaced issues that are not obvious and that don’t show up with other software, but that have turned out to be either hardware or software issues outside of restic.

Please note that restic has in-place restore (described here), so that you don’t have to restore everything all over again if you need to fix some faulty files.
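
For example, something along these lines should re-restore only the affected file into the existing restore target (snapshot ID, path, and flags are just illustrative; see the linked docs for the exact options):

    restic restore latest --target /home --include /path/to/faulty/file --overwrite if-changed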


I did two tests this morning, both on my workstation I’ve been using.

First test was using NFS to TrueNAS instead of SMB. Version 0.17.1, no errors.
I ran the test again, but using 0.16.4, and I got two errors.
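
For what it’s worth, the NFS mount was nothing special, roughly this kind of fstab entry (host and paths are examples):

    truenas:/mnt/tank/backup  /mnt/backup  nfs  defaults,_netdev  0  0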

I am not sure if the two errors are because 0.16.4 is more error prone (this is my guess, as you said 0.17.1 does some retries), or whether the problem just isn’t happening as often.

I do think 0.17.1 makes the errors much rarer. I have only seen 1 error at worst using 0.17.1, and not on every restore, whereas with 0.16.4 I was always seeing 1-4 errors and never had a successful restore.

So the same restore on the same machine where you have previously seen the problems, but with a different protocol, also sees the problems happening.

The other thing to try is a restore on another system. But I get it if you don’t want to spend more time on debugging this. Either way, there’s not much more one can do than play the isolation game and narrow the matter down to where it happens, if one thinks that is of value. In the end, replacing something once it is isolated is probably the solution. I honestly don’t think this is a bug in restic; it’s something outside of it, whether hardware or software related.

I am trying to do a restore from NAS to NAS now. I’ve done many different experiments, trying to provide you with as much information as I can.
