Copy from SFTP repository on NAS fails with "subprocess ssh: Corrupted MAC on input"

Hi folks,

I just recently came across restic. Before that I backed up my data using a self-written bash script and rsync with hardlinks to get something like a snapshot-based backup with retention.

Now I wanted to go more professional, so I set up a restic repository on the NAS (via NFS) and managed to back up using resticprofile with tags etc. I even managed to read in the old backups by doing bind mounts and specifying the original date.
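
In case the approach is useful to others, the idea was roughly the following (a sketch with placeholder paths; restic's --time flag sets the snapshot timestamp and --tag just marks the imported snapshots):

# make one of the old rsync snapshots visible under a stable path
mount --bind /mnt/old-backups/2023-05-01 /mnt/import

# back it up into the repository with the original date
restic -r /path/to/repo backup /mnt/import --time "2023-05-01 03:00:00" --tag imported

umount /mnt/import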

However, when I wanted to copy the repository to another server via SFTP, I ran into corruption errors. After reading a bit I reckoned that using NFS with restic's local backend is not a good idea, so I switched to SFTP, but now other troubles have come up (see below). I have done the following so far:

Updated ssh on the NAS (QNAP TS-421) via Entware to some 7.x version.

Replaced the network cables between the router and the NAS, switched the network ports (the NAS has two, I use one in single mode).

Tried numerous variations of MAC and cipher options for ssh.

Tried numerous ServerAliveX options for ssh.

Watched the debug3 output of sshd on the server (nothing related to the error below).

Set sftp.connections to 2.

Set up a local repository on the main server and copied from the NAS thereto - same problem.

The error I am getting is random, but when copying the whole repository, it will occur at some point. It can occur after copying 20 packs, but sometimes after more than 200 packs.

This is the output I am getting:

restic -r restic/ copy --from-repo sftp:nas:/share/MD0_DATA/restic --password-file /etc/resticprofile/password --from-password-file /etc/resticprofile/password -o sftp.connections=2

repository e7ad2153 opened (version 2, compression level auto)
repository 477927a2 opened (version 2, compression level auto)
[0:03] 100.00%  25 / 25 index files loaded
[0:00] 100.00%  1 / 1 index files loaded
snapshot ff638e9b of [/] at 2025-09-20 01:35:29.475577241 +0200 CEST by root@vdr
   copy started, this may take a while…
subprocess ssh: Corrupted MAC on input.
subprocess ssh: ssh_dispatch_run_fatal: Connection to 192.168.178.52 port 2222: message authentication code incorrect
Load(<data/f347aa8ad6>, 17411070, 0) returned error, retrying after 900.395869ms: connection lost
Load(<data/e3f4f619c9>, 18163988, 0) failed: ssh command exited: exit status 255
[10:13] 22.02%  321 / 1458 packs copied
could not load snapshots: context canceled
Remove(<lock/988e3ebbbb>) failed: ssh command exited: exit status 255
error while unlocking: ssh command exited: exit status 255
Fatal: StreamPack: ReadFull(<data/e3f4f619c9>): circuit breaker open for file <data/e3f4f619c9>

Any more ideas what I could test?

What I have additionally done:

Scrubbed the RAID

smartctl -t long on each disk involved

RAM Tests on the server and the NAS

All passed without errors.

The common thread here sounds like it is the NAS, or perhaps more specifically sftp on the NAS. If I’m reading this right:

  • You were able to perform a backup (from your local machine?) to the NAS accessing the repository over NFS.
  • You were unable to copy the repository from the NAS to another server using regular SFTP.
  • You were unable to copy the repository from the NAS to multiple other systems (your local machine? and at least one other server) using restic copy, with the repository on the NAS being accessed over SFTP.

If you switch back to serving the repository over NFS, do the issues persist, or do they disappear?
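
For example, a rough sketch based on your copy command above (the NFS export path and the mount point are just guesses on my part):

mkdir -p /mnt/nas
mount -t nfs nas:/share/MD0_DATA /mnt/nas
restic -r restic/ copy --from-repo /mnt/nas/restic \
  --password-file /etc/resticprofile/password \
  --from-password-file /etc/resticprofile/password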

If they disappear, for me that seems to point pretty conclusively towards the sftp implementation on the NAS. As it’s a QNAP, and not regular linux, maybe there’s something funny going on there.
If they stick around, it points more towards the NAS in general, but you’ve already tested the memory and the disks, which would be the usual suspects. I suppose it could be worth testing the CPU with prime95?

Something else that might be worth considering: if the NAS supports running Docker containers, you could stand up a restic rest-server container and try accessing the repository that way (as an alternative to accessing it via NFS).
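
Roughly something like the below, assuming the NAS can run the official restic/rest-server image (check the rest-server README for the exact options; --no-auth disables authentication, so only use it on a trusted LAN):

docker run -d --name rest-server \
  -p 8000:8000 \
  -v /share/MD0_DATA/restic:/data \
  -e OPTIONS="--no-auth" \
  restic/rest-server

# then, from the machine running restic:
restic -r rest:http://nas:8000/ snapshots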

Searching google suggests you’re not the only one to have this problem with QNAP. There’s a QNAP FAQ that might apply here, depending on the versions involved:

And this forum post sounds almost identical to your symptoms:
https://forum.qnap.com/viewtopic.php?t=175464

tl;dr: QNAP recommend a specific MAC, hmac-sha2-256; the forum post recommended hmac-sha2-512.

You mentioned trying numerous cipher/mac options already, but if you’ve not tried setting one of those specifically, I think it is worth a try.
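
For a one-off test, something along these lines should do it (assuming "nas" is an ssh host alias on the machine running restic; the port matches the one from your error output):

# ~/.ssh/config on the machine running restic
Host nas
    HostName 192.168.178.52
    Port 2222
    MACs hmac-sha2-256

# or force it just for restic via the sftp command:
restic -r sftp:nas:/share/MD0_DATA/restic snapshots \
  -o sftp.command="ssh -p 2222 -m hmac-sha2-256 nas -s sftp" \
  --password-file /etc/resticprofile/password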

Thanks for your reply. Just for the record: I got corrupted packs when copying from NFS. I am currently trying to get a healthy local copy on my server by copying from NFS/SFTP, repairing, and copying again. Moreover, I have ordered a 4-bay DAS to be connected to the server; I hope that will solve the problem. Btw., the QNAP NAS is a TS-412, which is really old (~2013). Nevertheless, it is a weird bug imho.

Ah okay, that would rule out it being something specific to SFTP then. Were the error messages you saw when copying from NFS the same as the ones seen when copying from SFTP?

One thought: are all these restic backup/copy jobs being run on the same system? As in, is restic running on your local PC, backing up to and copying from the NAS or another server? Or were the copy jobs being run on the NAS directly?
If the restic process was running on the local system, checking over the local system’s hardware (specifically memory) would be worthwhile IMO.

I encountered such errors when copying from NFS to another remote SFTP server:

pack a0267f0149961d29983e87352bab36588d8ec9d47a839da1fe159806a3f6b182 contains 2 errors: [blob 6fe5e732be77cf7f25fd8e2299017fa2881598efd3e7b461e53a1c825a5ffb0c: decrypting blob <data/6fe5e732> from a0267f01 failed: ciphertext verification failed unexpected pack id 8e16cdeb0907455f41578499c93101a95b4e37deef10447476f572eec27a8b00]

Hence, I switched from NFS to SFTP, where I encountered the SFTP-specific errors described above.

Setup is:

NAS (restic via NFS or SFTP) <-> Server running restic (NUC) <-> Remote SFTP restic (NUC over internet)

I have checked the RAM of the local NUC using a memtest86 boot ISO. No errors.

Hm, as you only saw the ciphertext verification errors with NFS, and not SFTP, I think these might be separate issues. The SFTP errors seem to be to do with the connection being terminated by the server (NAS), rather than with data corruption.

Were you able to get a healthy copy of the repository by copying/repairing/copying repeatedly to the local system?

Unfortunately not yet. Weirdly enough, the check --read-data / repair / copy cycles produced more defective packs over time. I am currently running smartctl -t long on the disk on the local NUC that the repository resides on. I have not done that before, as it is a USB disk, which did not report SMART values with the uas kernel module. However, I found a way to make it work with the usb-storage module (pass usb_storage.quirks=0bc2:2321:u to the kernel).
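
For reference, roughly how I made that parameter stick (assuming a GRUB-based distro; 0bc2:2321 is the vendor:product ID of the USB bridge as reported by lsusb):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet usb_storage.quirks=0bc2:2321:u"

# regenerate the grub config and reboot
sudo update-grub

# or, if the module allows it, set it at runtime before plugging in the disk:
echo 0bc2:2321:u | sudo tee /sys/module/usb_storage/parameters/quirks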

Currently, I have lost trust in the integrity of both the NAS and the local NUC. I might try to copy to a totally different system (i.e. my laptop) to investigate further (and maybe get a healthy copy). I need to check for enough space, as the repository is about 540G.

Something that might be worth investigating is whether it is the same data pack files showing up as corrupted each time (e.g. in your previous output it was a0267f0149961d29983e87352bab36588d8ec9d47a839da1fe159806a3f6b182), or whether it is new ones every time.

If new ones, that implies something transient, like a memory/cpu issue. If static, that points more towards storage/disks.

You can also verify the integrity of the pack files in the repository yourself. The SHA-256 checksum should match the file name. Example below (the repository is at /tmp/repo1):

❯ pwd
/tmp/repo1/data/2f
❯ ls
2f85e823ab7ad4274a94c2d350c6186fc16910e496d62cf9c92a04933ac052bf
❯ sha256sum 2f85e823ab7ad4274a94c2d350c6186fc16910e496d62cf9c92a04933ac052bf
2f85e823ab7ad4274a94c2d350c6186fc16910e496d62cf9c92a04933ac052bf  2f85e823ab7ad4274a94c2d350c6186fc16910e496d62cf9c92a04933ac052bf
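
If you want to go through all of the pack files in one go, a rough sketch like this should work (run it from the repository root; reading everything in a ~540G repository will take a while):

cd /tmp/repo1
find data -type f | while read -r f; do
  sum=$(sha256sum "$f" | cut -d' ' -f1)
  [ "$sum" = "$(basename "$f")" ] || echo "MISMATCH: $f"
done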

I have actually checked this myself already, and at least some new ones were detected as being corrupted. I have not checked whether the packs detected in the previous run were also present in the second run (which I have now stopped in favour of checking the disk first). Thanks for showing me how to manually check the integrity.

Ok, I have tested some more:

The USB disk is fine.

I have now mounted the repo on the local NUC via NFS on my laptop, initialised a new repo locally on the laptop, and did a copy from the NUC repo via NFS.

No errors during copying.

Moreover, I did a check --read-data on the local repo on the laptop.

No errors.

Hence, I have a healthy copy on the laptop, but also on the NUC.

Currently, I am checking the repo on the NUC via NFS using a restic process on the laptop. So far it looks ok.

I will also mount the NAS repo via NFS on the laptop and do the same.
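
For completeness, roughly the commands I am running on the laptop for that (the export path is the same share as in my copy command above; the mount point is arbitrary):

mkdir -p /mnt/nas
mount -t nfs nas:/share/MD0_DATA /mnt/nas
restic -r /mnt/nas/restic check --read-data --password-file /etc/resticprofile/password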

To me it looks like a problem located on the NUC and, as you suspected, with the restic process running on the NUC. That is, however, weird. I ran a memtest. Maybe I should run it for longer? I have 40G of RAM in the NUC; maybe that is a problem? Can this somehow be tracked down? I have now downloaded prime95 for Linux and will test the CPU.

Edit: If I run check --read-data on the repo on the NUC, mounted via NFS on the laptop, with the restic process running on the laptop, I again get ciphertext errors.

Glad to hear you’ve been able to narrow it down somewhat.

For memtest86+, I’d probably run it for anywhere up to 24h. If nothing shows up at that point, I’d assume memtest isn’t stressing the hardware in the right way to trigger the issue, or it isn’t actually the memory that’s the problem, and move on.

Prime95 is good as a general system stress test when running a torture test with the blend option, as it’ll stress both the CPU and memory, along with system thermals. Again, running it for a longer period of time means the likelihood of the issue being reproduced goes up.
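
If it helps, this is roughly how I'd invoke them on Linux (memtester is the userspace RAM tester available in most distro repos, mprime is the Linux build of Prime95; the size and iteration count are just examples):

# stress ~8 GiB of RAM for 4 passes (must fit in currently free memory)
sudo memtester 8G 4

# run the Prime95 torture test; pick the "Blend" option when prompted
./mprime -t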

There are other tools, I’m sure everyone has their favourites, but those are the ones I usually reach for.

Given how frequently you run into issues when running restic, hopefully you won't need to stress-test the hardware for long before it begins to show instability.

I guess this still fits the current theory that something to do with the NUC is the problem. If the NUC is reading and writing bad data, any operation the NUC does to a restic repository would be suspect, so it’s possible the repository on the NUC is actually corrupt on-disk, despite the NUC not detecting this. If so, you should be able to see that by verifying the checksum of the pack files that were reported in error yourself. Otherwise, that just leaves in-flight corruption/bitflips as the only explanation I can think of?

Ok, I will run both tests for a longer time. Could it also be a problem of a 'crowded' server? I use the server (NUC) for many applications, running Docker, a web server, databases, even a VDR server. Hence, I wonder why I never had problems with it before, just now with restic. Btw., I do not think that the repo on the NUC is really corrupt, as I have copied it via rsync/NFS to the laptop and it was working. There is something failing in the layer between restic and NFS.

Btw., checking the NAS repo via NFS using restic on the laptop has not shown any errors yet.

In my opinion, no. If the system is returning corrupted data, it's indicative of some sort of fault/failure. It shouldn't do that regardless of how busy it is; a computer that returns corrupted data when you load it heavily is not a properly functioning computer :slight_smile:

Good point. That implies some pretty bad things about the NUC though. After all, the repository files are simply being accessed via NFS, so restic never interacts with the NUC directly; all the operations go through the NFS client process on the laptop. So would other applications accessing files on the NUC via NFS also be served corrupted data periodically?

On the NUC, restic was locally accessing the files, not via NFS. Hm, it is all very weird. Maybe my PSU is dying.

Ok, while memtest still gave zero errors after hours of running, I decided to test the RAM modules. I had two installed: a 32GB and an 8GB. I replaced the 32GB with another 8GB, and voilà, no errors anymore. I have also tested with the 32GB only, and up until now this is also a stable setup. Hence, it was the mix of the two modules. Maybe that could be added to the docs as a first step when investigating RAM problems.

Great you were able to find a solution. Less great it involved having to pull a stick of memory out.

Mixing different memory modules seems to be one of those "your mileage may vary" things in computing. Most of the literature I've come across recommends against it, as does a cursory Google search. In my experience, RAM with the same specs almost always mixes fine*, whereas RAM with the same capacity but different latencies/timings usually mixes fine… Essentially, the more you change, the less likely things are to work perfectly.

*fine is a relative term here, some programs (like restic) are more sensitive to memory stability than others, as you’ve discovered :slight_smile:


I have now marked this as solved. I haven't encountered any errors since removing the 8GB module and leaving the 32GB module. It is definitely interesting that restic can reveal such errors, while they seem to stay hidden in other applications. Having said that, I have not seen any 'greenish' wrongly decoded video frames from VDR on the same machine since removing the second module either. Maybe that is a nice side effect. Nevertheless, I am glad I can keep restic, as I am really comfortable with the way it handles things. Many thanks for your support and kindness! It is such a rare thing nowadays on the internet. Let me know in case I need to do something to mark this thread as closed.


That may well be the case!

You are most welcome :slight_smile: You don’t need to do anything else, now you’ve marked the thread as solved.