Restic check - http2: client connection lost

Good morning,
I work in the public sector, and over the past two years, we’ve transitioned our backup concept to restic.

We back up virtual machines (VMs) daily across multiple locations to local NAS systems using rest-server (TLS, append-only, private repos). Inside the VMs, we also perform daily backups of all data to a separate repository. Additionally, we back up all VMs weekly to a central location. All repositories are checked daily using restic check—more on this later.

Everything runs fully automated. Repository passwords, certificates, and other login data for rest-server are securely stored and can only be decrypted and used by a specific user or scheduled task. We are very satisfied with this system. Restic is efficient, fast, and meets our security requirements. More locations will be added soon. Therefore, I’d first like to say THANK YOU—thank you for this amazing project and your dedication!

Now to our issue:
We perform daily repository checks with restic check. Sporadically, regardless of the location, we encounter the following errors:

Example:

restic.exe : Load(<data/d1afd7ff32>, 18755454, 0) returned error, retrying after 1.054762386s: 
readFull: http2: client connection lost
In C:\XXXX\XXXX\restic_check.ps1:40 Zeichen:1
+ & C:\restic\restic.exe check `
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Load(<data/d1af...connection lost:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
Load(<data/119dc143ea>, 18007490, 0) returned error, retrying after 1.379831123s: readFull: http2: 
client connection lost
Load(<data/51aa92ab30>, 17019660, 0) returned error, retrying after 821.278225ms: readFull: http2: 
client connection lost
Load(<data/51aa92ab30>, 17019660, 0) returned error, retrying after 821.278225ms: readFull: http2: 
client connection lost
Load(<data/71460607f1>, 17227144, 0) returned error, retrying after 1.142240439s: readFull: http2: 
client connection lost
Load(<data/51aa92ab30>, 17019660, 0) returned error, retrying after 821.278225ms: readFull: http2: 
client connection lost
Load(<data/71460607f1>, 17227144, 0) returned error, retrying after 1.142240439s: readFull: http2: 
client connection lost
Load(<data/d1279e34b3>, 17598027, 0) returned error, retrying after 1.273035571s: readFull: http2: 
client connection lost
Load(<data/51aa92ab30>, 17019660, 0) operation successful after 1 retries
Load(<data/d1afd7ff32>, 18755454, 0) operation successful after 1 retries
Load(<data/71460607f1>, 17227144, 0) operation successful after 1 retries
Load(<data/d1279e34b3>, 17598027, 0) operation successful after 1 retries
Load(<data/119dc143ea>, 18007490, 0) operation successful after 1 retries

These “http2: client connection lost” errors occur exclusively during restic check.

In some cases, the process ends successfully with “operation successful after x retries”. However, in many cases, the process is aborted by restic.

Here’s an excerpt of our check script:

& C:\XXX\restic.exe version
& C:\XXX\restic.exe check `
    --retry-lock 1h `
    --cleanup-cache `
    --stuck-request-timeout 10m `
    --read-data-subset=10% `
    2> C:\XXX\XXX\check_error_log_${BackupSystem}.txt 

if ($LASTEXITCODE -ne 0) {
    Send-CheckErrorMessage -BackupSystem $BackupSystem
}

We added the --stuck-request-timeout 10m flag, but it didn’t improve the situation. Similarly, using -o http2.timeout=10m didn’t help either.

Where do these errors come from? Why do they only occur during check, specifically during read operations?

We are using restic 0.17.3 and rest-server 0.11.

Thank you in advance!

Thanks for the kind words!

When you are seeing the errors on the client side, are there any corresponding errors logged by rest-server?

Have you been able to see if you can reproduce the problem with rest-server 0.13 as well?

http2: client connection lost is triggered when the HTTP/2 connection is stuck for more than 60 seconds. This timeout is currently not configurable, and there’s no -o http2.timeout option.

For restic to abort the check entirely, the rest-server connection has to be problematic for more than 15 minutes (restic retries failed requests for up to that long). I guess the error is always the same?

check reads a lot of data, which might trigger the underlying connection issues. Is the rest-server maybe overloaded during that time?

rest-server 0.13 is built using a much newer go version than 0.11, which might solve the stuck http2 connections.


Thanks for your reply. No, so far we have only worked with version 0.11, as this version was available for QNAP NAS systems and worked flawlessly. Since we haven’t encountered any issues on the restic server side, we haven’t applied for a developer account with QNAP to create new QNAP packages (.qpkg). However, we will do so now.

Unlikely. We have adjusted the timings so that there are no parallel accesses locally. However, the errors actually occur almost exclusively on the local NAS systems.

Interestingly, the problem hardly occurs when checking the repositories on the central NAS, even though up to 10 clients access it in parallel.

I just realized that I can easily replace the rest-server binary and adjust the version number in /etc/conf/qpkg.conf. I will test it for a while on my test system and then try it at selected locations. I will provide feedback in about two weeks to give the local admins time to observe it. Thank you for the tip! :grinning:

So… I’m back. That was faster than expected. After installing rest-server 0.13 on the Test-NAS and running an initial restic check, I got the following log:

Please enter a command (or 'exit' to quit): check --read-data-subset 20%
using temporary cache in C:\XXX\XXX\restic-check-cache-2941783010
create exclusive lock for repository
repository ce89b5ba opened (version 2, compression level auto)
created new cache in C:\XXX\XXX\restic-check-cache-2941783010
load indexes
[0:02] 100.00%  101 / 101 index files loaded
check all packs
check snapshots, trees and blobs
[0:16] 100.00%  2 / 2 snapshots
read 20.0% of data packs
Load(<data/72dc89d390>, 17481343, 0) returned error, retrying after 904.341928ms: readFull: http2: client connection lost
Load(<data/3244f47e40>, 17735950, 0) returned error, retrying after 601.362286ms: readFull: http2: client connection lost
Load(<data/82a4c2f323>, 18054514, 0) returned error, retrying after 1.22829129s: readFull: http2: client connection lost
Load(<data/82a4c2f323>, 18054514, 0) returned error, retrying after 1.22829129s: readFull: http2: client connection lost
Load(<data/527fcc6e28>, 16787954, 0) returned error, retrying after 1.191554422s: readFull: http2: client connection lost
Load(<data/82a4c2f323>, 18054514, 0) returned error, retrying after 1.22829129s: readFull: http2: client connection lost
Load(<data/527fcc6e28>, 16787954, 0) returned error, retrying after 1.191554422s: readFull: http2: client connection lost
Load(<data/729f4a88c3>, 17678718, 0) returned error, retrying after 965.6919ms: readFull: http2: client connection lost
Load(<data/3244f47e40>, 17735950, 0) operation successful after 1 retries
Load(<data/72dc89d390>, 17481343, 0) operation successful after 1 retries
Load(<data/729f4a88c3>, 17678718, 0) operation successful after 1 retries
Load(<data/527fcc6e28>, 16787954, 0) operation successful after 1 retries
Load(<data/82a4c2f323>, 18054514, 0) operation successful after 1 retries
Load(<data/437916efa5>, 16844318, 0) returned error, retrying after 1.072920337s: readFull: http2: client connection lost
Load(<data/732e95c8b4>, 17943015, 0) returned error, retrying after 1.408926319s: readFull: http2: client connection lost
Load(<data/732e95c8b4>, 17943015, 0) returned error, retrying after 1.408926319s: readFull: http2: client connection lost
Load(<data/e3b65a4e40>, 17459052, 0) returned error, retrying after 675.92406ms: readFull: http2: client connection lost
Load(<data/732e95c8b4>, 17943015, 0) returned error, retrying after 1.408926319s: readFull: http2: client connection lost
Load(<data/e3b65a4e40>, 17459052, 0) returned error, retrying after 675.92406ms: readFull: http2: client connection lost
Load(<data/63399652f9>, 16885524, 0) returned error, retrying after 916.210263ms: readFull: http2: client connection lost
Load(<data/732e95c8b4>, 17943015, 0) returned error, retrying after 1.408926319s: readFull: http2: client connection lost
Load(<data/e3b65a4e40>, 17459052, 0) returned error, retrying after 675.92406ms: readFull: http2: client connection lost
Load(<data/63399652f9>, 16885524, 0) returned error, retrying after 916.210263ms: readFull: http2: client connection lost
Load(<data/f321c5811c>, 17508491, 0) returned error, retrying after 1.418534878s: client.Do: Get "https://XXX@XXX:XXX/XXX/data/f321c5811c07f4beacd05843582c1b1a6ac9382923a176a56fc833f9b7f80b62": http2: client connection lost
Load(<data/e3b65a4e40>, 17459052, 0) operation successful after 1 retries
Load(<data/63399652f9>, 16885524, 0) operation successful after 1 retries
Load(<data/437916efa5>, 16844318, 0) operation successful after 1 retries
Load(<data/f321c5811c>, 17508491, 0) operation successful after 1 retries
Load(<data/732e95c8b4>, 17943015, 0) operation successful after 1 retries
Load(<data/bc6b856540>, 18292333, 0) returned error, retrying after 813.234017ms: readFull: http2: client connection lost
Load(<data/bc6b856540>, 18292333, 0) returned error, retrying after 813.234017ms: readFull: http2: client connection lost
Load(<data/bcb52a3dda>, 17171649, 0) returned error, retrying after 1.015584362s: readFull: http2: client connection lost
Load(<data/bcb52a3dda>, 17171649, 0) returned error, retrying after 1.015584362s: readFull: http2: client connection lost
Load(<data/8ccb96f915>, 16941409, 0) returned error, retrying after 1.28574862s: readFull: http2: client connection lost
Load(<data/bcb52a3dda>, 17171649, 0) returned error, retrying after 1.015584362s: readFull: http2: client connection lost
Load(<data/8ccb96f915>, 16941409, 0) returned error, retrying after 1.28574862s: readFull: http2: client connection lost
Load(<data/6c20136b78>, 17261533, 0) returned error, retrying after 918.101715ms: readFull: http2: client connection lost
Save(<lock/7a742ae990>) returned error, retrying after 607.954174ms: Post "https://XXX@XXX:XXX/XXX/locks/7a742ae990c500dfdbbbf98e8bca85ba9c9a8f757218b079f66342ebaa932ae6": http2: client connection lost
Save(<lock/7a742ae990>) operation successful after 1 retries
Load(<data/bc6b856540>, 18292333, 0) operation successful after 1 retries
Load(<data/6c20136b78>, 17261533, 0) operation successful after 1 retries
Load(<data/bcb52a3dda>, 17171649, 0) operation successful after 1 retries
Load(<data/8ccb96f915>, 16941409, 0) operation successful after 1 retries
Load(<data/dc659f3cbb>, 18984482, 0) operation successful after 1 retries
[43:29] 100.00%  14820 / 14820 packs
no errors were found

I don’t think this is a software issue in restic; that seems very unlikely. I would start by inspecting and debugging the network connectivity. This could be hardware related, including cabling or even interference. But start with verifying network connectivity and stability.

Could it be related to an old and well-documented HTTP/2 issue in the Go implementation? Here is a good summary:

It is a well-known problem in other projects using Go. For example, in rclone, disabling HTTP/2 often results in noticeable performance improvements. It has even been suggested, and is being discussed, that HTTP/2 be disabled by default.

Is it possible that we are seeing a manifestation of the same problem here? It would be easy to validate by disabling HTTP/2 and having the OP run the test again.


Why would the problem be isolated to this specific NAS though, and not the other ones? :thinking: Then again, they wrote “hardly”, so it isn’t entirely absent on the other ones either. Perhaps you are on to something @kapitainsky.

It is not easy to replicate this problem reliably in the wild. The bug is triggered only when the underlying TCP connection is broken, which is probably rare in a LAN but much more likely when a WAN link is involved.

But when it is happening, it is easy to narrow it down to HTTP/2 issues: run the same operation with and without HTTP/2 support enabled. This is what many rclone users have reported for years. Very often, disabling HTTP/2 makes things work smoothly again.

Maybe it would be beneficial to introduce an option to disable HTTP/2 in restic?


You can set the environment variable GODEBUG=http2client=0 on the client side, or GODEBUG=http2server=0 on the server side, to disable HTTP/2.


Summary:

The issue frequently occurs during the check of the local NAS at different locations, and more rarely on the central NAS.

All locations have the same local NAS model:
QNAP TS-435XeU

Architecture:
    NIC: Marvell CN913X
    CPU: Quad-Core ARMv8 Cortex-A72 CPU@2.2 GHz

Used versions:

rest-server 0.13 (ARM64)
rest-server 0.11 (ARM64)

Central NAS model:
QNAP 873AeU

    CPU: AMD Ryzen Embedded V1500B Quad-Core @ 2.2 GHz
    NIC: Intel(R) Gigabit Ethernet Network Driver (from dmesg | grep -i "eth")

Used version:

rest-server 0.13 (AMD64)

I tried it without HTTP/2. Here’s the result:
I found a location where this error seems to be reproducible:

PS C:\XXXX\XXXX> $env:GODEBUG
http2client=0
PS C:\XXXX\XXXX> C:\XXX\XXX\restic_console.ps1

This script starts an interactive console with the repository.
Please enter the repository name, OEXYZ or OEXYZ-VM: XXX
Please enter the IP address of the NAS: XXXXXXXXXXXX
Please enter the repository password: ********************
Searching for the certificate...
Establishing connection: rest:https://XXX@XXXX:XXXXX/XXXX
Please enter a command (or 'exit' to quit): check --read-data-subset 20%

using temporary cache in C:\XXXX\XXX\restic-check-cache-2824933248
create exclusive lock for repository
repository 2ae267ec opened (version 2, compression level auto)
created new cache in C:\XXXX\XXX\restic-check-cache-2824933248
load indexes
[0:08] 100.00%  337 / 337 index files loaded
check all packs
check snapshots, trees and blobs
[0:48] 100.00%  13 / 13 snapshots
read 20.0% of data packs
[1:38:01] 100.00%  13395 / 13395 packs
no errors were found
Please enter a command (or 'exit' to quit): version
restic 0.17.3 compiled with go1.23.3 on windows/amd64
Please enter a command (or 'exit' to quit): exit

The check took 1 hour and 38 minutes, with 2 interruptions of over 30 minutes each.
During this time, the transfer rate dropped to 0 MBit/s.

You mean that without HTTP/2 there are no errors?

No errors are logged, but the two interruptions still point to the same issue… in my opinion.

A stuck connection is exactly what triggers the http2: client connection lost error. However, restic’s stuck-connection detection should also apply to HTTP/1.1 connections. So maybe the connection is not stuck, but only very slow: 128 kB every 5 minutes is sufficient to consider a connection alive. Either way, that sounds like a serious network issue.