TCP reset during restore with Swift backend

Hello,

I’m running into TCP resets while restoring snapshots from a Swift backend:

# restic restore 5659b232 --target /data/restore
restoring <Snapshot 5659b232 of [/data/backup/instance/20230825_000015] at 2023-08-26 06:01:28.789539585 +0000 UTC by root@host> to /data/restore
...
ignoring error for /data/backup/instance/20230825_000015/20_incr_20230825_220023/drive_147693/FTS_0000000000354c56_00000000008a9f52_INDEX_2.ibd.meta: StreamPack: read tcp <restore-host-IP>:9056-><swift-proxy-IP>:443: read: connection reset by peer
...
ignoring error for /data/backup/instance/20230825_000015/20_incr_20230825_220023/drive_147693/FTS_0000000000354c56_00000000008a9f52_INDEX_2.ibd.meta: UtimesNano: no such file or directory
...
Summary: Restored 18857879 / 18858339 Files (543.684 GiB / 543.684 GiB) in 1:28:13
Fatal: There were 7303282 errors

I’ve only shown one file here, but many more are affected.

As expected from the errors, the file is missing from the filesystem, but it can be dumped without issues:

# restic dump 5659b232 /data/backup/instance/20230825_000015/20_incr_20230825_220023/drive_147693/FTS_0000000000354c56_00000000008a9f52_INDEX_2.ibd.meta
repository cce22958 opened (version 2, compression level auto)
page_size = 16384
zip_size = 0
space_id = 3492942

It’s also possible to restore the file individually with --include:

# restic restore 5659b232 --target /data/restore --include /data/backup/instance/20230825_000015/20_incr_20230825_220023/drive_147693/FTS_0000000000354c56_00000000008a9f52_INDEX_2.ibd.meta
repository cce22958 opened (version 2, compression level auto)
restoring <Snapshot 5659b232 of [/data/backup/instance/20230825_000015] at 2023-08-26 06:01:28.789539585 +0000 UTC by root@host> to /data/restore
Summary: Restored 7 / 1 Files (51 B / 51 B) in 0:00

After which the file is available on the filesystem:

# cat /data/restore/data/backup/instance/20230825_000015/20_incr_20230825_220023/drive_147693/FTS_0000000000354c56_00000000008a9f52_INDEX_2.ibd.meta
page_size = 16384
zip_size = 0
space_id = 3492942

I ran a check just to make sure:

# restic check --read-data
using temporary cache in /tmp/restic-check-cache-1165536201
repository cce22958 opened (version 2, compression level auto)
created new cache in /tmp/restic-check-cache-1165536201
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
[4:34] 100.00%  5 / 5 snapshots
read all data
[46:23] 100.00%  6727 / 6727 packs
no errors were found

The issue is reproducible: it has happened multiple times with this snapshot, and also with snapshots from other instances. However, not all snapshots and not all instances are affected.

Obviously nothing here points towards a restic issue, and I’ve already reached out to my Swift provider to investigate. Still, I would love to know whether tunables like pack size or the number of connections could help, or whether there is a good way to retry: running another restore usually returns the same errors, and parsing stderr for the missing files to restore isn’t convenient.
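
To make the retry idea concrete, this is roughly the script I have in mind (a sketch I haven’t validated end to end: the sed pattern assumes the exact "ignoring error for ...: StreamPack: ..." wording shown above, swift.connections is the option name I found for limiting concurrent connections and should be double-checked against the docs, and the xargs flags assume GNU xargs):

#!/bin/sh
# capture the restore errors, then retry only the files that failed
restic restore 5659b232 --target /data/restore 2> restore-errors.log

# pull the affected paths out of the error messages; each file can log
# several errors, hence sort -u
sed -n 's/^ignoring error for \(.*\): StreamPack:.*/\1/p' restore-errors.log \
    | sort -u > missing-files.txt

# turn each path into an --include flag and retry in batches
# (swift.connections=2 is only an experiment to reduce parallel streams)
sed 's/^/--include=/' missing-files.txt \
    | xargs -d '\n' -n 200 restic -o swift.connections=2 restore 5659b232 --target /data/restore

Lowering the connection count is only a guess that fewer parallel streams might keep each request shorter; I haven’t measured whether it actually changes anything.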

Version used:

restic 0.16.0 compiled with go1.20.6 on linux/amd64

Thanks a lot in advance for your help,

Have a nice day

Update:

The StreamPack: read tcp ... connection reset by peer errors start showing up exactly 60 minutes after the start of the restore. My hypothesis is that we are hitting a timeout on the Swift proxy side, but we haven’t been able to confirm it.
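
If it helps, the test I have in mind to confirm this is to keep a single HTTPS connection to the proxy busy for more than an hour, for example by downloading one large object very slowly (a sketch only: the token variable, URL and object name are placeholders, and the object has to be large enough that the rate limit stretches the transfer past 60 minutes):

# a ~200 MiB object at 50 KiB/s keeps one connection open for roughly 68 minutes;
# if the proxy drops long-lived connections, this should fail around the 60 minute mark
curl -sS -o /dev/null --limit-rate 50k \
    -H "X-Auth-Token: $OS_TOKEN" \
    -w 'http=%{http_code} time=%{time_total}s bytes=%{size_download}\n' \
    "https://<swift-proxy>/v1/<account>/<container>/<large-object>"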

The workaround I have found is to restore the snapshot in separate parts using the --include and --exclude flags, so that no single restore operation runs for more than one hour.
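
Concretely, the split looks roughly like this (a sketch: the parts list just repeats the example path from above, in practice I fill it with whatever subdirectories keep each restore comfortably under an hour):

#!/bin/bash
# restore the snapshot directory by directory instead of in one long operation
parts=(
    "/data/backup/instance/20230825_000015/20_incr_20230825_220023/drive_147693"
    # ... one entry per chunk that restores in well under an hour
)
for part in "${parts[@]}"; do
    restic restore 5659b232 --target /data/restore --include "$part"
done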

However, it seems to me that restic should be able to reconnect and recover after the TCP reset. Can anyone confirm that this is not currently the case?

Please let me know if you need more information, and whether I should file an issue on GitHub.

Thanks

That looks like the TCP connection got stuck for some reason. The problem is more or less already tracked in Set timeouts for backend connections · Issue #4193 · restic/restic · GitHub.
