Restic gets stuck (rarely)

Hi :wave:

As the title indicates, the restic process sometimes gets stuck.
The endpoint is S3 (MinIO) and the process sits in the SNl state (interruptible sleep), so it is waiting for “something”. I suspect it gets stuck when the S3 endpoint doesn’t respond correctly or quickly enough, but I couldn’t find any errors or exceptions on the MinIO side.

I’d like to provide some meaningful data, but I’m not sure what would be helpful, and I don’t want to prematurely spam GitHub issues. Here is the trace after I sent kill -3 (SIGQUIT) to the restic process; I couldn’t paste it inline due to the character limit.

Random idea: would it be possible for the process to respond to a signal like USR1 and produce something more useful :thinking:?
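To illustrate the idea (just a hypothetical sketch, not restic code): a Go program can install its own handler for SIGUSR1 and dump all goroutine stacks via runtime/pprof, without exiting the way the default SIGQUIT handling does.

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// installStackDumper writes all goroutine stacks to stderr whenever the
// process receives SIGUSR1, without terminating it (unlike SIGQUIT).
func installStackDumper() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		for range ch {
			// debug=2 prints full stacks, in the same format as a SIGQUIT dump.
			_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
}

func main() {
	installStackDumper()
	select {} // stand-in for the real work (backup, prune, ...)
}
```

The advantage over kill -3 would be that the process keeps running, so you could grab a dump and still let the operation continue (or keep hanging for further inspection).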

Restic version: restic 0.15.1 (v0.15.1-5-g590eb9efd) compiled with go1.19 on linux/amd64

Any help/theory is appreciated :metal:

The culprit seems to be this goroutine:

goroutine 86 [select, 407 minutes]:
runtime.gopark(0xc0009a4330?, 0x6?, 0xf0?, 0x3f?, 0xc0009a4144?)
        runtime/proc.go:363 +0xd6 fp=0xc0009a3fa8 sp=0xc0009a3f88 pc=0x438896
runtime.selectgo(0xc0009a4330, 0xc0009a4138, 0x110ac1f?, 0x0, 0xc0009a4120?, 0x1)
        runtime/select.go:328 +0x7bc fp=0xc0009a40e8 sp=0xc0009a3fa8 pc=0x447cdc
net/http.(*persistConn).roundTrip(0xc00068a240, 0xc0006d5680)
        net/http/transport.go:2620 +0x974 fp=0xc0009a43a0 sp=0xc0009a40e8 pc=0x6dc614
net/http.(*Transport).roundTrip(0xc00026a280, 0xc000318800)
        net/http/transport.go:595 +0x7ba fp=0xc0009a45c8 sp=0xc0009a43a0 pc=0x6d017a
net/http.(*Transport).RoundTrip(0x0?, 0x0?)
        net/http/roundtrip.go:17 +0x19 fp=0xc0009a45e8 sp=0xc0009a45c8 pc=0x6c4259
github.com/restic/restic/internal/backend/limiter.staticLimiter.roundTripper({0x0?, 0x0?}, {0x12ce1c0?, 0xc00026a280?}, 0xc0008f6580?)
        github.com/restic/restic/internal/backend/limiter/static_limiter.go:79 +0x1ea fp=0xc0009a4680 sp=0xc0009a45e8 pc=0xd7ff4a
github.com/restic/restic/internal/backend/limiter.staticLimiter.Transport.func1(0x0?)
        github.com/restic/restic/internal/backend/limiter/static_limiter.go:94 +0x2f fp=0xc0009a46b8 sp=0xc0009a4680 pc=0xd8026f
github.com/restic/restic/internal/backend/limiter.roundTripper.RoundTrip(0x203000?, 0x12cee20?)
        github.com/restic/restic/internal/backend/limiter/static_limiter.go:63 +0x1f fp=0xc0009a46d0 sp=0xc0009a46b8 pc=0xd7fd1f
net/http.send(0xc000318800, {0x12cee20, 0xc000119740}, {0x10d29c0?, 0x1?, 0x0?})
        net/http/client.go:251 +0x5f7 fp=0xc0009a48c8 sp=0xc0009a46d0 pc=0x699217
net/http.(*Client).send(0xc000119950, 0xc000318800, {0x0?, 0x0?, 0x0?})
        net/http/client.go:175 +0x9b fp=0xc0009a4940 sp=0xc0009a48c8 pc=0x698a9b
net/http.(*Client).do(0xc000119950, 0xc000318800)
        net/http/client.go:715 +0x8fc fp=0xc0009a4b30 sp=0xc0009a4940 pc=0x69ae1c
net/http.(*Client).Do(...)
        net/http/client.go:581
github.com/minio/minio-go/v7.(*Client).do(0xc000328b00, 0xc00004a041?)
        github.com/minio/minio-go/v7@v7.0.47/api.go:504 +0xb3 fp=0xc0009a4c98 sp=0xc0009a4b30 pc=0x9554f3
github.com/minio/minio-go/v7.(*Client).executeMethod(0xc000328b00, {0x12d66d0, 0xc00030fe00}, {0x10fb041, 0x3}, {0x0, {0xc00004a041, 0x5}, {0x0, 0x0}, ...})
        github.com/minio/minio-go/v7@v7.0.47/api.go:620 +0x952 fp=0xc0009a5388 sp=0xc0009a4c98 pc=0x956272
github.com/minio/minio-go/v7.(*Client).listObjectsV2Query(_, {_, _}, {_, _}, {_, _}, {_, _}, 0x1, ...)
        github.com/minio/minio-go/v7@v7.0.47/api-list.go:221 +0x77f fp=0xc0009a5848 sp=0xc0009a5388 pc=0x928c5f
github.com/minio/minio-go/v7.(*Client).listObjectsV2.func2(0xc0001153e0)
        github.com/minio/minio-go/v7@v7.0.47/api-list.go:105 +0x22c fp=0xc0009a5fc8 sp=0xc0009a5848 pc=0x927d2c
github.com/minio/minio-go/v7.(*Client).listObjectsV2.func5()
        github.com/minio/minio-go/v7@v7.0.47/api-list.go:156 +0x2a fp=0xc0009a5fe0 sp=0xc0009a5fc8 pc=0x927aca
runtime.goexit()
        runtime/asm_amd64.s:1594 +0x1 fp=0xc0009a5fe8 sp=0xc0009a5fe0 pc=0x466e21
created by github.com/minio/minio-go/v7.(*Client).listObjectsV2
        github.com/minio/minio-go/v7@v7.0.47/api-list.go:99 +0x519

That’s likely caused by a stuck network connection. We really need some timeouts there…
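For context, here is a rough sketch of the kind of client-side timeouts that would turn such a hang into an error instead (plain net/http, illustrative values; not what restic or minio-go currently configure):

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// newTimeoutTransport bounds the dial, TLS handshake and response-header
// phases, so a silently dead connection fails with an error instead of
// blocking a request forever. Values are illustrative only.
func newTimeoutTransport() *http.Transport {
	return &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second, // connect timeout
			KeepAlive: 30 * time.Second, // TCP keep-alive probes
		}).DialContext,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 60 * time.Second, // server must start responding within this
		IdleConnTimeout:       90 * time.Second,
	}
}

func main() {
	// Hypothetical usage: hand the transport to the HTTP client the S3 SDK uses.
	client := &http.Client{Transport: newTimeoutTransport()}
	_ = client
}
```

ResponseHeaderTimeout in particular would bound the phase the goroutine above is parked in (persistConn.roundTrip waiting for the response headers); it does not limit how long reading the response body may take, so that would remain a separate concern.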

Thanks for checking. I assume this issue already covers it :+1: