"context canceled" errors on current master

Hi :wave:

Recently I have been randomly getting “context canceled” errors on big repositories. Here is one example:

repository 9cdb309a opened (version 2, compression level auto)
loading indexes...
[1:11] 100.00%  1434 / 1434 index files loaded
loading all snapshots...
finding data that is still in use for 882 snapshots
[0:06] 100.00%  882 / 882 snapshots
searching used packs...
collecting packs for deletion and repacking
List(data) returned error, retrying after 460.44061ms: Get "http://localhost:8888/xx/?continuation-token=ZGF0YS8wOS8wOTM1Mzg2MjI2MTcwYjg1NGM2ZGEzZmE1ZjNkODRjNGIzZGQ5YzNjZGNiNzlhNzc1ZmMxNmI3MDQxZjExMzNjW21pbmlvX2NhY2hlOnYyLGlkOjMxYTlmODkwLWZjYTQtNGEwNC1hOGQyLWNkYjUzZDRjYjQ5MCxwOjAsczowXQ%3D%3D&delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=data%2F": context canceled
[38:02] 19.47%  135000 / 693235 packs processed
Get "http://localhost:8888/xx/?continuation-token=ZGF0YS8zMS8zMTk4M2MzNGI1ZmIzZDI1MjY5OTQyNGVhNWIzZDhiMjYyODQxYmY1NmUxMmIwODFjNjgxOTdmYmQwYmQxZjRmW21pbmlvX2NhY2hlOnYyLGlkOjJiZDdmZWI3LTYwZTktNGU1MS1hNmMyLTBjYTE4N2U1MzFiZSxwOjAsczowXQ%3D%3D&delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=data%2F": context canceled

I suspected the new HTTP timeouts and ran the same command with RESTIC_FEATURES=http-timeouts=false, and it seems the problem did not reproduce on the next run (not 100% sure whether it was this flag or I just got lucky, though).

Backend is s3/minio, restic version is:

restic 0.16.4-dev (compiled manually) compiled with go1.22.0 on linux/amd64
(revision 7ed560a20182a461b05d83eb6d7f5a7c5c837860)

The exact command is nothing unusual:

sudo -E /usr/local/bin/restic --cleanup-cache prune --max-unused 20% --max-repack-size 2G

Any tips to dig into for troubleshooting?

Yes, that looks like it is caused by “backend: enforce that backend HTTP requests make progress” (restic/restic#4792). That PR cancels requests that make no progress for 2 minutes. A single S3 List API call is supposed to return up to 1000 files per request, which definitely shouldn’t take that long.
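
For illustration, the mechanism is roughly the following. This is only a simplified Go sketch, not restic’s actual code (watchdogBody is an assumed name; the 2-minute value and the localhost:8888 endpoint are taken from this thread): the response body is wrapped so that the request context gets canceled once no data has arrived for the timeout.

package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// watchdogBody cancels the request's context when reads make no progress
// for longer than the configured timeout.
type watchdogBody struct {
	io.ReadCloser
	timeout time.Duration
	timer   *time.Timer
}

func newWatchdogBody(rc io.ReadCloser, timeout time.Duration, cancel context.CancelFunc) *watchdogBody {
	return &watchdogBody{
		ReadCloser: rc,
		timeout:    timeout,
		// When the timer fires, the request context is canceled and any
		// further read on the body fails with "context canceled".
		timer: time.AfterFunc(timeout, cancel),
	}
}

func (b *watchdogBody) Read(p []byte) (int, error) {
	n, err := b.ReadCloser.Read(p)
	if n > 0 {
		// Data arrived, so push the deadline out again.
		b.timer.Reset(b.timeout)
	}
	return n, err
}

func (b *watchdogBody) Close() error {
	b.timer.Stop()
	return b.ReadCloser.Close()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Example request against the same MinIO endpoint as in the logs above.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://localhost:8888/xx/?list-type=2&prefix=data%2F", nil)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	body := newWatchdogBody(resp.Body, 2*time.Minute, cancel)
	defer body.Close()

	// If the server stops sending data for 2 minutes, the context is
	// canceled and this read returns a "context canceled" error.
	data, err := io.ReadAll(body)
	fmt.Println("read", len(data), "bytes, err:", err)
}

Each successful read pushes the deadline out again, so only a connection that is truly stuck gets canceled.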

Based on the number of packs and the time (135000 packs in roughly 38 minutes), List only returned about 60 files per second. That is absurdly slow. Even at 700k files, with up to 1000 files per List request, that step shouldn’t take much more than a few minutes.

Do you have any numbers on how long the “collecting packs for deletion and repacking” step normally takes?

I’ve opened “backend: Improve timeout error message” (restic/restic#4844) to improve the error message and increase the timeout from 2 to 5 minutes. Btw, the feature flag is now called backend-error-redesign.

The failed List call was not retried because it had exceeded the hardcoded retry limit of 15 minutes. With “Rework backend retries” (restic/restic#4784) there is always at least one retry, which should have been enough to allow prune to complete.
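
For illustration, the behaviour change amounts to roughly this. A hedged Go sketch (retryAtLeastOnce is an assumed name, not the actual restic retry code): the total time budget only makes the loop give up after at least one retry has been attempted.

package main

import (
	"errors"
	"fmt"
	"time"
)

// retryAtLeastOnce calls fn until it succeeds. After a failure it only gives
// up once the total time budget is exhausted AND the call has already been
// retried at least once.
func retryAtLeastOnce(fn func() error, budget, backoff time.Duration) error {
	start := time.Now()
	for attempt := 0; ; attempt++ {
		err := fn()
		if err == nil {
			return nil
		}
		// Previously, exceeding the total budget meant no further attempt at
		// all; here the budget check only applies from the second attempt on,
		// so a single failure is always retried once.
		if attempt >= 1 && time.Since(start) > budget {
			return err
		}
		time.Sleep(backoff)
	}
}

func main() {
	calls := 0
	err := retryAtLeastOnce(func() error {
		calls++
		if calls == 1 {
			return errors.New("context canceled") // first List call fails
		}
		return nil // the guaranteed retry succeeds
	}, 15*time.Minute, 500*time.Millisecond)
	fmt.Println("calls:", calls, "err:", err)
}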

Sadly I don’t have exact timings; I only remember that step taking the most time. But the prune operation finishes in under 4 hours (even with a repo size of ~12TB). It’s a busy storage backend though, so it wouldn’t surprise me if it locks up for short periods.

The retries PR could indeed solve this, looks promising, thanks! It seems to be merged already, so I will test with current master.

One thing still bugs me though: I sometimes see multi-line errors, with different timings/tokens in the output. Is this one of the cases where reducing parallelism would help? For example:

2024-06-01 17:00:05,311 - restic_cleanup - DEBUG - Running command: /usr/local/bin/restic --quiet --cache-dir /data/backup/persistent_backups/restic_caches/xx --cleanup-cache prune --max-unused 20% --max-repack-size 100G
2024-06-01 17:18:58,537 - restic_cleanup - DEBUG - Return code: 130
2024-06-01 17:18:58,538 - restic_cleanup - WARNING - Command failed: /usr/local/bin/restic --quiet --cache-dir /data/backup/persistent_backups/restic_caches/xx --cleanup-cache prune --max-unused 20% --max-repack-size 100G
2024-06-01 17:18:58,538 - restic_cleanup - INFO - Error output:
List(data) returned error, retrying after 403.706985ms: Get "http://localhost:8888/xx/?continuation-token=ZGF0YS8wZi8wZjEzOGZkNzljMjI2YzAzZTk0NDEyNzc4YTk3YWZkZjhlYTUwOGU4ZTUwNThkYzQyYTg0NWM2MzcwNzIwY2U2W21pbmlvX2NhY2hlOnYyLGlkOjYzYTZiMjU0LTQ2ZDMtNDNmNi1iNTQzLTk1Zjc3NDQwOGRmYSxwOjAsczowXQ%3D%3D&delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=data%2F": context canceled
List(data) returned error, retrying after 851.042063ms: Get "http://localhost:8888/xx/?delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=data%2F": context canceled
List(data) returned error, retrying after 630.500438ms: Get "http://localhost:8888/xx/?continuation-token=ZGF0YS8wZi8wZjEzOGZkNzljMjI2YzAzZTk0NDEyNzc4YTk3YWZkZjhlYTUwOGU4ZTUwNThkYzQyYTg0NWM2MzcwNzIwY2U2W21pbmlvX2NhY2hlOnYyLGlkOmQwY2Y0NjRlLTk2OTctNGY0OC1iOWZiLWE2NmUxMmYyZTJiYyxwOjAsczowXQ%3D%3D&delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=data%2F": context canceled
List(data) returned error, retrying after 2.468405651s: Get "http://localhost:8888/xx/?continuation-token=ZGF0YS8wNy8wNzg1NTUyZGRkNjczNzY2MGY4ZWM0NmUzN2MxOWRlMGM4YmY0NzYyODhiN2FlMzZkNjA3MTgzZWNmMGYyMGY2W21pbmlvX2NhY2hlOnYyLHJldHVybjpd&delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=data%2F": context canceled
Get "http://localhost:8888/xx/?continuation-token=ZGF0YS8wZi8wZjEzOGZkNzljMjI2YzAzZTk0NDEyNzc4YTk3YWZkZjhlYTUwOGU4ZTUwNThkYzQyYTg0NWM2MzcwNzIwY2U2W21pbmlvX2NhY2hlOnYyLGlkOmYwYjA3MzlhLWIzYWYtNGEwNC1hMjVmLTQ0ZDY4N2Q0ZjkzZSxwOjAsczowXQ%3D%3D&delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=data%2F": context canceled

Hmm, the difference there is primarily that the List call timeouts started to occur within 15 minutes of the start of the “collecting packs for deletion and repacking” step. It looks like in your case the 2-minute timeout for stuck connections is currently too low (it will be increased to 5 minutes by the new PR). Reducing the load on the storage would likely also help; right now it appears to be really overloaded.

One follow-up on this: I applied the PR increasing the timeout to 5 minutes on top of current master. It helped a lot, I’m not seeing the errors anymore. Thanks!


One later addition: I saw this a few times again even with 5 minutes (only on prune operations). I’m testing with 15 minutes instead. It eventually finishes without this protection, so I’d say this could be a configurable value with a conservative default :thinking: