403/Authentication Errors with large data cross region backup and restore in Azure

I have an issue that we’ve been struggling with for a few months now, have even been working with Microsoft support, with no root cause or workarounds yet.

Restic version is 0.13.1, but we have also tried using 0.14.0, which did update to the latest Azure SDK library versions.

The dataset is 10 TB, spread equally across 10 volumes, in a Kubernetes environment. If I back up or restore within the same region, it takes a while but we don’t run into any issues - I can do this repeatedly without problems.

However, we have use cases for cross-region backup and/or restore, and this is where we start to run into problems. It seems to work for smaller data sets - say 1 TB, possibly more; I have not yet determined where it starts to fail. At 10 TB, however, it is never successful.

Focusing on the restore, since I run into problems more frequently and more quickly there: it starts with a number of read: connection timed out errors, like this:

Load(<data/742be51b30>, 5233819, 0) returned error, retrying after 370.757544ms: read tcp xx.xx.xx.xx:41138->xx.xx.xx.xx:443: read: connection timed out

which is inevitably followed by the 403 error:

ignoring error for /datavol2/mf1/standard-medium-files.0.118: storage: service returned error: StatusCode=403, ErrorCode=AuthenticationFailed, ErrorMessage=Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:ca3cdcdd-801e-0086-4332-35644e000000
Time:2023-01-31T05:12:31.8607302Z, RequestInitiated=Tue, 31 Jan 2023 05:12:31 GMT, RequestId=ca3cdcdd-801e-0086-4332-35644e000000, API Version=, QueryParameterName=, QueryParameterValue=

This can happen anywhere from 10 minutes into the restore to some number of hours in. I do see this on backup as well, but less frequently.

A few pertinent details:

  • We wrap restic in an application that has basic logic for starting a backup/restore, checking progress, and determining success/failure. We start a Kubernetes job for each volume, so in my case we start 10 Kubernetes jobs (which in turn start 10 k8s pods), each running restic against a specific volume.

  • Our application checks the final output from restic and determines whether it was successful (no errors); in the case of errors, it fails the k8s job and starts another one, which picks up where the previous one stopped. The final output looks like this (with a varying number of errors, not always the same):

    Fatal: There were 18 errors

  • I have played around with the number of backend connections (to the Azure storage account): the default of 5, then 1, then 10. This makes no difference in whether or not I encounter these errors.

  • We have turned on detailed debugging logs for restic and there’s an interesting pattern. I observe that requests with a specific id, let’s say 1111, are moving along fine for many minutes - these are multi-part request/responses. Then there’s a 15-17 minute gap, followed by the 403/auth error. We cannot explain this one and have been getting packet trace information (and sent that to MSFT support). Complete mystery on this one at the moment.

  • We have confirmed both through documentation, as well as discussion with MSFT support, that they do invalidate an auth header after 15 minutes for security purposes. So the fact that we’re getting a 403/auth error makes sense. We understand why we’re getting them, but not the 15 minute gap where requests seem to go to a black hole. Again, this ONLY happens when backing up or restoring across regions, never in the same region.

Wondering if anyone else has encountered this or might have any ideas on how to proceed with debugging.

Thanks,
Kevin

restic 0.15.0 has switched to a different Azure SDK implementation. Does that maybe help?

I’m not sure what causes the stuck connections (likely something at the TCP level?), but a 15-minute delay exceeds the maximum timespan during which failed requests are retried. That is why the files aren’t restored properly.

I guess the files which cannot be loaded are different every time? Do the failed requests show up from time to time, or in bulk?

Edit: It would be possible to patch the retry code to increase the retry timespan, but that won’t fix the underlying problem… (Add bo.MaxElapsedTime = 2 * bo.MaxElapsedTime after restic/backend_retry.go at 590eb9efd7c0bec491e496ea4c3e82231d8c208d · restic/restic · GitHub)

Thanks Michael. Good to know about 0.15.0 switching to a different Azure SDK implementation - it could help, and we will likely give it a shot.

As for the 15-minute delay, one thing we noticed is that the requests were being sent without any context deadline. We have since modified the request calls to implement a 1-minute timeout. This has helped significantly, although we notice a large number of requests that hit the timeout, get cancelled, and are retried. We’re going to try a 5-minute timeout instead.

In terms of your question about whether requests show up at various times or in bulk: it’s completely random - it’s not the case that some event occurs and all instances time out simultaneously.

For your last comment about the retry time, I think we’ve fixed this by setting a context deadline on the requests, where previously they would wait forever (and hit the 15-minute timeout on the Azure storage API side).

Yes, currently restic (unfortunately) doesn’t set proper timeouts for its requests. We definitely have to fix that, although I’ll have to look into how to do it such that slow network connections are still supported.

According to TCP and gRPC Failed Connection Timeouts (evanjones.ca), the 15-minute delay is the result of a connection dropped at some point after the initial request was sent.


Thanks for that link, definitely informative.

We have implemented a hard-coded timeout in restic for the moment; a real fix would be to make that timeout tunable. We have forked our own version of restic to do some tuning in other areas specific to ONTAP backend storage. However, this timeout fix is not something we’d want to own, and we would like to contribute it back if possible.

BTW, the timeout we initially set was 1 minute, and while that allowed the restore to complete, it did seem too aggressive. We are testing with a 5-minute timeout now (until we have the ability to tune the timeout when running restic).