I have an issue that we’ve been struggling with for a few months now, have even been working with Microsoft support, with no root cause or workarounds yet.
Restic version is 0.13.1, but we have also tried 0.14.0, which updated to the latest Azure SDK library versions.
The dataset is 10TB, spread equally across 10 volumes, in a Kubernetes environment. If I back up or restore within the same region, it takes a while but we don’t run into any issues - I can do this repeatedly without problems.
However, we have use cases for cross-region backup and/or restore, and this is where we start to run into problems. It seems to work for smaller data sets - say, 1TB, possibly more; I have not yet determined where it starts to fail. At 10TB, it is never successful.
Focusing on the restore, since I run into problems more frequently and more quickly there, it starts with a number of "read: connection timed out" errors like this:
Load(<data/742be51b30>, 5233819, 0) returned error, retrying after 370.757544ms: read tcp xx.xx.xx.xx:41138->xx.xx.xx.xx:443: read: connection timed out
which is inevitably followed by the 403 error:
ignoring error for /datavol2/mf1/standard-medium-files.0.118: storage: service returned error: StatusCode=403, ErrorCode=AuthenticationFailed, ErrorMessage=Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:ca3cdcdd-801e-0086-4332-35644e000000 Time:2023-01-31T05:12:31.8607302Z, RequestInitiated=Tue, 31 Jan 2023 05:12:31 GMT, RequestId=ca3cdcdd-801e-0086-4332-35644e000000, API Version=, QueryParameterName=, QueryParameterValue=
This could happen anywhere from 10 minutes into the restore to some number of hours in. I do see this on backup as well, but less frequently.
A few pertinent details:
We wrap restic in an application that has basic logic for starting the backup/restore, checking progress, and determining success or failure. We start a Kubernetes job for each volume, so in my case we start 10 Kubernetes jobs (which in turn start 10 pods), each running restic against a specific volume.
Our application checks the final output from restic and determines whether it succeeded (no errors). In the case of errors, it fails the k8s job and starts another one, which will pick up where the previous one stopped. The final output looks like this (with a varying number of errors, not always the same):
Fatal: There were 18 errors
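The retry behavior described above amounts to something like the following - a minimal sketch, not our actual wrapper; the function name and attempt count are hypothetical:

```shell
# Minimal sketch of the per-volume retry loop (names are illustrative).
retry_run() {
    # $1: max attempts; remaining args: command to run (e.g. restic restore ...)
    max=$1; shift
    n=1
    while [ "$n" -le "$max" ]; do
        if "$@"; then
            return 0    # restic exited cleanly, this volume is done
        fi
        echo "attempt $n failed, starting a new job" >&2
        n=$((n + 1))
    done
    return 1            # give up after max attempts
}

# Usage (illustrative):
# retry_run 5 restic restore latest --target /datavol2
```

In our case the "retry" is a fresh Kubernetes job rather than a loop in one pod, but the logic is the same: rerun restic against the same volume until it exits without errors.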
I have played around with the number of backend connections (to the Azure storage account): let it use the default of 5, set it to 1, and tried 10. This makes no difference in whether or not I encounter these errors.
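For reference, this is set via restic's Azure backend option; the repository URL and credentials below are placeholders:

```shell
# Placeholder account/container names; azure.connections is the option we varied.
export AZURE_ACCOUNT_NAME=myaccount
export AZURE_ACCOUNT_KEY=mykey
restic -r azure:mycontainer:/ -o azure.connections=10 restore latest --target /datavol2
```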
We have turned on detailed debugging logs for restic and there’s an interesting pattern. I observe that requests with a specific id, let’s say 1111, are moving along fine for many minutes - these are multi-part request/responses. Then there’s a 15-17 minute gap, followed by the 403/auth error. We cannot explain this one and have been getting packet trace information (and sent that to MSFT support). Complete mystery on this one at the moment.
We have confirmed both through documentation, as well as discussion with MSFT support, that they do invalidate an auth header after 15 minutes for security purposes. So the fact that we’re getting a 403/auth error makes sense. We understand why we’re getting them, but not the 15 minute gap where requests seem to go to a black hole. Again, this ONLY happens when backing up or restoring across regions, never in the same region.
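To spell out the server-side rule as we understand it: a request whose signed date header is more than 15 minutes (900 seconds) behind the server clock is rejected with 403 AuthenticationFailed. A trivial sketch of that check (the function name is ours, for illustration only):

```shell
# Sketch of the ~15-minute auth window as we understand it.
auth_window_expired() {
    # $1: request timestamp (epoch seconds), $2: server time (epoch seconds)
    [ $(( $2 - $1 )) -gt 900 ]
}

# A request that stalls for the observed 15-17 minute gap lands outside
# this window, which would explain the 403s we see once it finally arrives.
```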
Wondering if anyone else has encountered this or might have any ideas on how to proceed with debugging.