None of the backend retry methods work in my case for residential IP

MexHigh · June 23, 2024, 10:53am

I’m using restic with resticprofile to back up my data to a Hetzner Storage Box via SFTP. The restic client sits in a residential network, which is force-reconnected every night at 4am to acquire a new IPv4 address (the backups themselves run over IPv6). The connection is interrupted for between 2 and at most 5 minutes during the forced disconnection.

My restic call (as called from resticprofile): restic backup --password-file=password.txt --repo=rclone:hetzner-backup:restic/bc-server --tag=auto --tag=system --verbose /home /root /boot /etc /var/www /var/dockervols

My problem is that restic always fails at 4am every night. I’ve tried sftp directly (which apparently does not have a retry mechanism, as mentioned in sftp backend does not reconnect · Issue #353 · restic/restic · GitHub). I’ve also tried sftp via rclone, but this fails as well with this message:

rclone: 2024/06/23 04:18:07 ERROR : sftp://redacted@redacted.your-storagebox.de:23/restic/bc-server: Discarding closed SSH connection: read tcp [redacted]:49396->[redacted]:23: read: connection timed out
rclone: 2024/06/23 04:18:07 ERROR : sftp://redacted@redacted.your-storagebox.de:23/restic/bc-server: Discarding closed SSH connection: read tcp [redacted]:34286->[redacted]:23: read: connection timed out
rclone: 2024/06/23 04:18:07 ERROR : sftp://redacted@redacted.your-storagebox.de:23/restic/bc-server: Discarding closed SSH connection: read tcp [redacted]:32836->[redacted]:23: read: network is unreachable
rclone: 2024/06/23 04:18:08 ERROR : data/b9/b94137e03d0173dced345378e6eab2f50d9a9eefe7eb7a1372c2f55feeb9f042: Post request put error: Update ReadFrom failed: connection lost
rclone: 2024/06/23 04:18:08 ERROR : data/b9/b94137e03d0173dced345378e6eab2f50d9a9eefe7eb7a1372c2f55feeb9f042: Post request rcat error: Update ReadFrom failed: connection lost
rclone: 2024/06/23 04:18:08 ERROR : data/e0/e0271416bd38b650d6656bf6fa03f9ba4ef90e1b827635c97641ef145423ecc9: Post request put error: Update ReadFrom failed: connection lost
rclone: 2024/06/23 04:18:08 ERROR : data/e0/e0271416bd38b650d6656bf6fa03f9ba4ef90e1b827635c97641ef145423ecc9: Post request rcat error: Update ReadFrom failed: connection lost
rclone: 2024/06/23 04:22:35 ERROR : locks/90a0c1b687cd068939d31a04a73c8714be86571b61151a1dad0631f18f346f5c: Post request put error: Put mkParentDir failed: mkdir dirExists failed: dirExists stat failed: connection lost
rclone: 2024/06/23 04:22:35 ERROR : locks/90a0c1b687cd068939d31a04a73c8714be86571b61151a1dad0631f18f346f5c: Post request rcat error: Put mkParentDir failed: mkdir dirExists failed: dirExists stat failed: connection lost
unable to refresh lock: server response unexpected: 500 Internal Server Error (500)
Fatal: unable to save snapshot: server response unexpected: 500 Internal Server Error (500)

There is also this PR Rework backend retries by MichaelEischer · Pull Request #4784 · restic/restic · GitHub, but I don’t know how this helps or how I configure the retry count.

Some more info:

Restic version: restic 0.16.4 compiled with go1.21.6 on linux/amd64

My .ssh/config:

Host redacted.your-storagebox.de
  Port 23
  User redacted
  IdentityFile ~/.ssh/id_ed25519
  ServerAliveInterval 60
  ServerAliveCountMax 240

My .config/rclone/rclone.conf:

[hetzner-backup]
type = sftp
host = redacted.your-storagebox.de
user = redacted
port = 23
key_file = /root/.ssh/id_ed25519
key_use_agent = false
idle_timeout = 0

Do you know how I can get this to run without restarting restic by hand every night? A script wouldn’t be suitable for me, as the scanning process on every restic start takes very long, which I want to avoid. I don’t understand why restic does not retry indefinitely by default, as it should be designed to run for a long time.

Any ideas? Thanks in advance.

MichaelEischer · June 26, 2024, 5:14pm

The retries in restic 0.16.4 are probably too short to cover an interrupted connection for more than a few minutes. With the linked PR (to be included in restic 0.17.0) the timeout will increase to about 15 minutes.

There is currently no support to reopen an interrupted sftp connection in restic. The plan is to add that support in restic 0.18.0 (the release after 0.17.0). So there’s no quick solution using restic alone other than switching to a different backend.

That error message looks like rclone did not reopen the interrupted connection either. I don’t know whether there’s support for reconnecting to an sftp server.

MexHigh · June 26, 2024, 7:05pm

Ok thank you for this information.

I already tried tracking this down myself. When I ran rclone by itself with rclone serve restic hetzner-backup:restic/bc-server, that connection was not interrupted by the forced reconnect. Maybe restic isn’t able to react to the 500 status code that the rclone web server sends during sftp downtime? This might be a bug.

However, if anyone else is having the same issue: I managed to build a workaround by building a small bash script around restic, which reacts to exit code 1 (also occurs on sftp connection disruptions, see here: Backing up — restic 0.16.4 documentation). I think this is ok as a workaround.
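A minimal sketch of what such a wrapper could look like, not necessarily the script described above. The names retry_on_exit_1, MAX_ATTEMPTS and RETRY_DELAY are hypothetical and are not restic options. The 300-second default delay is an assumption based on the roughly 5-minute outage.

```shell
#!/usr/bin/env bash
# Hypothetical retry wrapper: re-run a command for as long as it exits with code 1.

# Illustrative defaults; both can be overridden from the environment.
MAX_ATTEMPTS="${MAX_ATTEMPTS:-5}"   # total number of runs before giving up
RETRY_DELAY="${RETRY_DELAY:-300}"   # seconds to wait between runs

retry_on_exit_1() {
    local attempt=1 status
    while true; do
        status=0
        "$@" || status=$?
        # Success (0) or any failure other than exit code 1 ends the loop
        # and is passed through unchanged.
        if [ "$status" -ne 1 ]; then
            return "$status"
        fi
        if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt exited with code 1, retrying in ${RETRY_DELAY}s" >&2
        sleep "$RETRY_DELAY"
        attempt=$((attempt + 1))
    done
}

# Example usage with the backup command from the first post:
# retry_on_exit_1 restic backup --password-file=password.txt \
#     --repo=rclone:hetzner-backup:restic/bc-server \
#     --tag=auto --tag=system --verbose \
#     /home /root /boot /etc /var/www /var/dockervols
```

Every retry is a completely new restic run, so the scan starts over each time. Other exit codes, such as 3 (incomplete snapshot because some source data could not be read), are returned as-is instead of triggering a retry.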

betatester77 · June 27, 2024, 6:08am

Just curious, does your backup task take so long that you can’t do it before or after the forced reconnect?

Also, as far as I remember, some routers let you schedule the reconnect yourself at a time of your choosing. Usually the ISP just wants it done once every 24 hours, so if your router does it on its own at a time you like, the ISP usually won’t force another one for the next 24 hours.

MexHigh · June 27, 2024, 8:00am

Of course I COULD do it in between, but in my opinion, a tool creating a high value asset such as backups over sftp should be able to handle such issues. Hence the post.

I observed this behaviour during the initial backup, which took me around two weeks to complete, so there were enough opportunities for me to manually restart restic.