Restic hanging on backup

Tried both current stable and the latest beta build:

restic 0.17.1 compiled with go1.23.1 on darwin/amd64
restic 0.17.1-dev (compiled manually) compiled with go1.22.0 on darwin/amd64

Command I ran was:

restic backup --host SnapRAID --cleanup-cache --skip-if-unchanged \
                ~/.SnapRAID \
                /Volumes/D1-6_SnapRAID/ \
                /Volumes/D2-4_SnapRAID/ \
                /Volumes/D3-4_SnapRAID/ \
                /Volumes/D4-6_SnapRAID/ \
                /Volumes/D5-4_SnapRAID/ \
                /Volumes/D6-6_SnapRAID/ \
                /Volumes/D7-4_SnapRAID/ \
                /Volumes/D8-6_SnapRAID/ \
                /Volumes/P1-8_SnapRAID \
                /Volumes/P2-8_SnapRAID

Video:
https://v.usetapes.com/TzLgg5mba1

SIGQUIT output:
https://hastebin.skyra.pw/iqavoweyok.swift

Not sure what’s going on. I already have a parent snapshot from an earlier version of restic, and I haven’t changed that much since then. The disks aren’t being accessed, and nothing appears to be in the middle of uploading.

I’ve tried it twice now and it hangs at the same spot. Not seeing any disk errors or activity. The disks have spun down to sleep, in fact.

Let it run all night and got this:

repository 24344e18 opened (version 2, compression level max)
using parent snapshot e49b28d2
[0:08] 100.00%  189 / 189 index files loaded
subprocess ssh: Connection to xxx.your-storagebox.de closed by remote host.s ETA 311:51:30
Load(<data/b81965594c>, 0, 0) returned error, retrying after 847.865338ms: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1.242207049s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 5.458865343s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 7.025835455s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 8.671664267s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 47.125927738s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 58.192820294s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 30.157215504s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1m7.391069992s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1m13.10439733s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1m9.518621274s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1m19.54194737s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1m6.297995775s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1m8.21006911s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1m9.523850956s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1m13.703601312s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1m15.132715584s: connection lost
Load(<data/b81965594c>, 0, 0) returned error, retrying after 1m1.001051135s: connection lost
Load(<data/b81965594c>, 0, 0) failed: connection lost
error: /Volumes/D4-6_SnapRAID/Audio: tree 72543d512431031eb55cdc77e44b64d58a2e1c1d20e939e6278ca462a3438f72 could not be loaded; the repository could be damaged: ReadFull(<data/b81965594c>): circuit breaker open for file <data/b81965594c>
[12:07:50] 27.34%  135870 files 992.518 GiB, total 422415 files 3.545 TiB, 1 errors ETA 13:11:00

I can’t tell if the connection loss caused the damage, or if the damage was already there. I just did a full check not that long ago, and the parent snapshot is pretty old.

Did a “find” on that tree:

Found tree 72543d512431031eb55cdc77e44b64d58a2e1c1d20e939e6278ca462a3438f72
 ... path /Volumes/D4-6_SnapRAID/Audio
 ... in snapshot e49b28d2 (2024-04-27 16:55:02)

I guess I should let it finish, then try another backup run, then do a check?

EDIT: It started having a ton of these errors, so I quit it. A regular “check” says no errors. I had a slightly out-of-date local copy of my repo, duplicated it, and am now syncing the cloud backup to the local copy. Will run a full “restic check --read-data” and see what it says.

EDIT2: A full read said everything was okay. I’m going to try backing up with the Rclone SFTP backend instead of Restic’s SFTP backend, and see what happens.

So this backup completed just fine with rclone as the sftp backend. Not sure what was up with Restic’s implementation. Did another check --read-data and everything is fine.
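For reference, using rclone as the backend just means prefixing the repository path with “rclone:” when invoking restic; restic then runs “rclone serve restic” against that remote itself. A sketch, assuming an rclone remote named “storagebox” (type sftp) and a repository directory called “restic-repo” - both names are placeholders for your actual setup:

    # "storagebox" and "restic-repo" are hypothetical names; substitute your own
    restic -r rclone:storagebox:restic-repo backup --host SnapRAID --cleanup-cache --skip-if-unchanged ~/.SnapRAID

The rclone remote has to exist already (set up via “rclone config”) before restic can use it this way.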

I think this might be a hint: the remote host where you are storing your repo and uploading via sftp disconnected from restic. Could it be that the host rebooted during your backup?

A disconnection notice like this means that sshd on your repo server deliberately closed the connection, likely due to sshd being stopped. Hence, the question about a reboot.

Try using restic without rclone again if you want to, and see whether the issue persists.

Otherwise, if it is working to your satisfaction via rclone, then as they say - all roads lead to Rome. Although some routes are longer than others. :wink:

It did that on multiple attempts, and only stopped when I used Rclone+SFTP instead. I’d prefer to use a native backend, if possible - generally that’s better. Just wondering if there’s an SFTP bug worth reporting. I can probably reproduce it.

I don’t doubt that the connection is getting closed by the remote - I don’t think that’s Restic’s fault. But there may be something to how Rclone recovers more gracefully than Restic’s native implementation, perhaps??

Also, this part was especially concerning. I’m curious why it continued instead of immediately stopping when it couldn’t load part of the tree. If it thinks the repository is damaged, shouldn’t it stop trying to back up to it?

I’d suggest the more important question for you is “why”? Why is your server disconnecting? See if you can look up the logs on your server from sshd or sftpd to see what the reason might be for it to disconnect.
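If you do have shell access on the server, the sshd logs would be the place to look; a sketch for a systemd-based server (not applicable to managed storage boxes where you only get sftp access):

    # on the repo server: recent sshd log entries around the time of the disconnect
    journalctl -u ssh --since "2 hours ago"     # Debian/Ubuntu service name
    journalctl -u sshd --since "2 hours ago"    # Fedora/RHEL service name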

I don’t think your repository is damaged. You have already performed a check to validate this. The error message says it “could” be damaged because restic couldn’t retrieve the tree blob located in the pack file it was trying to download when ssh disconnected.

Unfortunately, I can’t see those logs, as far as I can tell. All I know is that it doesn’t disconnect like that with Rclone+SFTP. That said, I’m finding Rclone+WebDAV to be just as stable, but much faster. I’ll probably stick with that as the default for now.

Yes but, what if it had continued and saved the snapshot? I canceled it myself, but it didn’t seem to be giving up - even though, in my mind, without being able to load part of the tree… shouldn’t it have given up at that point?

That error shouldn’t be retried, which is definitely a bug. (Will be fixed by restic/restic PR #5101, “sftp: check for broken connection in Load/List operation”.)

A partial backup is often better than no backup at all. I have written down a TODO somewhere to change the code to just reupload the problematic tree blob in that case.

Did you set the following options in your ssh config file?

ServerAliveInterval 60
ServerAliveCountMax 240

I thought something smelled buggy. Was hoping you’d see this eventually haha

Fair. Would it have introduced corruption, or just a partial snapshot, I wonder?

I did not! I’ve admittedly never messed with my ssh config before… is the following sufficient? Mine didn’t exist, so I had to create it.

Host *
  ServerAliveInterval 60
  ServerAliveCountMax 240

The snapshot might end up referencing a missing tree blob; that part of the snapshot would then be inaccessible.

It may be a good idea to only set the options where relevant, although they probably don’t do much harm. Also take a look at the snippets in the “Preparing a new repository” section of the restic documentation.
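Scoping the options to just the repository host could look like this - the hostname pattern is a placeholder based on the redacted address in the log above:

    Host *.your-storagebox.de
      ServerAliveInterval 60
      ServerAliveCountMax 240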
