Robustness of Repository (issue resolved)

Hello All,

I intend to replace my old backup scripts (based on rsync, an ssh tunnel and NFS) with a state-of-the-art solution.
The main use case is backing up three computers over the network to an external server. Root access to all devices is available.
Restic looks like an almost perfect match.

But: I am concerned about repository robustness and the robustness of the backup process in general.
For testing purposes I interrupted the backup process and tried to resume it afterwards.
In most of my test cases, resuming the backup fails.

My assumption was that a restic unlock followed by a restic rebuild-index should be enough before restarting the backup.
Apparently this is not the case.
To me it looks like the repository is left in an unusable state.

root@minilin:~# ./restic --password-file rm-repo-pw --exclude-file rm-repo-exclude -r sftp:backupserver:rm-repo --verbose backup /mnt/rechenmonster/
open repository
repository 1452324a opened successfully, password is correct
lock repository
load index files
start scan on [/mnt/rechenmonster/]
start backup on [/mnt/rechenmonster/]
scan finished in 90.723s: 63304 files, 211.953 GiB
uploaded intermediate index 9edcd9f4
subprocess ssh: Write failed: Broken pipeotal 63304 files 211.953 GiB, 0 errors ETA 1:42:34
Save(<lock/e48056334c>) returned error, retrying after 639.631388ms: ssh command exited: exit status 255
Save(<lock/e48056334c>) returned error, retrying after 1.03550331s: ssh command exited: exit status 255
Save(<lock/e48056334c>) returned error, retrying after 1.549049741s: ssh command exited: exit status 255
Save(<lock/e48056334c>) returned error, retrying after 2.510793368s: ssh command exited: exit status 255
Save(<lock/e48056334c>) returned error, retrying after 3.752043085s: ssh command exited: exit status 255
Save(<lock/e48056334c>) returned error, retrying after 2.994066285s: ssh command exited: exit status 255
Save(<lock/e48056334c>) returned error, retrying after 4.514784368s: ssh command exited: exit status 255
Save(<lock/e48056334c>) returned error, retrying after 11.648304725s: ssh command exited: exit status 255
Save(<lock/e48056334c>) returned error, retrying after 14.114089155s: ssh command exited: exit status 255
Save(<lock/e48056334c>) returned error, retrying after 18.06126997s: ssh command exited: exit status 255
unable to refresh lock: ssh command exited: exit status 25511.953 GiB, 0 errors ETA 1:29:16
Save(<lock/72a8acf055>) returned error, retrying after 508.546024ms: ssh command exited: exit status 255
Save(<lock/72a8acf055>) returned error, retrying after 814.253356ms: ssh command exited: exit status 255
[30:01] 28.92% 13415 files 61.292 GiB, total 63304 files 211.953 GiB, 0 errors ETA 1:13:52
Save(<lock/72a8acf055>) returned error, retrying after 2.165714921s: ssh command exited: exit status 255
Save(<lock/72a8acf055>) returned error, retrying after 2.541480516s: ssh command exited: exit status 255
Save(<lock/72a8acf055>) returned error, retrying after 5.429713763s: ssh command exited: exit status 255
Save(<lock/72a8acf055>) returned error, retrying after 4.530872255s: ssh command exited: exit status 255
Save(<lock/72a8acf055>) returned error, retrying after 11.165183953s: ssh command exited: exit status 255
Save(<lock/72a8acf055>) returned error, retrying after 12.267683116s: ssh command exited: exit status 255
Save(<lock/72a8acf055>) returned error, retrying after 28.369074974s: ssh command exited: exit status 255
unable to refresh lock: ssh command exited: exit status 25511.953 GiB, 0 errors ETA 1:11:09
Save(<data/5ad029a2a4>) returned error, retrying after 343.080735ms: ssh command exited: exit status 255
Save(<data/5ad029a2a4>) returned error, retrying after 813.277506ms: ssh command exited: exit status 255
Save(<data/5ad029a2a4>) returned error, retrying after 1.506813376s: ssh command exited: exit status 255
Save(<data/5ad029a2a4>) returned error, retrying after 1.211843837s: ssh command exited: exit status 255
Save(<data/5ad029a2a4>) returned error, retrying after 1.752838669s: ssh command exited: exit status 255
Save(<data/5ad029a2a4>) returned error, retrying after 2.130523529s: ssh command exited: exit status 255
Save(<data/5ad029a2a4>) returned error, retrying after 3.921231028s: ssh command exited: exit status 255
Save(<data/5ad029a2a4>) returned error, retrying after 9.039136062s: ssh command exited: exit status 255
Save(<data/5ad029a2a4>) returned error, retrying after 17.5063102s: ssh command exited: exit status 255
Save(<data/5ad029a2a4>) returned error, retrying after 12.589100081s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 251.417301ms: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 412.907106ms: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 764.639135ms: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 915.773596ms: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 3.716604584s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 3.404711931s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 4.714168987s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 6.37924004s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 9.267381356s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 22.90579546s: ssh command exited: exit status 255
Fatal: unable to save snapshot: ssh command exited: exit status 255iB, 0 errors ETA 1:14:51
Remove(<lock/4d7f027202>) returned error, retrying after 398.60313ms: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 1.033939473s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 1.504770267s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 1.79081597s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 2.504306698s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 5.631618235s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 7.325304333s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 11.199762921s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 19.211147779s: ssh command exited: exit status 255
Remove(<lock/4d7f027202>) returned error, retrying after 26.490950035s: ssh command exited: exit status 255
error in cleanup handler: ssh command exited: exit status 255

root@minilin:~# ./restic --password-file rm-repo-pw --exclude-file rm-repo-exclude -r sftp:backupserver:rm-repo --verbose backup /mnt/rechenmonster/
open repository
repository 1452324a opened successfully, password is correct
lock repository
load index files
start scan on [/mnt/rechenmonster/]
start backup on [/mnt/rechenmonster/]
scan finished in 122.380s: 63304 files, 211.953 GiB
subprocess ssh: Write failed: Broken pipe 63304 files 211.953 GiB, 0 errors ETA 1:32:29
Save(<lock/44e973c1fd>) returned error, retrying after 534.843163ms: ssh command exited: exit status 255
Save(<lock/44e973c1fd>) returned error, retrying after 866.13663ms: ssh command exited: exit status 255
Save(<lock/44e973c1fd>) returned error, retrying after 1.140025974s: ssh command exited: exit status 255
Save(<lock/44e973c1fd>) returned error, retrying after 1.666121786s: ssh command exited: exit status 255
Save(<lock/44e973c1fd>) returned error, retrying after 1.64631936s: ssh command exited: exit status 255
Save(<lock/44e973c1fd>) returned error, retrying after 4.380748912s: ssh command exited: exit status 255
Save(<lock/44e973c1fd>) returned error, retrying after 6.573224506s: ssh command exited: exit status 255
[5:17] 6.29% 8221 files 13.327 GiB, total 63304 files 211.953 GiB, 0 errors ETA 1:18:57
Save(<lock/44e973c1fd>) returned error, retrying after 17.045598589s: ssh command exited: exit status 255
Save(<lock/44e973c1fd>) returned error, retrying after 12.830693829s: ssh command exited: exit status 255
unable to refresh lock: ssh command exited: exit status 255953 GiB, 0 errors ETA 1:16:06
Save(<data/0ace7664f9>) returned error, retrying after 549.495525ms: ssh command exited: exit status 255
Save(<data/0ace7664f9>) returned error, retrying after 419.53713ms: ssh command exited: exit status 255derfrankreich/WolleBilderfrankreich 007.jpg
Save(<data/0ace7664f9>) returned error, retrying after 1.010968958s: ssh command exited: exit status 255
Save(<data/0ace7664f9>) returned error, retrying after 1.153458669s: ssh command exited: exit status 255
Save(<data/0ace7664f9>) returned error, retrying after 2.923233639s: ssh command exited: exit status 255
Save(<data/0ace7664f9>) returned error, retrying after 5.669796535s: ssh command exited: exit status 255
Save(<data/0ace7664f9>) returned error, retrying after 7.477980579s: ssh command exited: exit status 255
Save(<data/0ace7664f9>) returned error, retrying after 6.085261698s: ssh command exited: exit status 255
Save(<data/0ace7664f9>) returned error, retrying after 14.933166759s: ssh command exited: exit status 255
Save(<data/0ace7664f9>) returned error, retrying after 15.308642956s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 600.706632ms: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 1.091228601s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 1.492707128s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 847.919561ms: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 2.926447392s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 2.436503605s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 6.33416481s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 4.493453035s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 17.82109473s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 24.20504431s: ssh command exited: exit status 255
Fatal: unable to save snapshot: ssh command exited: exit status 255 0 errors ETA 1:32:20
Remove(<lock/a5f408e26b>) returned error, retrying after 638.496395ms: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 873.597202ms: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 1.458896684s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 1.109790543s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 1.780625266s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 1.988941899s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 3.247499977s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 11.74492638s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 7.512791006s: ssh command exited: exit status 255
Remove(<lock/a5f408e26b>) returned error, retrying after 25.917067357s: ssh command exited: exit status 255
error in cleanup handler: ssh command exited: exit status 255

Remark: during these runs the network connection was stable and the backup server was up and running the whole time.

Based on your experience / expertise:
Is it guaranteed that existing data in a repository never gets corrupted in case of a connection loss during an operation (backup, rebuild-index, prune, etc.)?
What do I have to do to be able to safely continue operations after a connection loss?

Best regards

Judging by the logs, restic has trouble refreshing the lock. I’ve never seen this before. Can you provide more details about your setup? Are you using a write-once filesystem or file locking?

rebuild-index is never necessary in this situation, and unlock is rarely necessary. If you “cleanly” interrupt the backup (SIGINT, SIGTERM) then restic tries to clean up the lock file before terminating. You’d only have to use unlock if something prevented restic from deleting its lock file, such as a network error.
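For completeness, a manual cleanup would look roughly like this (a sketch reusing the repository flags from the command above; only needed when a stale lock is actually left behind):

./restic --password-file rm-repo-pw -r sftp:backupserver:rm-repo unlock
# list any locks that are still present; should print nothing afterwards
./restic --password-file rm-repo-pw -r sftp:backupserver:rm-repo list locks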

This usually indicates a problem connecting to the remote. Double-check that you can actually ssh to the machine using the same hostname, then try running restic again. (Maybe the DNS server wasn’t returning a record for backupserver?)
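A quick connectivity check along these lines would rule that out (a sketch; the alias and repository path are taken from your command line):

# does a plain ssh login to the same alias work?
ssh backupserver true && echo "plain ssh works"
# does the sftp subsystem start, and can it see the repository directory?
echo ls | sftp backupserver:rm-repo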

Yes. By design, backup only adds new files to the repository and never modifies or deletes existing files (except its own lock file). File names are the SHA256 sum of the file’s contents, so modifying a file isn’t even possible without completely invalidating it.
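You can observe this property directly on the backup server (a sketch; it assumes the repository lives in ~/rm-repo and that sha256sum from GNU coreutils is available):

# every pack file name should equal the SHA-256 of its contents
cd ~/rm-repo/data
for f in */*; do
    [ "$(sha256sum "$f" | cut -d' ' -f1)" = "$(basename "$f")" ] || echo "mismatch: $f"
done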

It is possible for partially written files to be left behind if the connection is interrupted in the middle of writing a file. In that case, you would have a file whose contents do not hash to its filename, but that file isn’t actually referenced by anything. In this situation, I think the only commands that would complain about the partial file are rebuild-index and check; other commands like backup, restore, snapshots, etc. will not even notice the presence of the partial file. Running prune will delete all partial files after verifying that all snapshots are fully intact.
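In practice, something like the following is enough to detect and clean up such leftovers (a sketch reusing the flags from your original command):

# report unreferenced or partially written files without changing anything
./restic --password-file rm-repo-pw -r sftp:backupserver:rm-repo check
# remove data not referenced by any snapshot, including partial packs
./restic --password-file rm-repo-pw -r sftp:backupserver:rm-repo prune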

You don’t have to do anything extra. (At some point you could run prune to clean up any partial packs, but this is just a garbage collection task.)

The errors you are seeing are likely network-related and external to restic.


Hi 764287,

here are my setup details.

Sender:
restic 0.9.4 compiled with go1.11.4 on linux/amd64
Debian Linux 7 running in a Virtualbox VM on Windows
Linux version 3.2.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.65-1+deb7u1
File system to be backed up is mounted via cifs (ro,relatime,sec=ntlm,uid=0,noforceuid,gid=0,noforcegid,file_mode=0755,dir_mode=0755,nounix,rsize=61440,wsize=65536,actimeo=1)
OpenSSH_6.0p1 Debian-4+deb7u2, OpenSSL 1.0.1e 11 Feb 2013

Receiver:
Debian Linux 9
Linux version 4.9.0-8-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.144-3.1 (2019-02-19)
File system type ext4 mounted (rw,relatime,errors=remount-ro,data=ordered)
OpenSSH_7.4p1 Debian-10+deb9u6, OpenSSL 1.0.2r 26 Feb 2019

Remarks:

  • The sending VM is not exposed to the internet, so using an outdated Debian version is not a security issue here.
  • The source Windows file system is a read-only “shadow copy”.

Hi cdhowie,

Thanks for your explanations.

The connection is monitored and was all good during the two runs. DNS resolution is not used. backupserver is an ssh alias pointing to an IPv4 address.

Sounds good - this is exactly what I was looking for. Now I just need to get restic to work properly here.

No, unfortunately these are definitely not network issues. That would be simple to fix. The root cause must be something else. I tried this multiple times and I am quite sure that the issue is reproducible here.

What I can suggest for the next troubleshooting step is using sftp from the machine running restic to the backup server, and trying to upload files into the lock directory and then delete them. Does this work?
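An interactive session like the one below would exercise exactly the operations restic needs (a sketch; it assumes the repository is at rm-repo relative to your sftp login directory):

sftp backupserver
sftp> cd rm-repo/locks
sftp> put /etc/hostname testfile
sftp> rm testfile
sftp> quit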

I disagree that a prune is not a necessary recovery step. Early on when I was using Restic, I had a similar situation where a network interruption left behind some partially written files. Because files with a matching hash name already existed, subsequent backup operations ignored (rather than repaired or replaced) the partially written files, so the hole in my backups persisted until I removed those files, allowing them to be recreated.

I’m guessing that’s been fixed by now, but I can’t say for sure.

A backup operation should only look at indexes (IIRC) and an index should never be generated for an incomplete pack.

Hi Sitwon,

This would explain my observations very well. After a prune, I am able to resume operations without any further issues.

In the meantime I updated the client that is running restic from Debian 7 to the latest Debian 9. The issue still occurs and is apparently not related to the ssh version or system environment used.

This strikes me as a potential bug in restic. Prune shouldn’t be necessary to recover; if it is, something else is wrong.

That’s correct, a pack file in the repo is only ever mentioned in an index file after it has been uploaded successfully.

Oh? That’s interesting, can you maybe elaborate a bit? Restic does not re-use file names in the repo (except for the config file), so it should never attempt to write the same file again in a subsequent run.

I understood roughly the following, can you expand on this please?

  • You ran restic, saving data to a remote repository, the network connection was interrupted and restic aborted.
  • A few files were left half-written in the repo.
  • You ran restic again, it started roughly where it left off and wrote the same files in the repo(?)
  • There was a “hole” in your backups (I don’t understand what you mean by that)?

Thanks!

This is the underlying issue here: restic runs ssh to open the connection to the sftp server, and the ssh subprocess died. We need to find out why it was terminated, and resolve that. It has (almost) nothing to do with restic 🙂
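For an sftp: repository, restic does not speak the SSH protocol itself; it spawns an ssh subprocess and requests the sftp subsystem, roughly equivalent to this (simplified sketch):

# roughly what restic runs under the hood for -r sftp:backupserver:rm-repo
ssh backupserver -s sftp

So the “exit status 255” in the log is the ssh client itself reporting that its connection failed.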

It’d be great if you could reproduce the issue, passing the following parameters to get verbose logs from the SSH client:

-o sftp.command="ssh -v backupserver -s sftp"

Maybe we can see something here…
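For reference, the option slots into the full invocation like this (reusing the command line from the first post):

./restic --password-file rm-repo-pw --exclude-file rm-repo-exclude \
    -r sftp:backupserver:rm-repo \
    -o sftp.command="ssh -v backupserver -s sftp" \
    --verbose backup /mnt/rechenmonster/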

Hello all,

I may just be grasping at straws here, but is it possible the firewall on the receiving server is blocking or rate-limiting the ssh connections? I’ve had a similar issue with rclone when syncing over ssh: I had used ufw to limit the number of ssh connections allowed per unit of time (I think my ufw status output contained something like “limit ssh”). I had added the limit to fend off invalid connection attempts, but because of how ufw limit rules work, enough valid connections could also cause the firewall to block. rsync might only make one connection (since it works as a single thread), whereas restic might make several per backup.
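If ufw is involved, something like this would show the rule and, if necessary, relax it (a sketch; the exact rule depends on your configuration):

# a line such as "22/tcp  LIMIT  Anywhere" indicates rate limiting
ufw status verbose
# replace the rate-limited rule with a plain allow rule if it turns out to be the culprit
ufw delete limit ssh
ufw allow ssh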

Also, is it possible that you have your ssh connections time-limited for security? If restic keeps sessions going for long enough, maybe this would explain the dropped connections. Again this seems to be a security practice some people use.

One more guess… Is it possible the storage of the external server is corrupt? Can you run an fsck on it?

Good luck, I hope you figure this out!
jedi453


I finally identified the root cause of the problem by increasing the sshd log level on the server side: the SSH connection was closed by the server after a few minutes of idle time. Apparently this happened while restic was processing large amounts of unchanged data.

Adding the following lines to the client’s .ssh/config file resolved the problem:

ServerAliveInterval 60
ServerAliveCountMax 240

(Smaller values for ServerAliveCountMax will most probably be sufficient as well; I selected a large number just to be sure this never happens again.)
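For anyone finding this later, the complete entry in the client’s ~/.ssh/config could look like this (the HostName below is a placeholder, not my real server address):

Host backupserver
    # replace with the real address of the backup server
    HostName 192.0.2.10
    # send a keepalive probe after 60 seconds without server traffic
    ServerAliveInterval 60
    # give up only after 240 unanswered probes
    ServerAliveCountMax 240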

As my sshd server largely uses the Debian default configuration values, this problem might affect other users as well.

BTW: Increasing the log verbosity of the SSH client did not reveal any valuable information regarding the connection loss.
If anyone is interested, I can make the logs available anyway.


Great find! I can definitely see this happening, and it is likely to be a problem for someone else in the future. Thanks for documenting it here.

@fd0, I wonder if there is some way that we can set such defaults within restic?

Sure, we can pass these options to the ssh client we’re calling… Should be easy to do. Anybody up for a PR? Hit me up for the details if you need help 🙂
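Until that lands, the keepalive options can also be passed per invocation without touching ~/.ssh/config, via sftp.command (a sketch, untested; host and repository names as used earlier in this thread):

./restic -r sftp:backupserver:rm-repo \
    -o sftp.command="ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=240 backupserver -s sftp" \
    --password-file rm-repo-pw backup /mnt/rechenmonster/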

@devrandom I’m wondering, did the server have any setting to disconnect idle clients? I’m not used to ssh servers doing that; sometimes it’s a (badly) configured “firewall” between client and server…

The server is using built-in default values:

root@backupserver:~# sshd -T
[...]
clientaliveinterval 0
clientalivecountmax 3
[...]
root@backupserver:~# cat /etc/ssh/sshd_config | grep alive
root@backupserver:~# 

Server version:

 OpenSSH_7.4p1 Debian-10+deb9u6, OpenSSL 1.0.2r  26 Feb 2019

The service provider states:

We do not have any intrusion or denial-of-service detection system active. No interruption was initiated from our side.
