Prune fails on CIFS repo using go 1.14 build

underhillian · March 11, 2020, 3:50pm

I’m having a problem where certain operations (rebuild-index and prune, at least) fail. The problem is specific to remote repos mounted via CIFS (linux host and client) and versions of restic built with go 1.14. Restic built with a previous release of go (using the same restic source) continues to work fine, as do operations on local and S3 repos. Other restic operations (backup in particular) appear to be unaffected.

Specifically, a prune produces a long list of errors of the form

Load(<data/216ea5f2d2>, 591, 4564169) returned error, retrying after 720.254544ms: open /mnt/nas/redacted/reponame/data/21/216ea5f2d21b458a7b913609ddef2a6ac4788b4bad5481b2916558d2ce1bef04: interrupted system call

followed ultimately by

tree 26ead64a6120bcc0eda123d5bdc07bd3c655ce51b29ad147864fad6e21f8913a not found in repository
github.com/restic/restic/internal/repository.(*Repository).LoadTree
github.com/restic/restic/internal/repository/repository.go:713
github.com/restic/restic/internal/restic.FindUsedBlobs
github.com/restic/restic/internal/restic/find.go:11
main.pruneRepository
github.com/restic/restic/cmd/restic/cmd_prune.go:191
main.runPrune
github.com/restic/restic/cmd/restic/cmd_prune.go:85
main.glob…func18
github.com/restic/restic/cmd/restic/cmd_prune.go:25
github.com/spf13/cobra.(*Command).execute
github.com/spf13/cobra@v0.0.3/command.go:762
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/cobra@v0.0.3/command.go:852
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/cobra@v0.0.3/command.go:800
main.main
github.com/restic/restic/cmd/restic/main.go:86
runtime.main
runtime/proc.go:203
runtime.goexit
runtime/asm_amd64.s:1373

I am using

restic 0.9.6 (v0.9.6-137-gc542a509) compiled with go1.14 on linux/amd64

I originally reported this as a bug against the current Arch Linux release (0.9.6-2) which is where I first saw the issue. Based on some helpful feedback from one of the Arch packagers (Foxboron) I’m now convinced this is a go 1.14 issue. But I know next to nothing about go, so I’m not sure where to go from here.

Any advice would be appreciated. Thanks!

ProactiveServices · March 11, 2020, 8:30pm

Is there anything Samba-related in the syslog that might point to a problem with the transport?

underhillian · March 11, 2020, 8:59pm

@ProactiveServices:Yes, there is a long series of error messages (truncated here):

Mar 09 17:05:41 pleiades kernel: CIFS VFS: Send error in read = -4
Mar 09 17:05:41 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 09 17:05:41 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 09 17:05:41 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 09 17:05:41 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 09 17:05:42 pleiades kernel: CIFS VFS: Send error in read = -4
Mar 09 17:05:42 pleiades kernel: CIFS VFS: Send error in read = -4
Mar 09 17:05:42 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 09 17:05:42 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 09 17:05:42 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 09 17:05:43 pleiades kernel: CIFS VFS: Send error in read = -4
Mar 09 17:05:43 pleiades kernel: CIFS VFS: Send error in read = -4
Mar 09 17:05:43 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 09 17:05:43 pleiades kernel: CIFS VFS: Send error in read = -4
Mar 09 17:05:44 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 09 17:05:44 pleiades kernel: CIFS VFS: Send error in read = -4
Mar 09 17:05:44 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 09 17:05:45 pleiades kernel: CIFS VFS: Send error in read = -4
…

ProactiveServices · March 11, 2020, 10:27pm

Whilst you mentioned it doesn’t happen with a different version of restic, this looks to me like a problem with the CIFS transport. If there are interruptions in the system calls then any program (go-based or otherwise) will fail. It could be that there have been Go changes that use I/O differently, perhaps making use of features that weren’t previously used, but these show up underlying problems with your CIFS connection. I’d check the logs on the other end and also look at the network reliability, and see if you can pass more conservative flags to the mount, regarding caching, timeouts etc.

Perhaps someone with a bit more experience might like to chime in.

rawtaz · March 11, 2020, 10:39pm

Wait, what? If you have a Linux repository server, why are you even considering using CIFS for the connection? Install https://github.com/restic/rest-server on the Linux server and shoot your backups there instead (it’ll be over HTTP(S)).

MichaelEischer · March 11, 2020, 11:06pm

This looks like a side effect of asynchronous preemptions in go 1.14 to me. The release notes state the following:

This means that programs that use packages like syscall or golang.org/x/sys/unix will see more slow system calls fail with EINTR errors. Those programs will have to handle those errors in some way, most likely looping to try the system call again.

underhillian · March 12, 2020, 2:56am

Thanks. I take your point. That said, the short answer to your question until now has been “if it ain’t broke, don’t fix it .” A somewhat longer answer is that the server in question provides a number of functions for Windows clients (in addition to Linux clients) and the simplest overall solution seemed to be to provide everything needed over CIFS/Samba. Adding and maintaining another piece of software just so restic will work is certainly not the end of the world (and based on @MichaelEischer’s comment below looks like the most practical approach for me) but it adds a small incremental level of complexity on the server that hasn’t been necessary in the past.

underhillian · March 12, 2020, 3:12am

Thanks very much. Granted that my understanding of the details is limited, this make perfect sense.

Since even my aspirational level of competence with go doesn’t extend to addressing this , and since the go folks seem to have anticipated and accepted the potential consequences of the change, I think the best approach for me is to adopt @rawtaz’s suggestion and get out of the CIFS business (at least as far as restic is concerned) by installing rest-server. I’ll give this a try and (assuming I’m successful) mark this as complete.

rawtaz · March 12, 2020, 9:32am

Just let us know if you have problems getting the REST server running!

MichaelEischer · March 12, 2020, 8:46pm

Can you try whether setting the environment variable GODEBUG=asyncpreemptoff=1 avoids these interrupted system call errors?

In case restic is not compatible with the assumptions made by go, then this is a bug in restic. I’d suggest to open an issue on github.

underhillian · March 13, 2020, 4:26pm

Aha! This works perfectly. No errors and the prune completes successfully.

I will do this. Thanks again for your help.

underhillian · March 13, 2020, 4:47pm

Since you offer…

After some experimentation, it appears to me that the REST server is very restrictive as to repo location in that it works only with repos located in its “data directory.” Specifically, it fails on both a) repos located in subdirectories below the data directory and b) repos symlinked to the data directory.

Am I missing something or doing something obviously wrong?

If not, then to use the REST server I would either have to restructure pretty much my servers entire data hierarchy, or do something ugly like running separate instances of the REST server for every repo location (and there are a dozen or so, basically one for each client and a shared area.). Neither of these options is very attractive.

I gather from the restic documentation that the REST backend is broadly preferred over sftp, but the greater flexibility (I think) sftp will provide is now pushing me in that direction. Any thoughts?

Thanks.

cdhowie · March 13, 2020, 5:22pm

What the heck? Is it normal policy for Golang to introduce breaking changes in minor releases!?

rawtaz · March 13, 2020, 6:12pm

I’m not sure what you mean by rest-server’s “data directory” - can you clarify/elaborate? Perhaps if you give more concrete examples of what you tried to do, that’d be helpful in understanding.

I just run it with --path /data2/restic where /data2 is a mounted volume (LVM) and restic is a subfolder in there, and --private-repos. The users’ repositories are then created automatically as e.g. /data2/restic/firstname.lastname when they use the firstname.lastname account in .htaccess.

In /data2/restic/ I also have some symlinks, i.e. where the firstname.lastname folder is a symlink to /data/restic/firstname.lastname.

All of this works fine. I’m not sure if that helps you in any way, but I fail to see exactly what you mean and I also fail to see in what way it’s not working for you.

What greater flexibility do you see with SFTP compared to rest-server, for your use cases? I’ve used both and nowadays I use the REST backend (with rest-server) for both customers/external clients and my own repository. Surely you can tighten down SFTP a bit more with the features of SSH, but that’s about what I’d see as the main thing with SFTP (and it’s something I don’t feel I need, so I stay ignorant of it

underhillian · March 13, 2020, 7:21pm

Oops. I had a typographical mismatch between my data directory (that’s the term rest-serv --help uses for the argument to the --path flag) and where I actually put the symlinks. Must have gone over it a dozen times and never saw it until I started copy-and-pasting the error output into a reply to you.

Apologies for wasting your time. If it’s any consolation, I probably would have just given up if you hadn’t confirmed that symlinked repos work just fine.

Anyway, the REST server is working now and I’m back on track.

Thanks again for your help.

rawtaz · March 13, 2020, 7:47pm

This is why I’ve started putting on my rubber duck suit right in the morning nowadays, having it on all the time is clearly the most effective method to help people.

That’s fine, I’m glad your experience from this is that you should never give up as there’s usually a solution around the corner

Really nice!

MichaelEischer · March 14, 2020, 7:52pm

will see more slow system calls fail with EINTR errors

Go configures signal handlers to restart syscalls if possible. The standard library also retries syscalls when necessary. That is there should only be issues when directly calling low-level syscalls and in that case one should just implement things properly. Or in other words: Code that has problems with interrupted syscalls was already broken, the change just makes it much more obvious. In that regard they didn’t break correct code.

But now on to the prune errors: The error message seems to point to an os.Open call (via fs.Open) in the Load function of the local backend. So it looks like a Go standard library call fails. However, when reading the manpage for signal (man 7 signal) it states that the open syscall, that is called underneath, is always restarted when using SA_RESTART as is done by Go. So this might actually be a bug in the linux kernel…

@underhillian just to get a feeling for how slow the syscalls are: How high is the network latency (i.e. ping times) between your backup host and client?

I’ve also created a small workaround for the interrupted syscall problem at Commits · MichaelEischer/restic · GitHub . The code is based on restic v0.9.6 with an additional commit that adds a loop around the call to open which I assume to be the culprit. @underhillian would it be possible for you to test the workaround?

And I think that it would be a good idea to move the debugging to a github issue (For me this post feels way to low-level level for a forum post)

underhillian · March 14, 2020, 10:29pm

– maia ping statistics —
10 packets transmitted, 10 received, 0% packet loss, time 9049ms
rtt min/avg/max/mdev = 0.816/0.998/1.133/0.117 ms

Sure.

I’m not familiar with working with patched code. If you could tell me in 10 words or less how to clone your repo (or similarly download your commit) that would help. If not, don’t worry about it…I’ll figure it out (and it’s something I should know how to do anyway).

MichaelEischer · March 15, 2020, 11:10am

If you can use git, then follow the following steps. Otherwise you can also download the code as a zip archive from Github via the ‘clone or download’ button on https://github.com/MichaelEischer/restic/tree/workaround-eintr .

You basically clone the repository first using git clone https://github.com/MichaelEischer/restic.git and then switch to the branch git checkout workaround-eintr. You could also combine both into a single command: git clone -b workaround-eintr https://github.com/MichaelEischer/restic.git. Other options would be to add my repository as an additional remote.

Even though it’s rather common to just run random code from the internet (possibly even as root), you might want to briefly check that the branch just contains the code it should. You can for example use git log --graph -p to take a look at the changes made. And to ensure that the commit is properly based on restic v0.9.6, you can compare the hash of the commit that increased the restic version with the official restic release https://github.com/restic/restic/releases/tag/v0.9.6 (the commit hash is shown in the left column, just below the tag name).

underhillian · March 15, 2020, 6:21pm

I believe I did this correctly. I used “git clone” followed by “git checkout workaround-eintr” as you described. The version of restic (per “restic version”) which resulted was

restic 0.9.6 (v0.9.5-99-g611318d7) compiled with go1.14 on linux/amd64

which is not quite what I expected (so if this seems wrong to you please let me know).

Unfortunately running this version still produces a series of “interrupted system call” errors, for example

Load(<data/044e0ab067>, 591, 4314150) returned error, retrying after 582.280027ms: read /mnt/nas/redacted/reponame/data/04/044e0ab067e74bbb52749ff1a06b999704afc452ffeca5df935c922977838034: interrupted system call

Just to check that nothing else had changed, I reran this (patched) version with GODEBUG=asyncpreemptoff=1 as you suggested previously. No errors.

If there’s something else you would like me to try, happy to do so.