Prune fails on CIFS repo using go 1.14 build

Hmm, it looks like this affects more system calls than I expected. Can you check whether the log contains ‘: open /mnt/nas’? (The important part is the word “open”; the prefix and suffix are just there to rule out unexpected matches, assuming the repository path really starts with ‘/mnt/nas’.) If the pattern doesn’t match, then my workaround worked; otherwise it didn’t.
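A quick way to run that check, assuming the prune output was saved to a file (prune.log is a placeholder name, and the sample line below is illustrative):

```shell
# Write one sample "interrupted system call" line into a placeholder log file.
printf 'retrying after 720.254544ms: open /mnt/nas/repo/data/ab: interrupted system call\n' > prune.log

# Count lines where "open" appears right before the repository path.
grep -c ': open /mnt/nas' prune.log   # prints 1 if the pattern is present
```

If `grep -c` prints 0, no “open” errors remain in the output.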

I’m confused again (it doesn’t take much!). I don’t see anything of that form. Possibly I don’t understand what you mean by “the log”. In any case:

In the restic output, there is no mention of an “: open”… only the usual restic output and the “interrupted system call” errors.

In the client syslog, there is confirmation that the remote share was mounted successfully:

Mar 17 21:11:41 pleiades systemd[1]: mnt-nas-pleiades_data.automount: Got automount request for /mnt/nas/pleiades_data, triggered by 14222 (restic)
Mar 17 21:11:41 pleiades systemd[1]: Mounting /mnt/nas/pleiades_data…
Mar 17 21:11:41 pleiades kernel: CIFS: Attempting to mount //maia/pleiades_data
Mar 17 21:11:41 pleiades systemd[1]: Mounted /mnt/nas/pleiades_data.

followed by a long series of error messages similar to what I reported before, e.g.

Mar 17 21:11:43 pleiades kernel: CIFS VFS: \maia\pleiades_data Close interrupted close
Mar 17 21:11:43 pleiades kernel: CIFS VFS: Send error in read = -4

In the syslog on the server, there is nothing.

Have I overlooked a different log location where you would expect to see the ‘: open /mnt/nas’? Or possibly I need a --verbose=? flag to restic to get this in the restic output?

UPDATE: Tried this again with --verbose=4 added to restic command; no difference.

Sorry for the confusion. I meant the output produced by a prune run. In your first post on the topic you had an example of an “interrupted system call” error. That line also contains the word “open” right after "retrying after 720.254544ms: ". So what I wanted to ask was whether the “interrupted system call” error lines still contain the word “open” right before the path name, or whether they now always contain something else like “read”. My hope is that my workaround solved the problem for “open”.

Got it. Thanks.

Indeed there are now no “open” errors!

There is still a series of “read” errors, for example:

Load(<data/2e9db0642e>, 591, 4758136) returned error, retrying after 552.330144ms: read /mnt/nas/redacted/reponame/data/2e/2e9db0642e0fb67b959aa1d91c0d70daa8331ad246c5eeb8582ba2a14f24680f: interrupted system call

and exactly one each of three other error types:

List(data) returned error, retrying after 282.818509ms: lstat /mnt/nas/redacted/reponame/data/64: interrupted system call
List(data) returned error, retrying after 492.389441ms: readdirent: interrupted system call
Save(<data/f0f5102554>) returned error, retrying after 552.330144ms: chmod /mnt/nas/redacted/reponame/data/f0/f0f51025542c0287943ef3816e642586be46ae10dc9efbcfa7b305d9e093dbd4: interrupted system call

Thanks for testing that so quickly. The last three error types confirm a suspicion I’ve had for some time now: for CIFS, the Linux kernel seems to violate the documented behavior of syscalls when interacting with signals. None of lstat, readdir, or chmod even specifies EINTR (“interrupted system call”) as a valid error code. Go also relies on the assumption that the kernel properly restarts syscalls when told to do so (via SA_RESTART), but that is obviously not the case for CIFS.
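To illustrate what goes wrong: Go 1.14’s asynchronous preemption sends signals to running goroutines, and on filesystems like CIFS the interrupted syscall comes back with EINTR instead of being restarted by the kernel. A program would then have to retry by hand, roughly like the sketch below (this is an illustration, not restic’s actual code; `openRetryEINTR` is a hypothetical helper):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// openRetryEINTR retries an open that was interrupted by a signal.
// If the kernel honored SA_RESTART, this loop would never be needed.
func openRetryEINTR(path string) (*os.File, error) {
	for {
		f, err := os.Open(path)
		if err == nil {
			return f, nil
		}
		if errors.Is(err, syscall.EINTR) {
			continue // syscall was interrupted; try again
		}
		return nil, err // a real error, not an interruption
	}
}

func main() {
	if _, err := openRetryEINTR("/nonexistent"); err != nil {
		fmt.Println("open failed:", err)
	}
}
```

Go 1.15 later added this kind of automatic retry inside the standard library for many (but not all) operations, which is why upgrading the Go toolchain helps only partially.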

That said, it’s rather pointless to work around this behavior in restic, so in the short term I guess you either have to use the rest server (or a similar backend) or set the environment variable. For a proper solution, the Linux kernel would need fixing. On a related note, which kernel version do you use on the host that runs restic?


5.5.9

I’ve already moved over to the rest server, which is working fine. My setup with restic over CIFS was more the accidental result of a long evolution than a carefully considered solution, and in hindsight it wasn’t the best approach. As long as it worked, however, there was no compelling reason to change; now that it’s broken, there’s equally no compelling reason for me to stay with it, given that better options are available. I’m more concerned about the user who’s using restic to (for example) back up to an external drive hanging off a consumer router, or even a consumer NAS using CIFS, who will run into this issue once go 1.14 is used to produce the mainstream restic builds and who can’t easily make the same change. Hopefully this thread will lead him or her to your analysis and workaround.

With that said, and with thanks for all your help, I think I’ll call this issue closed for me. It seems like the next step would be to report a bug against the kernel, and (while I follow the broad strokes of your analysis) I definitely don’t have the understanding I’d need to do that intelligently.


Do you think we should add a note to the documentation that using CIFS is not recommended?

Sounds like a good idea. That leaves the question of what the note should describe. I’d like to leave out technical details as far as possible (at least in the note itself; the commit making the change should mention the technical details).

I’m currently thinking about something along the lines of: “On Linux, storing the backup repository on a CIFS (SMB) share is not recommended due to compatibility problems. Either use another backend or set the environment variable GODEBUG to asyncpreemptoff=1.”
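In shell terms, the environment-variable workaround from the proposed note amounts to something like this (illustrative; run restic as usual in the same shell session afterwards):

```shell
# Disable Go's asynchronous preemption, which triggers the EINTR errors on CIFS.
export GODEBUG=asyncpreemptoff=1

# restic prune ...   (run restic as usual; it inherits the variable)
echo "$GODEBUG"      # prints: asyncpreemptoff=1
```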

I’m not sure how interested the average user would be in the exact compatibility problems. It might be useful to include a link to a Github issue.

Speaking of the latter, I’ve opened an issue at https://github.com/restic/restic/issues/2659 .

Sounds good to me. I agree that the note shouldn’t include too much technical information besides the link to the Github issue.

Will you open a PR with the proposed note, or do you want me to do that?

I’ve opened https://github.com/restic/restic/pull/2669 .


Sorry to resurrect this thread, but did anyone report this to the Linux kernel bug tracker? If this is a kernel bug, it seems best fixed there rather than discouraging people from using CIFS altogether.

For the record I ran into this issue backing up a Linux system on a NAS drive via CIFS.

I ran a few searches at https://bugzilla.kernel.org but found nothing.

I don’t think that anyone reported this to the Linux kernel bug tracker. Could you try whether Go 1.15 helps with these failures, e.g. by using restic 0.9.6? According to the Go 1.15 changelog, syscalls are now automatically retried.

0.9.6 was the one that was failing on a 5.8 kernel.
I have since moved away from that to kernel 5.3 and now restic 0.10.0, and have no problems. Maybe it was an issue with the 5.8 kernel.

Maybe I’ll spin up a VM with 5.8 to test this again.

See also https://github.com/restic/restic/issues/2968 for some news on the CIFS problem.


Is there any chance that this issue is solved with go1.16 (as has been suggested would happen)?

I saw that Go 1.16 was released (2021-02-16) just a few days after restic 0.12.0 (2021-02-14).
When a new release of restic is made, will it automatically be built with the latest Go 1.16?

I did some testing with a go 1.16 build; see restic/restic issue #2659 (“restic fails on repo mounted via CIFS/samba on Linux using go 1.14 build”) for the results.

tl;dr No.

Thanks for the reply and tests!
I am using restic in Docker with a CIFS mount, but in that case I think my host (macOS) is doing the actual CIFS mounting, and I have not had any issues.

I will try the same on a Linux host to try to replicate.
I’d love to help if there’s anything I can do…