Restic stuck on Mac

Hi guys,

so I am running restic on a Mac for some testing. I am trying to copy most of the files on the entire system with some exclusions to avoid problematic folders using the following command:
restic backup / --no-cache --exclude=/** /.cleverfiles --exclude=/System/Volumes/VM --exclude=/System/Volumes/Data --exclude=/** /OneDriveTemp --exclude=/** /OneDrive

The executable has been granted Full Disk Access.

The first time I ran it, it went just fine, took just over an hour and resulted in a repo size of about 77GB (it did have some warnings of inaccessible files, but that’s not a problem).

Then I ran it again before the weekend, and now it’s still going, but not progressing. By 18:30 on Friday evening it had reached 89.04% and since then, it has just been stuck there. The number of files or the size hasn’t changed at all either. I’ve configure it so that restic produces a progress output every 2 seconds and it is still happily logging those, but no change in any progress values.

Is there anything I can try to do to determine what restic is stuck on? Could the lsof output for the pid be useful e.g.?

Any advice or tips would be appreciated…

Just an update: so I let the process continue to run to see what would happen. At about 79.5 hours, I observed the following output after which the restic command progressed happily and finished at about 81.25 hours:

[79:26:45] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:02
[79:26:47] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:02
[79:26:49] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:03
[79:26:51] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:03
[79:26:54] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:03
[79:26:56] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:03
[79:26:58] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:04
Save(<lock/cf4a7f1456>) returned error, retrying after 552.330144ms: client.PutObject: Put "": read tcp> read: operation timed out
[79:27:00] 89.04%  1109517 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:04
... represents the IP address of the ceph storage cluster which is being backed up to. As can be seen, immediately after the error in the log, the number of files started to increment again. They had been stuck on 1109514 since [2:41:52].

In the time it was stuck, the lsof output for the restic pid always contained this line:

restic 39331 root    9u    IPv4 0x78322971858d9593      0t0                 TCP> (ESTABLISHED)

I also observed some DNS problems just a few seconds before that in the logs of the Java wrapper which is invoking restic and also sending some status updates to a server in the same data center as the ceph cluster: nodename nor servname provided, or not known

where represents the server in the same data center as the ceph storage cluster.

Quite strange behaviour. Any thoughts? Was restic stuck for about 77 hours on one network call?

Sure looks like it to me. Your analysis is as good as mine. The retry message suggests it only waits about 552 ms before trying that save again, and since theres no subsequent message of that type one can assume that the retry succeeded.

I guess that the Save() call was started before the delay, and when restic came back up after the delay, the connection that Save() started was lost since long and therefore you got that error message.

Can you reproduce the issue? Heck, for all we know it could have been cosmic rays, if it was a one time thing and it works for you on subsequent tries.

Thanks @rawtaz, yes it looks like the retry worked immediately.
We have seen a very long delay on a couple of occasions, specifically on 2 different Mac machines, but since this last one I have not been able to reproduce such a very long time of being stuck. But it seems to occasionally get stuck for a good while (a number of minutes to an hour or so) without the progress values incrementing at all, so maybe not quite as rare an occurrence as cosmic rays :grinning_face_with_smiling_eyes:

I can try to get some more figures, but not sure how one would try to get to the bottom of what is causing this… In my rather uninformed view, restic should not wait indefinitely for any network or filesystem calls but timeout after a reasonable amount of time. As I mention in this issue which is probably unrelated but with similar symptoms, I realize it may not be immediately or easily possible to implement such timeouts, but I would still think it would be worth considering and investigating…

Cosmic rays are actually far more common than one would think, check out Veritasium’s video about it on Youtube :slight_smile:

I would build restic with debugging enabled, run it that way and then examine the debug log to see if I could spot what’s going on around when it starts hanging, see: Participating — restic 0.12.1 documentation

Thanks, I’m busy running a backup with debugging. What would I look for in the debug log file?

So far I can see a ton of

body not drained, 0 bytes not read

on the terminal output, is that normal?

/me dunno, I rarely dabble in debug logs. I’d just look at the log to see where it starts looking suspicious. Looking at the source code where body not drained is printed it does seem a bit fishy to me that you’d get that message (seems to me that the 0 should be non-zero).

I see. Yeah, the log file is 5.2 GB in size, so not so easy to wade through. Although I couldn’t stare at the output for 4.5 hours to monitor the progress, I believe there were 1 or 2 times during the backup that it was stuck for a while without the stats incrementing, but so far I don’t see much more than that somewhat fishy message I mentioned before. This is how it appears in the debug log:

2021/10/21 15:33:12 debug/round_tripper_debug.go:49     debug.(*eofDetectReader).Close  214     body not drained, 0 bytes not read: Close()*eofDetectReader).Close*Backend).Load*RetryBackend).Load.func1*RetryBackend).retry*RetryBackend).Load*Repository).LoadBlob*Repository).LoadTree*Archiver).loadSubtree*Archiver).Save*Archiver).SaveTree*Archiver).Snapshot.func1*Tomb).run

There are 480362 of these in the log.

Yeah it’s a big file alright. It would be nice to see a bit more context but I can’t say there’s anything in particular I’m looking for.

Can you reproduce this at will? If yes, can you possibly try to run it with Amazon S3 to see if it’s isolated to the repository server you’re currently using? Also, please provide the output of restic version as well as what OS version and architecture you’re on.

I understand. The extremely long delay I cannot reproduce at will. I think the case of the stats not incrementing for quite some time occasionally (say for a good number of minutes) happens quite regularly. The fishy log message as shown above I can reproduce consistently, there’s always a whole heap of them. I’ll see what I can do with a different backend maybe…

Output from restic version is:

restic 0.12.1 (v0.12.1-115-gfb5d9345) compiled with go1.17.2 on darwin/amd64

Architecture is x86_64

OS is:

ProductName:	macOS
ProductVersion:	11.0.1
Darwin Kernel Version 20.1.0
1 Like

Is that really the first final release of macOS Big Sur? If so you should probably update it to the latest which I think is 11.6 :slight_smile:

How many directories does the backup include?

That warning is printed when trying to access the tree metadata of the parent snapshot. It also means that using the cache in restic could speed up the backup a lot, just about anything will be faster than issuing a half million requests to download some file part. The warning itself shouldn’t be a problem, after all there are “0 bytes” left to read.

Ah, I see, thanks for clarifying @MichaelEischer! Yes, it is a large set of directories, not sure exactly how many, but it’s the whole directory structure on the Mac ("/"), with a few exclusions.
I will need to rethink the cache situation a little bit, at the moment it is disabled, but will ponder on that a bit.
Good to know the warning itself is not a problem. Why is it not finding an eof though? And if this is normal, maybe the message should not be printed?

In any case, my main concern is still the 77 hour delay waiting on a network call. Even if it is 1 hour or 30 mins waiting on a single network call, this should not happen in my view, it should somehow just timeout after a more reasonable amount of time…

The debug log is not used that often and the combination with --no-cache is even rarer. What happens here is that restic e.g. requests 50 bytes and also only reads exactly 50 bytes. That way it won’t trigger the end-of-file state.

There are currently no timeouts for the actual download step, the problem there is setting a value which works for every one. It would probably have to be something like requiring a minimum download speed.