Restic stuck on Mac

ctonsing · October 18, 2021, 8:21am

Hi guys,

so I am running restic on a Mac for some testing. I am trying to copy most of the files on the entire system with some exclusions to avoid problematic folders using the following command:
restic backup / --no-cache --exclude=/** /.cleverfiles --exclude=/System/Volumes/VM --exclude=/System/Volumes/Data --exclude=/** /OneDriveTemp --exclude=/** /OneDrive

The executable has been granted Full Disk Access.

The first time I ran it, it went just fine, took just over an hour and resulted in a repo size of about 77GB (it did have some warnings of inaccessible files, but that’s not a problem).

Then I ran it again before the weekend, and now it’s still going, but not progressing. By 18:30 on Friday evening it had reached 89.04% and since then, it has just been stuck there. The number of files or the size hasn’t changed at all either. I’ve configure it so that restic produces a progress output every 2 seconds and it is still happily logging those, but no change in any progress values.

Is there anything I can try to do to determine what restic is stuck on? Could the lsof output for the pid be useful e.g.?

Any advice or tips would be appreciated…

ctonsing · October 19, 2021, 8:07am

Just an update: so I let the process continue to run to see what would happen. At about 79.5 hours, I observed the following output after which the restic command progressed happily and finished at about 81.25 hours:

...
[79:26:45] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:02
[79:26:47] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:02
[79:26:49] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:03
[79:26:51] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:03
[79:26:54] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:03
[79:26:56] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:03
[79:26:58] 89.04%  1109514 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:04
Save(<lock/cf4a7f1456>) returned error, retrying after 552.330144ms: client.PutObject: Put "https://domain.name.of.ceph.s3.interface.com/bucket/path/to/repo/locks/cf4a7f145652b484432b2b9986d2f1ad9a98f247424fbb721eec81fc67158b6c": read tcp 192.168.0.184:56135->1.2.3.4:443: read: operation timed out
[79:27:00] 89.04%  1109517 files 91.661 GiB, total 1183423 files 102.950 GiB, 3 errors ETA 9:47:04
...

1.2.3.4 represents the IP address of the ceph storage cluster which is being backed up to. As can be seen, immediately after the error in the log, the number of files started to increment again. They had been stuck on 1109514 since [2:41:52].

In the time it was stuck, the lsof output for the restic pid always contained this line:

restic 39331 root    9u    IPv4 0x78322971858d9593      0t0                 TCP 192.168.0.184:56135->1.2.3.4:https (ESTABLISHED)

I also observed some DNS problems just a few seconds before that in the logs of the Java wrapper which is invoking restic and also sending some status updates to a server in the same data center as the ceph cluster:

java.net.UnknownHostException: domain.name.of.server.com: nodename nor servname provided, or not known

where domain.name.of.server.com represents the server in the same data center as the ceph storage cluster.

Quite strange behaviour. Any thoughts? Was restic stuck for about 77 hours on one network call?

rawtaz · October 20, 2021, 6:50pm

Sure looks like it to me. Your analysis is as good as mine. The retry message suggests it only waits about 552 ms before trying that save again, and since theres no subsequent message of that type one can assume that the retry succeeded.

I guess that the Save() call was started before the delay, and when restic came back up after the delay, the connection that Save() started was lost since long and therefore you got that error message.

Can you reproduce the issue? Heck, for all we know it could have been cosmic rays, if it was a one time thing and it works for you on subsequent tries.

ctonsing · October 21, 2021, 7:55am

Thanks @rawtaz, yes it looks like the retry worked immediately.
We have seen a very long delay on a couple of occasions, specifically on 2 different Mac machines, but since this last one I have not been able to reproduce such a very long time of being stuck. But it seems to occasionally get stuck for a good while (a number of minutes to an hour or so) without the progress values incrementing at all, so maybe not quite as rare an occurrence as cosmic rays

I can try to get some more figures, but not sure how one would try to get to the bottom of what is causing this… In my rather uninformed view, restic should not wait indefinitely for any network or filesystem calls but timeout after a reasonable amount of time. As I mention in this issue which is probably unrelated but with similar symptoms, I realize it may not be immediately or easily possible to implement such timeouts, but I would still think it would be worth considering and investigating…

rawtaz · October 21, 2021, 10:00am

Cosmic rays are actually far more common than one would think, check out Veritasium’s video about it on Youtube

I would build restic with debugging enabled, run it that way and then examine the debug log to see if I could spot what’s going on around when it starts hanging, see: Participating — restic 0.12.1 documentation

ctonsing · October 21, 2021, 2:36pm

Thanks, I’m busy running a backup with debugging. What would I look for in the debug log file?

So far I can see a ton of

body not drained, 0 bytes not read

on the terminal output, is that normal?

rawtaz · October 21, 2021, 9:18pm

/me dunno, I rarely dabble in debug logs. I’d just look at the log to see where it starts looking suspicious. Looking at the source code where body not drained is printed it does seem a bit fishy to me that you’d get that message (seems to me that the 0 should be non-zero).

ctonsing · October 22, 2021, 8:06am

I see. Yeah, the log file is 5.2 GB in size, so not so easy to wade through. Although I couldn’t stare at the output for 4.5 hours to monitor the progress, I believe there were 1 or 2 times during the backup that it was stuck for a while without the stats incrementing, but so far I don’t see much more than that somewhat fishy message I mentioned before. This is how it appears in the debug log:

2021/10/21 15:33:12 debug/round_tripper_debug.go:49     debug.(*eofDetectReader).Close  214     body not drained, 0 bytes not read: Close()
github.com/restic/restic/internal/debug.(*eofDetectReader).Close
        github.com/restic/restic/internal/debug/round_tripper_debug.go:49
github.com/restic/restic/internal/backend/s3.wrapReader.Close
        github.com/restic/restic/internal/backend/s3/s3.go:311
github.com/restic/restic/internal/backend.DefaultLoad
        github.com/restic/restic/internal/backend/utils.go:58
github.com/restic/restic/internal/backend/s3.(*Backend).Load
        github.com/restic/restic/internal/backend/s3/s3.go:319
github.com/restic/restic/internal/backend.(*RetryBackend).Load.func1
        github.com/restic/restic/internal/backend/backend_retry.go:89
github.com/cenkalti/backoff/v4.RetryNotifyWithTimer
        github.com/cenkalti/backoff/v4@v4.1.1/retry.go:55
github.com/cenkalti/backoff/v4.RetryNotify
        github.com/cenkalti/backoff/v4@v4.1.1/retry.go:34
github.com/restic/restic/internal/backend.(*RetryBackend).retry
        github.com/restic/restic/internal/backend/backend_retry.go:46
github.com/restic/restic/internal/backend.(*RetryBackend).Load
        github.com/restic/restic/internal/backend/backend_retry.go:87
github.com/restic/restic/internal/restic.ReadAt
        github.com/restic/restic/internal/restic/readerat.go:32
github.com/restic/restic/internal/repository.(*Repository).LoadBlob
        github.com/restic/restic/internal/repository/repository.go:197
github.com/restic/restic/internal/repository.(*Repository).LoadTree
        github.com/restic/restic/internal/repository/repository.go:709
github.com/restic/restic/internal/archiver.(*Archiver).loadSubtree
        github.com/restic/restic/internal/archiver/archiver.go:211
github.com/restic/restic/internal/archiver.(*Archiver).Save
        github.com/restic/restic/internal/archiver/archiver.go:461
github.com/restic/restic/internal/archiver.(*Archiver).SaveTree
        github.com/restic/restic/internal/archiver/archiver.go:569
github.com/restic/restic/internal/archiver.(*Archiver).Snapshot.func1
        github.com/restic/restic/internal/archiver/archiver.go:800
gopkg.in/tomb%2ev2.(*Tomb).run
        gopkg.in/tomb.v2@v2.0.0-20161208151619-d5d1b5820637/tomb.go:163
runtime.goexit
        runtime/asm_amd64.s:1581

There are 480362 of these in the log.

rawtaz · October 22, 2021, 10:37am

Yeah it’s a big file alright. It would be nice to see a bit more context but I can’t say there’s anything in particular I’m looking for.

Can you reproduce this at will? If yes, can you possibly try to run it with Amazon S3 to see if it’s isolated to the repository server you’re currently using? Also, please provide the output of restic version as well as what OS version and architecture you’re on.

ctonsing · October 22, 2021, 12:35pm

I understand. The extremely long delay I cannot reproduce at will. I think the case of the stats not incrementing for quite some time occasionally (say for a good number of minutes) happens quite regularly. The fishy log message as shown above I can reproduce consistently, there’s always a whole heap of them. I’ll see what I can do with a different backend maybe…

Output from restic version is:

restic 0.12.1 (v0.12.1-115-gfb5d9345) compiled with go1.17.2 on darwin/amd64

Architecture is x86_64

OS is:

ProductName:	macOS
ProductVersion:	11.0.1
Darwin Kernel Version 20.1.0

rawtaz · October 22, 2021, 12:41pm

Is that really the first final release of macOS Big Sur? If so you should probably update it to the latest which I think is 11.6

MichaelEischer · October 23, 2021, 4:51pm

How many directories does the backup include?

That warning is printed when trying to access the tree metadata of the parent snapshot. It also means that using the cache in restic could speed up the backup a lot, just about anything will be faster than issuing a half million requests to download some file part. The warning itself shouldn’t be a problem, after all there are “0 bytes” left to read.

ctonsing · October 25, 2021, 7:45am

Ah, I see, thanks for clarifying @MichaelEischer! Yes, it is a large set of directories, not sure exactly how many, but it’s the whole directory structure on the Mac ("/"), with a few exclusions.
I will need to rethink the cache situation a little bit, at the moment it is disabled, but will ponder on that a bit.
Good to know the warning itself is not a problem. Why is it not finding an eof though? And if this is normal, maybe the message should not be printed?

In any case, my main concern is still the 77 hour delay waiting on a network call. Even if it is 1 hour or 30 mins waiting on a single network call, this should not happen in my view, it should somehow just timeout after a more reasonable amount of time…

MichaelEischer · October 25, 2021, 8:40pm

The debug log is not used that often and the combination with --no-cache is even rarer. What happens here is that restic e.g. requests 50 bytes and also only reads exactly 50 bytes. That way it won’t trigger the end-of-file state.

There are currently no timeouts for the actual download step, the problem there is setting a value which works for every one. It would probably have to be something like requiring a minimum download speed.