Low performance after first backup, and restic seems to spend a lot of time doing nothing

Hello everyone,

Running restic v0.9.4 on macOS High Sierra on a Mac Pro 2013 with a quad-core Xeon CPU and 64 GB of ECC RAM.

EDIT: Backend is ‘plain’ (non-encrypted) Google Drive with an unlimited G Suite Business account.

I’m using restic to back up a directory tree of about 33 million files / 23 TiB. After the initial backup took over a month (long story, thanks @fd0 for all the assistance on restic’s IRC channel during that time), I’m now running the daily backups, which are (to my great surprise) also taking a lot of time.

Case in point: yesterday’s backup took over 16 hours, and I can find no bottleneck on the machine or on the internet connection. It’s a fast quad-core Xeon with 64 GB of RAM and an 8-disk ZFS RAIDZ2 array which, according to top and zpool iostat, stays far from saturation the entire time, whether in CPU, memory, or disk/network bandwidth; and the 50 Mbps internet link stayed under 10 Mbps utilization for the whole run.
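
(For reference, the utilization checks were along these lines; the pool name is a placeholder for my real one:)

# CPU and memory usage, refreshing every 5 seconds
top -l 0 -s 5 -o cpu
# per-vdev throughput on the ZFS pool, every 5 seconds
zpool iostat -v tank 5
# per-process network throughput
nettop -m tcp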

What’s worse, according to restic’s own output on stderr, restic seems to be spending minutes at a time doing absolutely nothing; see for example:

[3:15:24] 0.78%  103832 files 182.084 GiB, total 32916091 files 22.897 TiB, 0 errors ETA 416:06:18
[3:16:13] 0.78%  103832 files 182.084 GiB, total 32916091 files 22.897 TiB, 0 errors ETA 417:50:39

Not only does the “103832 files 182.084 GiB” counter stay unchanged for almost a full minute as shown above, and the ETA jump by almost 2 hours from one line to the next (restic’s own admission of how little progress is being made), but a dtruss -p on the restic PID also shows that restic made absolutely no system calls during that same period:

# ps aux | grep restic
root            93731 [REDACTED] restic [REDACTED]

# dtruss -p 43991
SYSCALL(args)            = return
(and nothing else is shown until I interrupt it with ^C, so no system calls are being made)

Meanwhile, top shows the restic process above using a lot of CPU, around 161% if memory serves (I didn’t copy that screen).
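
I suppose I could also grab user-space stacks with macOS’s bundled sample tool next time it stalls (a sketch, assuming a single restic process is running):

# sample the restic process for 10 seconds and write the call stacks to a file
sudo sample $(pgrep -x restic) 10 -file /tmp/restic-sample.txt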

Can someone help me understand what is going on and, more importantly, how I can fix it?

Thanks in advance,
– Durval Menezes.

Hey @durval and welcome to the forum! :slight_smile:

Maybe I missed it, but could you tell us what backend you are using for storing your backups?
I take it you are backing up data that is locally attached to the Mac, to an external backend, right?

Hi @moritzdietz,

Thanks! :slight_smile:

My bad. Backend is ‘plain’ (non-encrypted) Google Drive with an unlimited G Suite Business account. I will also edit my post above to add this information.

Exactly. The local data is stored on the 8-disk ZFS RAIDZ2 array I mentioned, and the external backend is as detailed above.

Thanks,
– Durval.

That sounds like you are using rclone in combination with restic, right?
If that’s the case then I think there are some tweaks on the rclone side of things you can do - since it handles the connection to your backend.
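
For example, restic has extended options for its rclone backend (rclone.program / rclone.args) that control how rclone is invoked; something along these lines might be worth experimenting with (the remote name, repository path and chunk-size value below are just examples, not recommendations):

# pass extra Google Drive flags to the rclone process that restic launches;
# note that rclone.args replaces the defaults, so keep "serve restic --stdio"
restic -r rclone:gdrive:restic-repo \
  -o rclone.args="serve restic --stdio --drive-chunk-size 64M" \
  backup /path/to/data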

[quote=“moritzdietz, post:4, topic:1727”]
That sounds like you are using rclone in combination with restic, right?
[/quote]

Yes, restic can only work with Google Drive via rclone, as it has no direct support for it.

Thanks for your thoughts, but I don’t think that’s it: without any tuning, rclone copy can easily saturate my internet connection. Whatever is happening is either in restic proper, or perhaps in its interaction with rclone (using the latter’s ‘server’ mode).
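
For completeness, the setup is the standard rclone backend (remote and repository names below are placeholders for my real ones):

# restic launches rclone itself when the repository uses the rclone: prefix
restic -r rclone:gdrive:restic-repo backup /tank/data
# under the hood, restic runs roughly the equivalent of:
#   rclone serve restic --stdio gdrive:restic-repo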

Hello everyone,

This has just happened again, and this time, thanks to @fd0 clueing me in on IRC, I captured a CPU profile while it was happening:

curl 'http://localhost:6060/debug/pprof/profile?seconds=30' 2>/dev/null > profile.pprof

I then opened the file with go tool pprof and pulled its top10:

go tool pprof profile.pprof 
Type: cpu
Time: May 16, 2019 at 8:47pm (-03)
Duration: 30s, Total samples = 10.21s (34.03%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 9.69s, 94.91% of 10.21s total
Dropped 32 nodes (cum <= 0.05s)
Showing top 10 nodes out of 23
      flat  flat%   sum%        cum   cum%
     4.78s 46.82% 46.82%      8.88s 86.97%  runtime.scanobject
     1.65s 16.16% 62.98%      1.84s 18.02%  runtime.findObject
     1.02s  9.99% 72.97%      1.02s  9.99%  runtime.nanotime
     0.58s  5.68% 78.65%      0.62s  6.07%  runtime.pageIndexOf
     0.57s  5.58% 84.23%      0.57s  5.58%  runtime.heapBits.bits
     0.39s  3.82% 88.05%      1.33s 13.03%  runtime.greyobject
     0.34s  3.33% 91.38%      0.35s  3.43%  runtime.heapBits.next
     0.15s  1.47% 92.85%      0.17s  1.67%  runtime.spanOf
     0.12s  1.18% 94.03%      0.12s  1.18%  runtime.markBits.isMarked
     0.09s  0.88% 94.91%      0.09s  0.88%  runtime.memmove
(pprof) 

So it seems that runtime-internal routines are basically all that’s running. I’m no Go programmer, but I remember talking to @fd0 about Go’s garbage collector, and he mentioned that runtime.scanobject is part of garbage collection. Unfortunately, I can’t make heads or tails of the rest.
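
If it really is the garbage collector, I suppose the next run could confirm it with Go’s built-in GC trace (a sketch; repository and paths are placeholders):

# GODEBUG=gctrace=1 makes the Go runtime print one line per collection to stderr,
# mixed in with restic's own status output
GODEBUG=gctrace=1 restic -r rclone:gdrive:restic-repo backup /tank/data 2> /tmp/restic-gctrace.log
# if collections are too frequent, GOGC trades memory for fewer of them (default is 100)
GOGC=200 restic -r rclone:gdrive:restic-repo backup /tank/data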

Can anyone help me diagnose what’s going on?

Thanks in advance,
– Durval.