Consistent gap in upload bandwidth

Love the program, it’s impressively robust and flexible. I am having an “efficiency” issue rather than a show-stopper issue.

> restic version
restic 0.17.3 compiled with go1.23.3 on linux/amd64
sudo restic -r "$RESTIC_REPO" --cache-dir "$RESTIC_CACHE" backup -o local.connections=10 \
    --skip-if-unchanged --password-file "$RESTIC_PASSPATH" --verbose --one-file-system --tag systemd.timer \
    $BACKUP_EXCLUDES "${RESTIC_BACKUP_FOLDERS[@]}"

I am backing up to a remote Samba directory mounted locally and seeing an odd but consistent gap in upload bandwidth. Increasing -o local.connections makes each upload wave last longer, but does not stop the gap from occurring.

See the gap in the screenshot below; there’s a repeating pattern of a drop-off, a spike, and then another wave beginning:

I am applying a speed limit, but this repeating pattern happens with or without the speed limit. Rclone does not exhibit this gap (with or without a speed limit), which makes me think - especially since the time between gaps scales with -o local.connections=10 - that it’s tied to restic uploading N chunks and then waiting for a bit for some reason.

I care because the gap period greatly reduces my effective bandwidth (which is already low).

Any tips to get around this?

The local backend waits until the remote storage has confirmed that all uploaded data is safely written to disk. That would be the most likely cause of a gap in the uploads. Other than that, there’s nothing in the backup code that deliberately waits on anything.
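
A rough way to see what that confirmation costs on a given mount (assuming the share is at /mnt/remote as later in this thread; the file names are throwaway placeholders):

# Time a 16 MiB write with and without waiting for it to reach stable storage.
# conv=fsync makes dd call fsync() once at the end, similar to a synchronous
# upload confirmation; the difference between the two timings is the sync cost.
time dd if=/dev/zero of=/mnt/remote/probe-fsync.dat bs=16M count=1 conv=fsync
time dd if=/dev/zero of=/mnt/remote/probe-nofsync.dat bs=16M count=1
rm /mnt/remote/probe-fsync.dat /mnt/remote/probe-nofsync.dat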

What is the time scale in the image? Is one horizontal dot equivalent to one second? How high is the ping between the local host and the remote target?

If my guess from above is correct, then you can’t avoid it as it is necessary to guarantee the integrity of the uploaded data.

Thanks for replying!

The image timescale is 1 horizontal dot = 2 seconds. The ping is ~100 ms during the gaps and ~105 ms while an upload wave is running.

I used my router’s traffic monitor to get specific times: the send lasts 42 seconds and the gap 22 seconds, which takes my 40 Mibps upload down to 26.25 Mibps effective (best case, ignoring ramp-up).
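
For reference, that effective figure is just the duty cycle applied to the link rate:

awk 'BEGIN { print 40 * 42 / (42 + 22) }'    # 40 Mibps * 42 s / 64 s = 26.25 Mibps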

~100 ms is on the order of spinning-hard-drive latencies, so I took the fio invocation from the Ask Ubuntu post “How to check hard disk performance” and adapted it to measure the latency of 16 MiB sequential reads/writes (to match restic’s 16 MiB default pack size).

(To use it, cd into a directory on the drive you want to test; you have to delete fio-tempfile.dat yourself afterwards.)
Read:
sudo fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=read --size=500m --io_size=10g --blocksize=16m --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
Write:
sudo fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=write --size=500m --io_size=10g --blocksize=16m --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting

slat = submission latency, clat = completion latency
Local spinning drive (read avg / read stdev | write avg / write stdev - all in ms):
slat: 82.183 / 29.899 | 4.442 / 29.969
clat: 1489.66 / 781.72 | 2698.53 / 220.98
bw: 180 MiB/s | 176 MiB/s

Remote samba share mounted locally as a drive (read avg / read stdev | write avg / write stdev - all in ms):
slat: 4.054 / 1.381 | 5.184 / 1.708
clat: 445,418.11 / 33.92 | 33,863.15 / 18,399.84
bw: 8.906 Mibps | 23.984 Mibps

So clearly there’s a massive lag on the remote for file operations. I believe fio chokes on the remote read call, because the entire fio read command lasted about the same 446 s as the reported completion latency, yet I can ls the directory with only about 1 second of lag to see its contents.
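
For a rough number on that metadata lag (same mount point as above):

time ls /mnt/remote/restic-backup > /dev/null      # directory listing over samba
time stat /mnt/remote/restic-backup > /dev/null    # single metadata round-trip, no data moved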

As noted above, rclone seems able to sidestep the issues restic and fio run into and operate at the full possible bandwidth while using its local backend through an alias remote (remote = alias pointing at /mnt/remote).

Switching from -r /mnt/remote/restic-backup to -r rclone:backup:restic-backup made it worse: 16 s of send and 26 s of gap, yielding 15.24 Mibps average.

Using the rclone direct samba remote rclone:backup-samba:backup_usr/restic-backup removes the gap, but it never reaches the 40 Mibps maximum and instead oscillates between 25 Mibps and 30 Mibps, which is still an improvement over the gapped throughput.

Each test takes a long, long time to initialize because this repo isn’t finished uploading yet, but I will try 20 concurrent connections instead of 10 in several hours. Edit: no improvement with more connections.

Is the difference between rclone’s and restic’s handling of the local backend with a samba remote mount (which rclone has no trouble saturating at 40 Mibps) that rclone confirms the uploaded data is safely written to disk asynchronously and continues uploading new data, while restic does not? Just guessing at what yields the observed gap.

restic relies on a synchronous upload confirmation. My guess would be that rclone does not wait for such a confirmation at all and thus can continue faster.

Writing more files in parallel to an overloaded HDD won’t help. Did you check that -o local.connections=10 is really faster than the default? (I’d expect the gap to be shorter if less data were uploaded at once.)

What might be more effective is increasing the pack size used by restic (which reduces the number of upload confirmations). For that, pass the --pack-size option, e.g. --pack-size 64 (some value between 16 and 128 MiB is allowed), to every restic command.
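
Applied to the backup command at the top of the thread, that would look like (same environment variables as before):

sudo restic -r "$RESTIC_REPO" --cache-dir "$RESTIC_CACHE" backup -o local.connections=10 --pack-size 128 \
    --skip-if-unchanged --password-file "$RESTIC_PASSPATH" --verbose --one-file-system --tag systemd.timer \
    $BACKUP_EXCLUDES "${RESTIC_BACKUP_FOLDERS[@]}"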

Thanks for taking the time to reply!

I made a mistake earlier: when testing rclone I left the connection override on local.connections only, so the alias remote (/mnt/remote) run was likely worse because it fell back to the lower default of 5 connections. I have found that higher connection counts lengthen the send period relative to the gap, which seems to stay largely constant.

I do not believe the source drive or target drive is truly overloaded, since rclone can hit the full 40 Mibps upload; rather, as you suggested, some very slow file confirmations jam up restic for some reason.

Testing the restic local backend:
2c/128p - 62s send, 22s gap, 74% duty cycle, 33.03 Mibps send (2*128*8/62), 24.38 Mibps real
5c/128p - 165s send, 30s gap, 84.61% duty cycle, 31.03 Mibps send, 26.26 Mibps real
10c/128p - the waves actually started to bleed together so I could not pick out cycles; there were random short 10-20s gaps but nothing clear. It averaged near 40 Mibps about 90% of the time apart from the gaps, probably at least 36 Mibps.
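
For anyone re-running these numbers, here is the arithmetic above in one place (inputs: connections, pack size in MiB, send and gap in seconds; values shown are the 5c/128p row):

awk -v c=5 -v p=128 -v send=165 -v gap=30 'BEGIN {
    bits = c * p * 8    # one wave uploads c packs of p MiB (as in the 2*128*8/62 figure above)
    printf "send %.2f Mibps, real %.2f Mibps, duty %.2f%%\n", bits/send, bits/(send+gap), 100*send/(send+gap)
}'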

Testing the rclone direct samba remote at 5c/128p and 5c/16p, compared to 10c/16p, did not change the bandwidth cap or the fluctuation between 25 and 30 Mibps. Long-term averaging shows it pulls 29 Mibps (the only way to measure, since without gaps I can’t tell when a cycle starts or stops). 10c/128p seemed a bit worse and 2c/128p a bit better; 3c/128p seemed optimal at 32 Mibps, but that’s really splitting hairs.

So the rclone direct samba remote gets a gapless upload but is hamstrung by a nebulous bandwidth cap, while the restic local backend at 10c/128p pushes the duty cycle up to around 90%, which I guess is as good as it’ll get! The downside, as I understand it, is that there are ~1.4 GiB (11*128 MiB) of temp files being written to disk, which is not the best for my SSD temp drive since it’ll have to write my several TB of data.

Thanks again for your help and suggestions, maximizing pack-size did help the restic local backend a lot and got me to the best perf for restic!

I know restic is a growing project - if it were possible to start uploading new chunks while previous chunks are still being checked (and to throw previous chunks back into the upload queue if they turn out bad, of course!), that would likely fix my case of super-high-latency drives. But the solution I converged on is, as measured, about 90% close enough :wink:

[this testing was done on the just-released 0.18.0; the changelog does not look relevant to this issue]

I’ve noticed that my previous description of the upload behavior was rather beside the point. What actually happens is that files are cut into chunks fully in parallel, but the chunks are still assembled into pack files synchronously. However, the upload of these pack files happens asynchronously as long as not all upload connections are busy. So restic is essentially already doing what you’ve suggested.
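
As a toy shell model of that pipeline (my sketch, not restic’s actual code; sleep stands in for a slow upload plus its synchronous write confirmation):

N=2                     # upload connections (-o local.connections)
for pack in 1 2 3 4 5 6; do
    # Pack assembly is serial; if all N connections are busy it stalls here,
    # which is the visible gap - more connections and the stall disappears.
    while [ "$(jobs -rp | wc -l)" -ge "$N" ]; do wait -n; done
    echo "uploading pack $pack"
    sleep 2 &           # stand-in for upload + confirmation on a high-latency target
done
wait                    # final flush: wait for the remaining uploads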

I see, and the bit about

> the upload of these pack files happens asynchronously as long as not all upload connections are busy

fits with what I observed: 2 connections has a gap, 5 connections has a gap, but 10 connections doesn’t have a clear gap. So with 2 or 5 connections, the upload happens and then very slow file operations - which take up no bandwidth and are just latency-heavy for reasons unknown (samba most likely; wireguard seemed fine with its ~100 ms ping) - hold a connection hostage for a bit, letting me see a gap. With enough connections, they pile up against my low upload limit and manage to upload (almost) continuously. The only limit to piling on connections is that each extra one costs another 128 MiB of (likely?) disk writes.

Thanks again for taking the time to answer and specify about how it all works!

In summary for anyone reading this:

  • To get the best performance (90% of max bandwidth), use the restic local backend with 10+ connections and a 128 MiB pack size. Locally mount your remote samba share via a VPN like wireguard (or any other very-high-latency-but-fine-bandwidth storage) with sudo mount.cifs //$REMOTE_IP/$REMOTE_SHARE '/mnt/remote' -o user=$REMOTE_USER,rw,credentials="$REMOTE_PASSPATH" and call restic with restic -r '/mnt/remote/restic-backup-repo' -o local.connections=10 --pack-size 128 (full sketch after this list).
  • To be lightest on resources at 80% of max bandwidth, use the rclone direct samba remote: create it with sudo rclone config create 'rclone-backup-samba' smb host="$REMOTE_IP" user="$REMOTE_USER" pass="$REMOTE_PASS" --non-interactive --obscure and use it like restic -r "rclone:rclone-backup-samba:$REMOTE_SHARE/restic-backup-repo" -o rclone.connections=3 --pack-size 128
  • I cannot be sure how this changes with more upload bandwidth (a faster link might shift the bottleneck), but for 40 Mibps upload and under, these findings should be applicable.
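
Put together, the first option looks roughly like this as a script (same variable names as above; adjust paths to taste):

#!/usr/bin/env bash
# Mount the samba share over the VPN, then back up with the settings found above.
sudo mount.cifs "//$REMOTE_IP/$REMOTE_SHARE" /mnt/remote \
    -o user="$REMOTE_USER",rw,credentials="$REMOTE_PASSPATH"
sudo restic -r /mnt/remote/restic-backup-repo \
    -o local.connections=10 --pack-size 128 \
    --password-file "$RESTIC_PASSPATH" \
    backup --verbose --one-file-system "${RESTIC_BACKUP_FOLDERS[@]}"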