Does restic parallel back-up behavior change if the same hostname is used?

whereisaaron · October 28, 2019, 5:23pm

I had given up on parallel back-up to the same repo with restic, because it just didn’t work.

Someone (below) suggested that parallel back-ups should specify a different hostname. Is this true? If so, why?

I was backing up from a cluster, so there are no real hostnames, and I was always passing the name of the cluster as the hostname. So many parallel back-ups had the same hostname. Was this my problem?

moritzdietz · October 28, 2019, 7:54pm

The restic backup command does not exclusively lock the repository. So parallel backups to it work.
AFAIK it doesn’t matter if you have the same hostname or not.

What problem were you seeing with your backups if I may ask?

whereisaaron · October 30, 2019, 4:29am

Right now I am running parallel backups to different repos in the same B2 Bucket and everything is smooth and each backup takes 1-5 minutes.

When running those same parallel back-ups to the same repo in the B2 Bucket, they would get slower and slower, 30-60 minutes (instead of 1-5 minutes), and often just fail.

Exactly the same backup load, plenty of excess bandwidth, the difference is one repo or multiple.

The only things of note that I identified were:

Being a remote B2 bucket, latency is high, ~150ms
The repo cache folder is shared at the restic end
I was using the same hostname for all of the backups

@hossainemruz has a ton of experience with restic, so when he says using different hostnames matters, I tend to believe him. But like you, AFAIK it didn’t matter.

cdhowie · October 30, 2019, 6:06am

As someone with more than a passing familiarity with restic (but not exactly intimate knowledge of the source code, either) I can say that I see no reason why using the same or different hostnames would have any effect on the performance of parallel backups. I’m curious about the reasoning that led to this conclusion, because I can’t even fabricate a plausible-yet-wrong train of thought that would lead there.

The hostname is used in exactly two places (AFAIK) when performing a backup: the value is used in selection of the parent snapshot (only if --parent is not given), and the value is recorded in the new snapshot object. I cannot see how either of these operations would be slower because a concurrent restic process is using the same hostname.
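To make the parent selection concrete, here is a minimal sketch of how I understand it, not restic's actual code. The `Snapshot` class and `select_parent` function are hypothetical names, and matching on the backup paths in addition to the hostname is my assumption. The point is that the lookup only filters snapshot metadata, so a concurrent process using the same hostname changes which snapshot is picked, not how much work the lookup does.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical, simplified stand-in for a snapshot object.
    id: str
    time: datetime
    hostname: str
    paths: Tuple[str, ...]


def select_parent(snapshots: Iterable[Snapshot], hostname: str,
                  paths: Iterable[str]) -> Optional[Snapshot]:
    """Return the newest snapshot with the same hostname and paths, or None.

    Assumption: paths are part of the match. This mirrors the case where
    --parent is not given on the command line.
    """
    wanted = tuple(sorted(paths))
    candidates = [
        s for s in snapshots
        if s.hostname == hostname and tuple(sorted(s.paths)) == wanted
    ]
    # max() with default=None handles the "no earlier snapshot" case.
    return max(candidates, key=lambda s: s.time, default=None)
```

With one shared hostname for the whole cluster, only the paths would distinguish one job's snapshots from another's in this sketch.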

When the jobs were taking much longer and failing, how did the consumption of RAM by restic processes differ from the values you see now, using multiple repositories? I ask because, assuming an identical amount of raw, non-deduplicated data, a single repository will have a bigger index than each index (separately) in multiple repositories. Restic memory use, in my experience, scales nearly linearly with the size of the repository index. So using multiple repositories should lower the memory consumption of each restic process.

A larger index, and therefore higher memory consumption, could cause any of the following, in order:

1. Longer restic startup times since more index data has to be read into memory.
2. Higher/any swap use, which would dramatically slow down restic (and probably other stuff on the same system).
3. Thrashing / swap death as processes on the box compete for their memory to be paged in.
4. The OOM killer remedying the situation by killing restic, which would likely be the biggest consumer of memory on the system.

2 and 3 could easily explain a 30-fold increase in backup times, and failed backups would be explained by 4.

If these multiple restic processes could have been running on the same physical machine (depending on the configuration of your cluster) then these problems would be amplified on that machine.

tl;dr: The hostname being the same is, to the best of my knowledge, a red herring. The use of multiple repositories is likely what solved the problem because this means a smaller index per repository, which lowers the memory consumption of restic.

764287 · October 30, 2019, 9:17am

Multiple repositories are probably spread across multiple hard disks/RAIDs, while using only 1 repository might overwhelm the hard disk/RAID. AFAIK Backblaze is still using ordinary low-speed hard disks for their service. Are you by any chance able to do some testing on an SSD-based repository?

whereisaaron · October 30, 2019, 1:27pm

Thanks @cdhowie, I agree I can’t see how hostname would have an impact like this; that’s why I came here, in case there was some secret sauce behavior I just didn’t know about.

It is not an issue of swap or RAM. These are cluster environments; there is no swap and unlimited RAM. Each back-up job gets its own reserved RAM and CPUs and is very likely running on a different cluster node.

The difference between the two scenarios is that in the shared repo scenario, the repo at the remote end is shared, and the repo cache at the restic end is shared. So I am guessing the problem is contention/interference between the multiple backups at one of those places.

Your point about a larger index is a good one. Except, the per-backup repos for individual backups have 20,000+ snapshots in each index at this point, and the individual repo backups aren’t slowing down. But certainly the number of different paths/files would be much much greater with multiple backups in the same repo.

The most interesting thing you said is about identifying the parent. I think that would be more work, and require grokking a larger part of the index, when many different back-ups are adding snapshots to the same repository (and --parent is not given).

Thanks @764287 but that is not credible.

It is the same bucket in both cases so it is the same B2 endpoint/SAN
Each repo is multiple files and directories, B2 can’t tell if a given collection of objects in a bucket is one repo or multiple repos
The throughput of spinning disks is irrelevant when you have hundreds of them in a SAN
The latency of spinning disks is irrelevant when the network latency is already two orders of magnitude higher!
The amount of data is tiny, less than 1 GB per backup

SSD vs HD is very relevant for local repos, but not a thing for huge, remote SANs. I admire your optimism that I can overwhelm B2 with my puny data streams, though.

cdhowie · October 30, 2019, 2:21pm

If there is contention for a local cache then you could try --no-cache to see if that makes restic perform faster.

The only thing I can think of that is inherent to parallel backups that would possibly be slow is locking behavior: restic has to download each existing lock to make sure it’s not an exclusive lock, create its shared lock, then re-check to make sure that an exclusive lock was not created in the meantime. In all other respects, a parallel backup to the same repository proceeds as though no other system is accessing the repository at all; the processes will just add files without any kind of coordination. This is safe because each file’s name is the SHA-256 hash of its contents, so two processes writing to the same filename must be writing the same contents (except in the near-impossible chance of a hash collision) and so it doesn’t matter which process wins.
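As a rough illustration of both points, here is a minimal sketch under stated assumptions, not restic's implementation. `acquire_shared_lock` and `store_blob` are hypothetical names, the lock table is a plain in-memory dict, and a local directory stands in for the remote repository. On a real backend each lock inspected would be a separate download.

```python
import hashlib
import os
import tempfile


def acquire_shared_lock(locks: dict, lock_id: str) -> bool:
    """Check, create, re-check: the three steps described above.

    `locks` maps a lock id to {"exclusive": bool}; in a real repository
    every entry inspected here would cost one round trip to the backend.
    """
    # Step 1: refuse if any existing lock is exclusive.
    if any(entry["exclusive"] for entry in locks.values()):
        return False
    # Step 2: create our own shared (non-exclusive) lock.
    locks[lock_id] = {"exclusive": False}
    # Step 3: re-check in case an exclusive lock appeared in the meantime.
    if any(e["exclusive"] for k, e in locks.items() if k != lock_id):
        del locks[lock_id]
        return False
    return True


def store_blob(repo_dir: str, data: bytes) -> str:
    """Store data under the SHA-256 hex digest of its contents."""
    name = hashlib.sha256(data).hexdigest()
    path = os.path.join(repo_dir, name)
    if not os.path.exists(path):
        # Write to a temporary file, then rename into place, so a
        # reader never sees a half-written file under the final name.
        fd, tmp = tempfile.mkstemp(dir=repo_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
    return name
```

Two writers calling `store_blob` with identical bytes end up at the same file name with the same contents, which is why no coordination is needed after the lock step.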

I would be very interested in a profile of restic in your current setup and a profile from when it was failing.