I had given up on parallel back-up to the same repo with restic, because it just didn’t work.
Someone (below) suggested that parallel back-ups should specify a different hostname. Is this true? If so why?
I was backing up from a cluster, so there are no real hostnames, and I was always passing the name of the cluster as the hostname. So many parallel back-ups had the same hostname. Was this my problem?
As someone with more than a passing familiarity with restic (but not exactly intimate knowledge of the source code, either) I can say that I see no reason why using the same or different hostnames would have any effect on the performance of parallel backups. I’m curious about the reasoning that led to this conclusion, because I can’t even fabricate a plausible-yet-wrong train of thought that would lead there.
The hostname is used in exactly two places (AFAIK) when performing a backup: the value is used in selection of the parent snapshot (only if --parent is not given), and the value is recorded in the new snapshot object. I cannot see how either of these operations would be slower because a concurrent restic process is using the same hostname.
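To make the parent-selection point concrete, here is a minimal sketch of how I understand it, not restic's actual code. The `Snapshot` type and `pick_parent` function are hypothetical names, and matching on paths as well as hostname is my assumption about the behavior. The point is that the lookup is a simple filter over snapshot metadata, and nothing in it depends on what other processes are doing.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, simplified stand-in for a snapshot's metadata."""
    id: str
    hostname: str
    paths: Tuple[str, ...]
    time: datetime


def pick_parent(
    snapshots: Sequence[Snapshot],
    hostname: str,
    paths: Sequence[str],
    explicit_parent: Optional[str] = None,
) -> Optional[Snapshot]:
    """Return the snapshot to use as parent, or None if there is none."""
    if explicit_parent is not None:
        # An explicitly given parent bypasses the hostname lookup entirely.
        return next((s for s in snapshots if s.id == explicit_parent), None)
    # Assumption: candidates must match both hostname and backup paths.
    candidates = [
        s for s in snapshots
        if s.hostname == hostname and s.paths == tuple(paths)
    ]
    # The most recent matching snapshot wins.
    return max(candidates, key=lambda s: s.time, default=None)


# Three snapshots that all share the cluster name as hostname.
snaps = [
    Snapshot("a1", "cluster", ("/data/job1",), datetime(2020, 1, 1)),
    Snapshot("b2", "cluster", ("/data/job2",), datetime(2020, 1, 2)),
    Snapshot("c3", "cluster", ("/data/job1",), datetime(2020, 1, 3)),
]
parent = pick_parent(snaps, "cluster", ["/data/job1"])
print(parent.id)  # c3
```

Under these assumptions, a shared hostname only matters if the paths are shared too, and even then the effect is on which parent gets picked, not on how long the lookup takes.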
When the jobs were taking much longer and failing, how did the RAM consumption of the restic processes differ from the values you see now, using multiple repositories? I ask because, assuming an identical amount of raw, non-deduplicated data, a single repository will have a bigger index than each index (separately) in multiple repositories. Restic memory use, in my experience, scales nearly linearly with the size of the repository index. So using multiple repositories should lower the memory consumption of each restic process.
A larger index, and therefore higher memory consumption, could cause any of the following, in order:
1. Longer restic startup times since more index data has to be read into memory.
2. Higher/any swap use, which would dramatically slow down restic (and probably other stuff on the same system).
3. Thrashing / swap death as processes on the box compete for their memory to be paged in.
4. The OOM killer remedying the situation by killing restic, which would likely be the biggest consumer of memory on the system.
2 and 3 could easily explain a 30-fold increase in backup times, and failed backups would be explained by 4.
If these multiple restic processes could have been running on the same physical machine (depending on the configuration of your cluster) then these problems would be amplified on that machine.
tl;dr: The hostname being the same is, to the best of my knowledge, a red herring. The use of multiple repositories is likely what solved the problem because this means a smaller index per repository, which lowers the memory consumption of restic.
Multiple repositories are probably spread across multiple hard disks/RAIDs, while using only 1 repository might overwhelm a single hard disk/RAID. AFAIK Backblaze is still using ordinary low-speed hard disks for their service. Are you by any chance able to do some testing on an SSD-based repository?
Thanks @cdhowie, I agree I can’t see how hostname would have an impact like this. That’s why I came here, in case there was some secret-sauce behavior I just didn’t know about.
It is not an issue of swap or RAM. These are cluster environments, there is no swap and unlimited RAM. Each back-up job gets its own reserved RAM and CPUs and very likely running on different cluster nodes.
The difference between the two scenarios is that in the shared repo scenario, the repo at the remote end is shared, and the repo cache at the restic end is shared. So I am guessing the problem is contention/interference between the multiple backups at one of those places.
Your point about a larger index is a good one. Except, the per-backup repos for individual backups have 20,000+ snapshots in each index at this point, and the individual repo backups aren’t slowing down. But certainly the number of different paths/files would be much much greater with multiple backups in the same repo.
The most interesting thing you said is about identifying the parent. I think that would be more work, and require grokking a greater amount of index, when many different back-ups are adding snapshots to the same repository (and --parent is not given).
If there is contention for a local cache then you could try --no-cache to see if that makes restic perform faster.
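A rough way to run that comparison is to time the same backup with and without the cache. This is only a sketch: `build_backup_cmd` and `time_command` are hypothetical helper names, the repository and path values are placeholders, and it assumes `restic` is on the PATH with credentials already set in the environment.

```python
import subprocess
import time
from typing import List


def build_backup_cmd(repo: str, path: str, no_cache: bool) -> List[str]:
    """Build a restic backup command line, optionally disabling the local cache."""
    cmd = ["restic", "-r", repo, "backup"]
    if no_cache:
        cmd.append("--no-cache")
    cmd.append(path)
    return cmd


def time_command(cmd: List[str]) -> float:
    """Run a command to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start
```

Calling `time_command(build_backup_cmd(repo, path, no_cache=False))` and then the same with `no_cache=True`, while the other parallel jobs are running, would show whether the shared cache is where the time goes. A single pair of runs is noisy, so repeating each a few times is worthwhile.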
The only thing I can think of that is inherent to parallel backups that would possibly be slow is locking behavior: restic has to download each existing lock to make sure it’s not an exclusive lock, create its shared lock, then re-check to make sure that an exclusive lock was not created in the meantime. In all other respects, a parallel backup to the same repository proceeds as though no other system is accessing the repository at all; the processes will just add files without any kind of coordination. This is safe because each file’s name is the SHA-256 hash of its contents, so two processes writing to the same filename must be writing the same contents (except in the near-impossible chance of a hash collision) and so it doesn’t matter which process wins.
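The content-addressing argument can be shown with a toy example. This is an in-memory illustration of the idea only, not restic's storage code, and `ContentAddressedStore` is a made-up name.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store: each file is named by the SHA-256 of its contents."""

    def __init__(self) -> None:
        self.files = {}

    def save(self, data: bytes) -> str:
        name = hashlib.sha256(data).hexdigest()
        # Idempotent write: a second writer with identical bytes computes
        # the same name and stores the same contents, so order is irrelevant.
        self.files[name] = data
        return name


store = ContentAddressedStore()
pack = b"identical pack contents"

name_a = store.save(pack)                   # written by "process A"
name_b = store.save(pack)                   # written by "process B"
name_c = store.save(b"different contents")  # unrelated data

print(name_a == name_b)   # True: same contents, same filename
print(len(store.files))   # 2: the duplicate write added nothing
```

Because the name is derived from the bytes, two uncoordinated writers can never leave conflicting data under one filename, which is why no coordination beyond the lock check is needed.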
I would be very interested in a profile of restic in your current setup and a profile from when it was failing.