I’ve got a deployment with three hosts spread across the world backing up hourly via rest-server to a single repository located on one of them (they share a good bit of data), and then a forget/prune that runs nightly on that host before syncing off-site.
Unfortunately, as the data set has grown, forget never seems to get a chance to run because of contention on the shared/exclusive lock. Each backup run now takes 15-45 minutes, and because backups take a shared lock they regularly overlap one another, so there is never a window in which forget can acquire an exclusive lock, no matter how long it waits. I’ve just updated to 0.16.0 (from 0.9.5) for --retry-lock, which sadly still doesn’t get a chance. Even with my poor manual retry attempts before that, it’s been over a year since a forget/prune was able to run (which is certainly not helping with the repository size!).
It feels like what’s missing here is a way for forget to signal its intention to acquire an exclusive lock, via a new lock type. That lock would prevent any further shared locks from being acquired from that point on, and would become an exclusive lock once the existing shared locks have been released. This would still leave the contention (partly shifting it to the backup side), but I think that’s inevitable and fine for most use cases where this would be a problem. It would mean that operations requiring an exclusive lock couldn’t be starved forever, which I think is a big improvement.
What do you think? Would that be something that could be implemented sanely?
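To make the idea a bit more concrete, here is a rough in-memory sketch of the semantics I have in mind. It is only a model: restic’s real locks are files in the repository, not Python objects, and the class and method names (IntentLock, try_acquire_shared, acquire_exclusive) are made up for illustration and don’t exist in restic.

```python
import threading


class IntentLock:
    """Toy model of a shared/exclusive lock with an "intent" state.

    Hypothetical illustration only; not how restic implements locking.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0         # number of shared holders (running backups)
        self._pending = False    # an exclusive lock has been requested
        self._exclusive = False  # an exclusive lock is currently held

    def try_acquire_shared(self):
        """Backup side: refuse to start while an exclusive lock is held or pending."""
        with self._cond:
            if self._pending or self._exclusive:
                return False
            self._shared += 1
            return True

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            self._cond.notify_all()

    def acquire_exclusive(self, timeout=None):
        """Forget/prune side: announce intent, then wait for shared holders to drain."""
        with self._cond:
            if self._pending or self._exclusive:
                return False
            self._pending = True  # from here on, new shared locks are refused
            drained = self._cond.wait_for(lambda: self._shared == 0, timeout)
            self._pending = False
            if drained:
                self._exclusive = True
            return drained

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()
```

The point is the `_pending` flag: once forget has announced its intent, running backups are allowed to finish but no new ones can start, so the wait is bounded by the longest in-flight backup rather than by luck. If the wait times out, the intent is withdrawn and backups carry on as before.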
Alternatively, are there other architectures that might work better? One idea I’ve seen while perusing the forum today is to have hosts back up to a “staging” repository, then merge the snapshots together in the final repository. But this feels like a large complication of an otherwise simple / obvious setup, and would mostly make sense if eliminating lock contention altogether mattered for a deployment (whereas I don’t mind if the odd backup gets missed, versus prune never being able to run at all).