I haven’t had a chance to try PR 2513 yet, because I’m letting my existing prune operations finish (I have a few more repos to go). Even deleting packs is taking a very long time. Is this normal, or is B2 getting in the way? Does PR 2513 address it by issuing more requests in parallel?
On a cloud server with multi-gigabit bandwidth to B2:
```
repository ... opened successfully, password is correct
counting files in repo
building new index for repo
[4:51:12] 100.00% 209957 / 209957 packs
repository contains 209957 packs (2750342 blobs) with 1.001 TiB
processed 2750342 blobs: 0 duplicate blobs, 0 B duplicate
load all snapshots
find data that is still in use for 19 snapshots
[1:17] 100.00% 19 / 19 snapshots
found 461704 of 2750342 data blobs still in use, removing 2288638 blobs
will remove 0 invalid files
will delete 170280 packs and rewrite 25605 packs, this frees 913.599 GiB
[11:49:09] 100.00% 25605 / 25605 packs rewritten
counting files in repo
[21:11] 100.00% 23802 / 23802 packs
finding old index files
saved new indexes as [...]
remove 1823 old index files
[6:53:33] 13.99% 27413 / 195885 packs deleted
```
Over the course of these very long prune operations, the cache has been shrinking considerably. So at least in my case, the large cache was caused by retaining data (lots of small files) for almost a year without pruning.
I did that for a while, approximately a year, but I wedged myself into a corner when it came time to prune. For a couple of VMs, I actually had to perform the prune on a desktop with over 200 GB of free space: the entire cache was needed to perform the prune, and it took 3 days to complete. Once the prune finished, though, the cache was small (around 1-10 GB depending on the machine).
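If you want to see how much of that disk the cache is actually using, something like the following works; the cache path is the default location on Linux, and `restic cache --cleanup` (available in recent restic releases) removes cache directories for repositories that haven’t been used recently:

```shell
# Default restic cache location on Linux (adjust for macOS/Windows).
CACHE_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/restic"

# Report total cache size on disk.
du -sh "$CACHE_DIR" 2>/dev/null || echo "no cache at $CACHE_DIR"

# Remove cache directories for repositories not accessed recently.
restic cache --cleanup
```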
I suggest pruning more often, every couple of weeks to a month. Done that frequently, a prune only takes around an hour and the cache stays small. This assumes you have a reasonable snapshot policy in place, one that leaves on the order of 25 snapshots retained; your cache will always remain large if you keep every snapshot.
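As a sketch, a retention policy along these lines would keep roughly that many snapshots; the exact `--keep-*` counts here are illustrative, not a recommendation, and the repository/password environment variables are assumed to be set elsewhere:

```shell
# Assumes RESTIC_REPOSITORY and RESTIC_PASSWORD (or equivalents) are exported.
# Thin out old snapshots according to a retention policy, then prune
# the unreferenced data in the same run.
restic forget \
  --keep-daily 7 \
  --keep-weekly 5 \
  --keep-monthly 12 \
  --prune
```

Scheduling this via cron (or a systemd timer) every few weeks keeps each individual prune short, instead of one multi-day prune after a year of growth.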
I think performance improvements for prune are needed so that people are more likely to run it, but clearer documentation would also help a lot of people understand that it really is needed on a semi-regular basis. Sure, you don’t have to prune, but for any active workload being backed up, the remote repo will quickly exceed 1 TB and the cache will become unmanageable.