Is it possible to change/suggest/hint different blob/pack file sizes?

I found the design doc (which is really good, thank you for THAT) as well as these sections:

There may be an arbitrary number of index files, containing information on non-disjoint sets of Packs. The number of packs described in a single file is chosen so that the file size is kept below 8 MiB.

Files smaller than 512 KiB are not split, Blobs are of 512 KiB to 8 MiB in size. The implementation aims for 1 MiB Blob size on average.

So first, my apologies as I’ve never done anything in go and don’t have particular coding skill in any other language, either… so while I’ve tried to scan the code a little this is out of my depth. Also let me make it clear that I’m not complaining or asking for any changes, I’m mostly just curious how it works. :wink:

Question 1: I’ve run a backup where the source data is ~647GB in size and ~94k unique files on the source file system. Backed up to B2, it’s 661GB and 128k files. This translates to roughly 5.1MB per “file” on the backend… which is very different than the “aims for 1MiB” mentioned in the docs. What might be causing this?

Question 2: is there a way to change, control, or hint to the system that you’d like the blobs to target a different size? I have a number of filesystems where the average file size is extremely small, but I also have quite a few where the average filesize is quite large and very static: these are video and/or image files where there are virtually never any “slight” changes to the files… they’re 50MB and they either stay that way forever, or if they are changed (i.e. transcoded) they’re edited in such a way that the entire file changes (i.e. content defined chunking as described here doesn’t help)

Again, I’m mostly just curious how this works and what things we do and do not have control over, rather than complaining about anything that actually matters. :wink:

Hey, thanks for your interest in the restic foundations!

The size mentioned in the docs is the size per blob. The files in the repo are actually so-called “pack files”, which bundle more than one blob together. Sometimes a pack file contains several thousand blobs. A blob is a part of a file, produced by running Content Defined Chunking (CDC) on it.

No, that’s not possible right now. The limits are at the moment hard coded into the chunker library. We may expose this at some point, but there are no plans yet to make it user configurable. I can understand where you’re coming from, and I think restic will handle the situation well with the current size limits, even for such large static files.
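To illustrate how minimum and maximum size limits shape the chunks, here is a minimal sketch of content-defined chunking. It is not the chunker library's code: the rolling hash is a simple shift-and-add stand-in rather than a Rabin fingerprint, the limits are scaled down so the demo runs on a small buffer, and the names `splitChunks`, `demoData`, `minSize`, `maxSize` and `avgMask` are hypothetical.

```go
package main

import "fmt"

// Hypothetical, scaled-down limits for illustration only. The design doc
// quoted earlier gives 512 KiB (min), 8 MiB (max) and about 1 MiB (average).
const (
	minSize = 512       // no boundary is accepted before this many bytes
	maxSize = 8 * 1024  // a boundary is forced once a chunk reaches this size
	avgMask = 1<<10 - 1 // cut when the low 10 bits of the hash are all zero
)

// splitChunks cuts data at content-defined boundaries. The hash is a simple
// shift-and-add rolling hash (a stand-in, not a Rabin fingerprint): each
// byte's contribution is shifted out of the 32-bit state after 32 steps.
func splitChunks(data []byte) [][]byte {
	var chunks [][]byte
	start := 0
	var h uint32
	for i, b := range data {
		h = (h << 1) + uint32(b)*2654435761
		size := i - start + 1
		if size < minSize {
			continue // too small to cut, whatever the hash says
		}
		if h&avgMask == 0 || size >= maxSize {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:]) // trailing chunk may be short
	}
	return chunks
}

// demoData returns n deterministic pseudo-random bytes from a small LCG.
func demoData(n int) []byte {
	out := make([]byte, n)
	x := uint32(1)
	for i := range out {
		x = x*1664525 + 1013904223
		out[i] = byte(x >> 24)
	}
	return out
}

func main() {
	chunks := splitChunks(demoData(200000))
	smallest, largest := len(chunks[0]), len(chunks[0])
	for _, c := range chunks {
		if len(c) < smallest {
			smallest = len(c)
		}
		if len(c) > largest {
			largest = len(c)
		}
	}
	fmt.Println("chunks:", len(chunks), "smallest:", smallest, "largest:", largest)
}
```

In this sketch the mask sets roughly how often a boundary is found, while the minimum and maximum only clamp the result. That is why changing those limits changes blob sizes but not, by itself, how many blobs end up in one file in the repo.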

I’m glad you’re asking :slight_smile:

Ah - ok… I found the chunker library and specifically where these are set… so are you saying that if I manually mangled the chunker min/max sizes this would flow through to the pack file sizes? Sorry if this is a dense question, I think the hierarchy is:

the chunker looks at chunks of a file, computes the rolling hash, and eventually spits out blobs. Then <<something?>> takes those blobs and combines them into pack files. Then we store the pack files in the repo… ?

SO… maybe what I’m asking for is actually what controls how large a pack file ends up being, rather than how large the component blobs are?

This all came about from my messing around and noticing that whenever I do a forget/prune, it creates a ton of system calls to/from the repo. Or I should say at least on a b2 backend, which I use… I can see the number of what Backblaze calls Class B transactions incrementing rapidly. These aren’t free (although please don’t think I’m complaining about spending 5c/day…) so I was trying to reverse engineer why these were constantly getting called even when I’m doing a prune that has very little state that actually changed.

I’m further assuming that the API actually getting called over and over is b2_get_file_info: if I could make the pack files larger this would, at least in my use case, greatly reduce the number of files that live in the repo, and thus a prune would result in a lot fewer API calls, and that would change my daily prune cost from 8c down to 4c! :grinning:

But really, I’m just trying to dig into how it all works… because this is at least as interesting as my day job. :wink:

You got the hierarchy right: chunks from the chunker are used as blobs, then bundled into pack files. The limit for the pack file is configured in internal/repository/packer_manager.go here. As soon as a pack file is larger than 4 MiB, it’s uploaded to the repo. You can play around with the value to get larger pack files.
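A minimal sketch of that bundling step, assuming blobs are simply appended to the current pack and the pack is uploaded once it grows past the limit. The `packer` type and its methods are hypothetical and are not taken from packer_manager.go; pack headers, encryption overhead and concurrency are ignored.

```go
package main

import "fmt"

// packLimit mirrors the 4 MiB threshold mentioned above. In this sketch,
// raising it is the knob that yields larger pack files.
const packLimit = 4 << 20

// packer is a hypothetical stand-in for the real packer manager.
type packer struct {
	limit   int   // upload threshold in bytes
	current int   // bytes buffered in the pack being built
	packs   []int // sizes of the packs already "uploaded"
}

// add appends one blob and uploads the pack once it exceeds the limit.
func (p *packer) add(blobSize int) {
	p.current += blobSize
	if p.current > p.limit {
		p.flush()
	}
}

// flush "uploads" whatever is buffered. It is also called once at the end
// so that the last, partially filled pack is not lost.
func (p *packer) flush() {
	if p.current > 0 {
		p.packs = append(p.packs, p.current)
		p.current = 0
	}
}

func main() {
	p := &packer{limit: packLimit}
	for i := 0; i < 12; i++ {
		p.add(1 << 20) // twelve blobs of 1 MiB each
	}
	p.flush()
	fmt.Println(p.packs) // sizes in bytes: 5 MiB, 5 MiB, 2 MiB
}
```

Under these assumptions, blobs of about 1 MiB produce packs of about 5 MiB, because the fifth blob is the one that pushes the pack over the limit. That would be consistent with the roughly 5.1MB per backend file observed earlier in the thread, though the real numbers also depend on the actual blob sizes.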

Yeah, we’re aware that the prune function can be optimized a lot. We’re very conservative with that, because it’s the one function which really deletes data permanently. If we get it wrong, it’ll destroy people’s backups.

Cool. Your comment about slow-rolling any changes to prune makes perfect sense to me: safety here is clearly the better part of valor. Can I help out by testing or profiling things more?

Not really, sorry. It’s on the todo list, but other things are more important right now (such as the new archiver code (#1494)).

Just touching base on this; it’s been a few years since this conversation, and since then prune has gotten a lot better (along with other improvements). I’m wondering if we can revisit exposing custom blob/pack file sizes.

My current backup set is 550 GB or so, which results in 110k files. Having so many files makes it difficult to work with on the file store backend, and it would be nice to bring that down to, say, 10k files…
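As a rough back-of-the-envelope check of what that target implies, here is the arithmetic as a small sketch. The function name `avgFileSize` is made up for illustration, the figures are the approximate ones above in decimal units, and it assumes every backend file is a pack of about the same size.

```go
package main

import "fmt"

// avgFileSize returns the mean size in bytes when totalBytes is spread
// evenly over the given number of backend files.
func avgFileSize(totalBytes, files int64) int64 {
	return totalBytes / files
}

func main() {
	const total = 550_000_000_000 // about 550 GB in the repository

	fmt.Println("current:", avgFileSize(total, 110_000)/1_000_000, "MB per file") // 5 MB
	fmt.Println("target: ", avgFileSize(total, 10_000)/1_000_000, "MB per file")  // 55 MB
}
```

So getting from 110k files down to about 10k would mean pack files roughly eleven times larger than today, on the order of 55 MB each.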


This PR allows setting a minimum size for pack files, but it hasn’t been merged yet:

Oh good to see. I will just wait till it (eventually) goes mainline then.