I found the design doc (which is really good, thank you for that!) and these sections in particular:
There may be an arbitrary number of index files, containing information on non-disjoint sets of Packs. The number of packs described in a single file is chosen so that the file size is kept below 8 MiB.
Files smaller than 512 KiB are not split, Blobs are of 512 KiB to 8 MiB in size. The implementation aims for 1 MiB Blob size on average.
So first, my apologies: I’ve never done anything in Go, and I don’t have particular coding skill in any other language either, so while I’ve tried to scan the code a little, this is out of my depth. Also let me make it clear that I’m not complaining or asking for any changes; I’m mostly just curious how it works.
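That said, here’s the rough mental model I pieced together from scanning: the splitting seems to live in a standalone library, github.com/restic/chunker, and a toy program like the following (the file name is made up, and I may well be misreading the API) appears to reproduce the 512 KiB–8 MiB splitting quoted above:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"github.com/restic/chunker"
)

func main() {
	// In a real repository the polynomial is fixed per repo; a random
	// one is enough to see the splitting behavior.
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("some-large-file.bin") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// chunker.MinSize and chunker.MaxSize are the 512 KiB / 8 MiB
	// bounds mentioned in the design doc quote above.
	c := chunker.New(f, pol)
	buf := make([]byte, chunker.MaxSize)

	for {
		chunk, err := c.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// Each chunk would become one blob.
		fmt.Printf("blob at offset %d, length %d bytes\n", chunk.Start, chunk.Length)
	}
}
```

If I’m reading the design doc right, those blobs then get packed together into the Pack files that the index files describe, so the blobs themselves aren’t what lands on the backend one-to-one.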
Question 1: I’ve run a backup where the source data is ~647 GB in size and ~94k unique files on the source file system. Backed up to B2, it’s 661 GB and 128k files. That works out to roughly 5.1 MB per “file” on the backend, which is very different from the “aims for 1 MiB” mentioned in the docs. What might be causing this?
Question 2: Is there a way to change, control, or hint to the system that you’d like the blobs to target a different size? I have a number of filesystems where the average file size is extremely small, but I also have quite a few where the average file size is quite large and very static: these are video and/or image files that virtually never see “slight” changes. They’re 50 MB and either stay that way forever, or, if they are changed (e.g. transcoded), they’re rewritten in such a way that the entire file changes (i.e. the content-defined chunking described here doesn’t help).
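For what it’s worth, while scanning I noticed the chunker library itself appears to accept custom boundaries via NewWithBoundaries, so the sizes look tunable at that layer, even if restic itself doesn’t expose a knob for it as far as I can tell. Here’s a sketch of the kind of “hint” I mean, calling the library directly with made-up numbers (not an actual restic option, and again, I may be misreading things):

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"github.com/restic/chunker"
)

func main() {
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("big-video.mp4") // hypothetical 50 MB media file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Hypothetical boundaries: 4 MiB minimum, 64 MiB maximum, instead of
	// the default 512 KiB / 8 MiB. This calls the library directly; it is
	// not something restic lets you configure (as far as I can tell).
	const minSize, maxSize = 4 * 1024 * 1024, 64 * 1024 * 1024
	c := chunker.NewWithBoundaries(f, pol, minSize, maxSize)
	buf := make([]byte, maxSize)

	for {
		chunk, err := c.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("chunk length %d bytes\n", chunk.Length)
	}
}
```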
Again, I’m mostly just curious how this works and what we do and don’t have control over; I’m not complaining about anything that actually matters.