Is it possible to change/suggest/hint different blob/pack file sizes?

I found the design doc (which is really good, thank you for THAT) as well as these sections:

There may be an arbitrary number of index files, containing information on non-disjoint sets of Packs. The number of packs described in a single file is chosen so that the file size is kept below 8 MiB.

Files smaller than 512 KiB are not split, Blobs are of 512 KiB to 8 MiB in size. The implementation aims for 1 MiB Blob size on average.

So first, my apologies as I’ve never done anything in go and don’t have particular coding skill in any other language, either… so while I’ve tried to scan the code a little this is out of my depth. Also let me make it clear that I’m not complaining or asking for any changes, I’m mostly just curious how it works. :wink:

Question 1: I’ve run a backup where the source data is ~647GB in size and ~94k unique files on the source file system. Backed up to B2, it’s 661GB and 128k files. This translates to roughly 5.1MB per “file” on the backend… which is very different than the “aims for 1MiB” mentioned in the docs. What might be causing this?

Question 2: is there a way to change, control, or hint to the system that you’d like the blobs to target a different size? I have a number of filesystems where the average file size is extremely small, but I also have quite a few where the average file size is quite large and very static: these are video and/or image files where there are virtually never any “slight” changes… they’re 50MB and either stay that way forever, or if they are changed (e.g. transcoded) they’re edited in such a way that the entire file changes (i.e. content defined chunking as described here doesn’t help).

Again, I’m mostly just curious how this works and what things we do and do not have control over, rather than complaining about anything that actually matters. :wink:

Hey, thanks for your interest in the restic foundations!

The size mentioned in the docs is the size per blob; the files in the repo are actually so-called “pack files”, which bundle more than one blob together. Sometimes a pack file contains several thousand blobs. A blob is a part of a file, produced after Content Defined Chunking (CDC) has been run.
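To make that concrete, here’s a rough sketch of how a file gets split into blobs with the chunker library (just an illustration using what I believe is the chunker package’s public API, not restic’s actual backup code):

```go
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/restic/chunker"
)

func main() {
	f, err := os.Open("some-large-file") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// The polynomial parameterises the rolling hash; restic picks one per repository.
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		panic(err)
	}

	// The chunker emits chunks between roughly 512 KiB and 8 MiB
	// (the hard-coded limits mentioned in the design doc).
	chk := chunker.New(f, pol)
	buf := make([]byte, chunker.MaxSize)

	for {
		chunk, err := chk.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// Each chunk becomes one blob; blobs are later bundled into pack files.
		fmt.Printf("blob at offset %d, length %d\n", chunk.Start, chunk.Length)
	}
}
```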

No, that’s not possible right now. The limits are currently hard-coded into the chunker library. We may expose this at some point, but there are no plans to make it user-configurable. I can understand where you’re coming from, and I think restic will handle the situation well with the current size limits, even for such large static files.

I’m glad you’re asking :slight_smile:

Ah - ok… I found the chunker library and specifically where these are set… so are you saying that if I manually mangled the chunker min/max sizes, this would flow through to the pack file sizes? Sorry if this is a dense question; I think the hierarchy is:

the chunker looks at chunks of a file, computes the rolling hash, and eventually spits out blobs. Then <<something?>> takes those blobs and combines them into pack files. Then we store the pack files in the repo… ?

SO… maybe what I’m asking for is actually what controls how large a pack file ends up being, rather than how large the component blobs are?

This all came about from my messing around and noticing that whenever I do a forget/prune, it creates a ton of API calls to/from the repo. Or I should say at least on a B2 backend, which is what I use… I can see the number of what Backblaze calls Class B transactions incrementing rapidly. These aren’t free (although please don’t think I’m complaining about spending 5c/day…), so I was trying to reverse engineer why these were constantly getting called even when I’m doing a prune where very little state actually changed.

I’m further assuming that the API actually getting called over and over is b2_get_file_info: if I could make the pack files larger, this would, at least in my use case, greatly reduce the number of files that live in the repo, and thus a prune would result in a lot fewer API calls, and that would change my daily prune cost from 8c down to 4c! :grinning:

But really, I’m just trying to dig into how it all works… because this is at least as interesting as my day job. :wink:

You got the hierarchy right: chunks from the chunker are used as blobs, which are then bundled into pack files. The limit for the pack file size is configured in internal/repository/packer_manager.go here. As soon as a pack file is larger than 4MiB, it’s uploaded to the repo. You can play around with that value to get larger pack files.
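Very much simplified, the idea is just “keep adding blobs to an open pack and upload it once it crosses the threshold” - something like this toy sketch (an illustration, not the actual packer_manager.go code):

```go
package main

import "fmt"

// minPackSize mirrors the ~4 MiB threshold mentioned above; this is the kind
// of value you'd bump up to get larger pack files.
const minPackSize = 4 * 1024 * 1024

// pack is a toy stand-in for restic's packer: it only tracks blob count and size.
type pack struct {
	blobs int
	size  int
}

func (p *pack) add(blobLen int) {
	p.blobs++
	p.size += blobLen
}

// finalizeIfFull "uploads" the pack once it has grown to at least minPackSize
// and reports whether a new pack should be started.
func finalizeIfFull(p *pack) bool {
	if p.size < minPackSize {
		return false // keep the pack open and add more blobs to it
	}
	fmt.Printf("uploading pack with %d blobs, %d bytes\n", p.blobs, p.size)
	return true
}

func main() {
	p := &pack{}
	// Pretend the chunker handed us a stream of 1 MiB blobs.
	for i := 0; i < 10; i++ {
		p.add(1 << 20)
		if finalizeIfFull(p) {
			p = &pack{} // start a new pack after the upload
		}
	}
}
```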

Yeah, we’re aware that the prune function can be optimized a lot. We’re very conservative with it, because it’s the one function which really deletes data permanently. If we get it wrong, it’ll destroy people’s backups.

Cool. Your comment about slow-rolling any changes to prune makes perfect sense to me:
safety here is clearly the better part of valor. Can I help out by testing or profiling things more?

Not really, sorry. It’s on the todo list, but other things are more important right now (such as the new archiver code (#1494)).

Just touching base on this; it’s been a few years since this conversation, and since then prune has gotten a lot better (along with other improvements). Wondering if we can revisit exposing custom blob/pack file sizes.

My current backup set is 550 GB or so, which results in 110k files. Having so many files makes the repo difficult to work with on the file store backend, and it would be nice to bring that down to, say, 10k files…
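(Back-of-the-envelope: 550 GB across 110k files is roughly 5 MB per file, so getting down to ~10k files would mean pack files averaging somewhere around 55 MB.)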


This PR allows setting a minimum size for pack files, but it hasn’t been merged yet:

Oh good to see. I will just wait till it (eventually) goes mainline then.

Sorry to necro-bump, I just wanted to mention that this turned out to be an important feature if you have an S3 backend. A rather large bucket caused us to get rate-limited quite badly while syncing it, due to the total file/request count.

@gurkan Hey, how did you notice the rate limit? Also, have you taken any steps to improve performance with the AWS S3 backend?

Scaleway has a funny message:

<ERROR> Failed to remove `https://s3.nl-ams.scw.cloud/****`. Please reduce your request rate. You are not being rate-limited. Please contact the support if the issue persists.

I didn’t do much specific to S3, actually; the feature in the above-linked PR would’ve helped, though. But I must note I’m not using AWS, just MinIO and Scaleway.

I thought I’d drop in some information on a use case of my own where being able to increase the pack size would be beneficial. I promise it isn’t to (ab)use an overlay file system on an unlimited cloud drive!

I’ve started using Storj, a decentralised file storage service with an Amazon S3 compatible gateway. Although it isn’t as relevant to Restic, they support native encryption where the server has no knowledge of the client’s encryption keys.

The ability to alter the pack size would be beneficial because on Storj a segment, the smallest allocation unit, is 64 MB. The current default pack size uses space inefficiently when using Storj as a backend.

As a result, my backup of ~700 GB has quickly exceeded Storj’s 50,000 segment/month threshold, and I will need to pay a per-segment fee. This would add roughly $2.20 per TB/month (taking the worst-case scenario of one 4MB pack per segment) on top of storage costs.

If I could increase the minimum pack size to a higher value, it would improve the utilization of each segment and decrease storage costs.
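(Rough math: at 4 MB per pack, 1 TB works out to roughly 250,000 segments, which is where the ~$2.20/TB figure above comes from; with 64 MB packs that drops to around 16,000 segments per TB, so my ~700 GB backup would need on the order of 11,000 segments instead of well over the 50,000/month threshold.)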

Finally, thank you for Restic!