Could changing the hard-coded average chunking size break things?

I know the average chunk size is hard-coded to 1 MiB, but what I need to back up is on average around 10 KiB. As a result, Restic can't effectively dedup my data. But there is data to deduplicate: Bup finds quite a few duplicates. Of course, Bup has many other shortcomings.

So poking around the code, I found the restic/chunker project, where the 20 bits of zeros is defined.

If I were to change that 20 in each instance in that project (assuming it relates to the chunk size) and compile that myself, would it break things?
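For context, my (possibly wrong) understanding of what that 20 controls, sketched below: with a uniformly distributed rolling hash, a chunk boundary is declared whenever the lowest k bits of the hash are zero, which happens with probability 2^-k, so the expected average chunk size is roughly 2^k bytes. That makes 20 bits ≈ 1 MiB, and a ~10 KiB target would want k around 13 or 14. This is an illustration of the general CDC idea, not restic's actual code:

```go
package main

import "fmt"

func main() {
	// Illustration only: a boundary occurs when the lowest k hash bits are
	// zero (probability 2^-k), so the average chunk size is about 2^k bytes.
	for _, k := range []uint{13, 14, 20} {
		avg := 1 << k
		fmt.Printf("k=%2d -> average chunk ~ %d bytes\n", k, avg)
	}
}
```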

I’m guessing the answer is “yes”, since otherwise implementing an option to change it would be trivial (or so it seems). But still, I feel like I should at least try asking.

No, this should not break anything. But keep in mind that you must only access the repo with binaries built with the same hard-coded values - otherwise you completely lose deduplication.

And there might be some side effects: a smaller chunk size means you have many more chunks, and therefore larger trees and a larger index.
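A back-of-the-envelope calculation makes the scale of that side effect concrete. The 100 GiB repo size and ~64 bytes per index entry are assumptions for illustration, not restic internals:

```go
package main

import "fmt"

func main() {
	// Assumed example: 100 GiB of data, ~64 bytes of index per chunk
	// (a SHA-256 ID plus some metadata). Halving the average chunk size
	// doubles the chunk count, and the index grows proportionally.
	const repoSize = 100 << 30 // 100 GiB
	for _, avg := range []int{1 << 20, 1 << 14} { // 1 MiB vs 16 KiB average
		chunks := repoSize / avg
		indexBytes := chunks * 64
		fmt.Printf("avg=%7d B -> %8d chunks, index ~%3d MiB\n",
			avg, chunks, indexBytes>>20)
	}
}
```

So going from a 1 MiB to a 16 KiB average multiplies the chunk count (and roughly the index size) by 64, which matches the "orders of magnitude larger index" observed below.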

Thanks. I decided to give it a shot. I'm actually surprised that I got it on my first try. I was sure it would blow up in some spectacular way. (It may still fail in some other way, of course.)

I also changed the min and max chunk sizes specified in chunker.go.
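For anyone trying the same thing: if I read chunker.go right, the stock values are MinSize = 512 KiB and MaxSize = 8 MiB around the 1 MiB average, i.e. min = avg/2 and max = avg*8. Preserving those ratios when shrinking the average is my own guess, not anything restic documents:

```go
package main

import "fmt"

func main() {
	// Hypothetical new target: ~16 KiB average chunks. Keeping the stock
	// min = avg/2 and max = avg*8 ratios is an assumption on my part.
	avg := 1 << 14
	min, max := avg/2, avg*8
	fmt.Printf("avg=%d min=%d max=%d\n", avg, min, max)
}
```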

edit: More surprising is that it did little to help deduplication. There are definitely more objects, and the index is orders of magnitude larger, yet I'm still not seeing the space savings of Bup despite targeting the same average chunk size. Even with Bup's compression off and Restic's on, Bup's backups are much smaller - by about a factor of 3.

This is a bit frustrating, since I like Restic better overall, but a factor of three (six with Bup's compression on) is hard for me to ignore.