Control the minimum pack file size

kayrus · April 25, 2018, 9:17am

I have a huge number of files (~100 million, ~50 TB in total) that need to be backed up to OpenStack Swift. Unfortunately, Swift recommends against storing more than 1 million files in one container.

I was hoping that restic could pack multiple files into one pack, and it actually can. However, according to the documentation the maximum pack file size is 8 MiB (my tests showed 4 MiB, and here is why: https://github.com/restic/restic/blob/2aa6b496519ef65c5cedc02aaaf2f3495137e6a5/internal/repository/packer_manager.go#L39).

I really don’t want to deal with multiple Swift containers; that would greatly complicate the backup task.
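For a sense of scale, here is a rough back-of-the-envelope sketch in Go. It assumes "~50 TB" means 50 TiB, that every pack is filled to exactly the given size, and it ignores deduplication and the repository's other files (indexes, snapshots). The helper names `packCount` and `minPackBytes` are made up for this illustration and are not part of restic.

```go
package main

import "fmt"

const (
	MiB = 1 << 20
	TiB = 1 << 40
)

// packCount estimates how many pack files are needed to hold totalBytes
// if every pack holds exactly packBytes (ceiling division).
// Illustrative helper only, not a restic function.
func packCount(totalBytes, packBytes uint64) uint64 {
	return (totalBytes + packBytes - 1) / packBytes
}

// minPackBytes returns the smallest pack size (in bytes) that keeps the
// estimated number of packs at or below maxFiles.
// Illustrative helper only, not a restic function.
func minPackBytes(totalBytes, maxFiles uint64) uint64 {
	return (totalBytes + maxFiles - 1) / maxFiles
}

func main() {
	total := uint64(50 * TiB) // assumption: "~50 TB" read as 50 TiB

	for _, mib := range []uint64{4, 64, 128, 512} {
		fmt.Printf("%4d MiB packs -> ~%d files\n", mib, packCount(total, mib*MiB))
	}

	need := minPackBytes(total, 1_000_000)
	fmt.Printf("pack size needed for <= 1,000,000 files: ~%.1f MiB\n",
		float64(need)/MiB)
}
```

Under these assumptions, 4 MiB packs give roughly 13.1 million files, and staying under 1 million files in a single container needs packs of a bit over 52 MiB.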

Would there be any negative consequences if I recompiled the source code with an increased minPackSize?

fd0 · April 25, 2018, 10:48am

Interesting limitation I wasn’t aware of, thanks!

You can raise the minimum pack size and it should not break anything, but you may see much higher memory usage. Please give it a try and report back, I’m very interested in your report!

kayrus · April 25, 2018, 1:48pm

I started to introduce a min-pack-size option (see https://github.com/kayrus/restic/commit/71c13bab23ecb45f23cafb57609930fb9ffba806) and noticed some potential limitations when processing a huge number of files on a 32-bit system. I suppose the documentation needs an extra note about 32-bit architectures: it might not fit files larger than 4 GB.

Another question: where should I put this option? Repo-wide or global? So far I have put it in the repo config.

Nevertheless, my changes work, but for some reason the pack files are mostly 250 MiB instead of 500 MiB. I set 500 MiB on the command line:

--min-pack-size 500

Did I miss something? Is there some extra logic that shrinks the pack size?

kayrus · April 25, 2018, 2:00pm

Additionally, it looks like restic doesn’t upload the data while it creates a backup. It takes time to create the backup first, then the ETA shows 0:00, and only at that point does restic actually start uploading the data. Is that how it is supposed to work?

kayrus · April 26, 2018, 11:35am

Quick update: it looks like restic has a pool of at most 10 workers, each filling its own pack (I could not actually find this limit in the code). In my case I tried to back up a whole VM with a total size of 2 GB. Restic basically spread the 2 GB across the 10 workers, so we got ~200 MB per pack. Besides, all 10 workers waited to fill a whole pack before uploading, which is why the actual pack upload to Swift happened only once the workers had finished their job.
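The behaviour described above can be reproduced with a toy model. This is only a sketch of the hypothesis in this post, not restic's archiver code: the round-robin distribution, the 1 MiB blob size, and the names `packer` and `simulate` are all assumptions made for illustration.

```go
package main

import "fmt"

const MiB = 1 << 20

// packer is a toy stand-in for one in-progress pack (hypothetical type).
type packer struct{ size uint64 }

// simulate hands out blobs of blobSize bytes round-robin to n packers until
// total bytes have been written. A pack is "uploaded" as soon as it reaches
// minPack; whatever is left in the packers is flushed at the end.
// It returns the sizes of packs uploaded during the run and at the end.
func simulate(total, blobSize, minPack uint64, n int) (during, atEnd []uint64) {
	packers := make([]packer, n)
	i := 0
	for written := uint64(0); written < total; written += blobSize {
		p := &packers[i%n]
		p.size += blobSize
		if p.size >= minPack {
			during = append(during, p.size)
			p.size = 0
		}
		i++
	}
	for _, p := range packers {
		if p.size > 0 {
			atEnd = append(atEnd, p.size)
		}
	}
	return during, atEnd
}

func main() {
	// Assumed scenario: ~2 GB of data, 10 packers, 500 MiB minimum pack size.
	during, atEnd := simulate(2000*MiB, 1*MiB, 500*MiB, 10)
	fmt.Println("uploaded during backup:", len(during))
	fmt.Println("flushed at the end:", len(atEnd), "packs of", atEnd[0]/MiB, "MiB")
}
```

In this model no packer ever reaches the 500 MiB threshold, so nothing is uploaded until the final flush, and each of the 10 packs ends up at about 200 MiB. That matches both the smaller-than-requested packs and the late upload.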

There is still an open question: where exactly should I put the new option? @fd0, please advise.

fd0 · April 28, 2018, 8:22pm

Thank you for describing the results of your experiments. I’m not convinced this is a good idea. All the difficulties you described are probably caused by the current archiver code, which will be replaced by the new code (see PR #1494) that I’ll merge shortly. Please have a look over there.

gabor · January 10, 2019, 8:42pm

Sorry for reviving an old thread, but I just ran into the same issue.
I’m backing up just 3 TB of data to Google Drive via the rclone provider, and I hit the 400,000-file quota enforced by Google. Is there a way to decrease the number of files created during backup runs?

cfbao · January 10, 2019, 9:22pm

AFAIK, this limit is only enforced on a single Team Drive. A Team Drive is meant to be used for a specific project, not general-purpose data storage.
There’s no limit on the number of files outside of Team Drives.