Odd repo size with max compression?

@MichaelEischer I was going to ask about that! But I didn’t want to clamor for a “conversion method” when I know it’s already being worked on haha. Thanks!


For the fun of it, I backed up my home folder to my existing v1 repository, then made a backup to two fresh v2 repos (one auto, one max):

  • In restore-mode, the total used space is 77.08 GiB
  • In v1 raw-data mode, it is 70.86 GiB
  • In v2 auto, it is 56.92 GiB (in 1:18:30) [1, 2]
  • In v2 max, it is 76.74 GiB, surprisingly (in 59:19) [2]

My main takeaway from this is, if you’re going to use v2-max, make sure you don’t have already-compressed data! If you have a mix, definitely use v2-auto! The overhead for compressing already-compressed data might actually make it worse than no compression at all, apparently.

  1. I was running the v1 backup to Backblaze B2 and the v2 auto backup to disk simultaneously, so the backup time is likely skewed. The v2 max backup ran separately.
  2. restic stats --mode raw-data returns the same amount for both v1 and v2 auto/max repositories; so you’ll need to look at the actual used disk space to see the compression results.
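
Since restic stats does not yet separate compressed from uncompressed size, one way to compare repositories is to measure their on-disk footprint directly. A minimal sketch, assuming the repositories are local directories; repo_sizes is a hypothetical helper name, not part of restic:

```shell
# Print the on-disk usage (in KiB) of each directory given as an argument,
# largest first. This measures what the repository really occupies on disk,
# independent of what restic stats reports.
repo_sizes() {
  for repo in "$@"; do
    # du -sk prints the total in 1024-byte units on both macOS and Linux
    size_kib=$(du -sk "$repo" | awk '{print $1}')
    printf '%s\t%s\n' "$size_kib" "$repo"
  done | sort -rn
}

# Usage (paths are placeholders):
#   repo_sizes /path/to/repo_v1 /path/to/repo_v2_auto /path/to/repo_v2_max
```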

That explanation is incompatible with the implementation. Both auto and max compress every data blob; the only difference is how much effort zstd spends on shrinking the data. If a data blob turns out to be incompressible, it is simply stored as raw data.

As @MichaelEischer wrote, that result is very unexpected. Do you maybe still have the log from that run of restic?

Yes, we’ll add stats for compressed/uncompressed later.

Interesting. I did see somewhere that both compressed every blob, and was going to ask later how auto truly worked, after I pored through the GitHub threads and made sure I wasn’t asking a redundant question haha. Very strange result, then…

So am I correct in saying that auto is something like “zstd -3” and max something like “zstd -19” and in either case if Restic sees that the compressed blob is bigger than the original blob, Restic discards the compressed blob and just uses the uncompressed blob?

I do not, but I’m happy to try it again. I, too, was pretty surprised, and actually recreated the repo from scratch, ensuring --compression max was used for the init command, and then just to be 100% sure the second time, specifically using --compression max for the backup command as well (EDIT: or so I thought - I’m doubting this after subsequent testing with Michael below). I’m assuming that whichever compression switch you use at init time is set as the “default” and can be “overridden” for each backup. That’s how it appeared to work on auto, when I tested using no switches on a blank repo, then using auto, and then max, on two other empty repos, and comparing the final repo sizes (on a smaller file set, just to get a “feel” for how it worked).

I’m currently using auto on about 17TB of data, just to see what happens :joy: but also to see if I can break anything or flush out any bugs before compression gets pushed into self-update or brew upgrade.

So I’ll also start up:
restic_v2 init --repository-version latest --compression max -r /Volumes/Backup/repo_v2

Then do a:
restic_v2 backup --compression max -r /Volumes/Backup/repo_v2 ~ -v 2>~/Desktop/restic_v2_error.log 1>~/Desktop/restic_v2_output.log

And report back. :slight_smile:

Yes, auto and max just correspond to different zstd levels. Detecting that a blob is not compressible is already handled by zstd itself, so there’s no additional check for that in restic.

No, the init command does not even look at the --compression switch. A v2 repo always defaults to auto as compression level unless you specify something different for a command.
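
Because the mode has to be given on the command that writes data, a small wrapper can make it harder to forget. This is only a sketch: backup_with_mode is a hypothetical helper, and restic_v2 is the binary name used earlier in this thread.

```shell
# Hypothetical wrapper: always pass an explicit compression mode to backup,
# since init does not store one in the repository.
backup_with_mode() {
  mode=$1   # off, auto, or max
  repo=$2
  shift 2   # everything left is passed through (paths, extra flags)
  restic_v2 backup --compression "$mode" -r "$repo" "$@"
}

# Usage (repository path is a placeholder):
#   backup_with_mode max /path/to/repo_v2 ~ -v
```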

Okay, so, to repeat the initial test I did to get a feel for how the new repository format behaves…

I just now ran these three commands:

restic_v2 init --repository-version latest --compression off -r /Volumes/Fortress_L3/test/test_off

restic_v2 init --repository-version latest --compression auto -r /Volumes/Fortress_L3/test/test_auto

restic_v2 init --repository-version latest --compression max -r /Volumes/Fortress_L3/test/test_max

I then copied my ~/Library/Logs folder to “Logs copy” for a static test, as I assume it would be fairly compressible, and ran the following three commands:

restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_off

restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_auto

restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_max

Then I ran du and got the following:

du -d 1 /Volumes/Backup | rg test
238336 /Volumes/Fortress_L3/test/test_off
236800 /Volumes/Fortress_L3/test/test_auto
235776 /Volumes/Fortress_L3/test/test_max

SOMETHING is going on, though considering what you just said, I have no idea what that something is haha


So with that in mind I ran:

restic_v2 init --repository-version latest -r /Volumes/Fortress_L3/test/test_off2

restic_v2 init --repository-version latest -r /Volumes/Fortress_L3/test/test_auto2

restic_v2 init --repository-version latest -r /Volumes/Fortress_L3/test/test_max2

And then this time ran:

restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_off2 --compression off

restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_auto2 --compression auto

restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_max2 --compression max

And yes there’s now a much more significant difference!

du -d 1 . | rg test | sort -r
943360 ./test_off2
243200 ./test_auto2
238336 ./test_off
236800 ./test_auto
235776 ./test_max
219648 ./test_max2

So I guess now the question is, what’s with the random size fluctuations between what should have all been essentially “auto” repositories in the first test batch? Is restic using Zstd with something like zstd --adapt where the compression level changes on the fly, thus every run could be different? This is a copy of my log folder, so the source is static. But yes, this test is what initially fooled me into thinking that --compression was taken into consideration by init.

That said, I think it would be handy if you COULD specify a “default compression method” with init. :wink:


Okay SOMETHING very strange is going on, because I ran a third max test to see if the size would remain the same at least between max runs, but…

restic_v2 init --repository-version latest -r /Volumes/Fortress_L3/test/test_max3

restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_max3 --compression max

du -d 1 . | rg test | sort -r                                                                           
943360	./test_off2
254976	./test_max3
243200	./test_auto2
238336	./test_off
236800	./test_auto
235776	./test_max
219648	./test_max2

:face_with_raised_eyebrow:


Okay, one more max run just to see…

restic_v2 init --repository-version latest -r /Volumes/Fortress_L3/test/test_max4

restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_max4 --compression max

du -d 1 . | rg test | sort -r                                                                           
943360	./test_off2
254976	./test_max3
243200	./test_auto2
238336	./test_off
236800	./test_auto
235776	./test_max
235264	./test_max4
219648	./test_max2

Yeahhhh, this is just all over the place. Did I find the first bug?? :smiley:

Unfortunately not :wink:

When you initialize a new repository, restic will choose a random initialization value for the chunking algorithm used for deduplication. So for each repo, restic will split larger files in a slightly different way into blobs. Since compression is done per blob, this will yield slightly different sizes.
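
The effect can be illustrated without restic at all. The sketch below is an analogy only: it uses fixed-size split and gzip as stand-ins for restic's content-defined chunker and zstd, and chunked_gzip_size is a hypothetical helper name. It cuts the same file into pieces, compresses each piece on its own, and sums the results, so that different piece boundaries give different totals.

```shell
# Sum of the compressed sizes when FILE is cut into pieces of CHUNK_BYTES
# each and every piece is compressed independently.
# Note: split's default two-letter suffix allows at most 676 pieces.
chunked_gzip_size() {
  file=$1
  chunk_bytes=$2
  workdir=$(mktemp -d)
  split -b "$chunk_bytes" "$file" "$workdir/piece."
  total=0
  for piece in "$workdir"/piece.*; do
    # gzip -c writes to stdout; wc -c counts the compressed bytes
    size=$(gzip -c "$piece" | wc -c)
    total=$((total + size))
  done
  rm -rf "$workdir"
  echo "$total"
}

# Usage: compare two different ways of cutting the same file, e.g.
#   chunked_gzip_size some.log 524288
#   chunked_gzip_size some.log 1048576
```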

If you want to test this, you can initialize the first repo, then initialize the other ones while copying the chunker parameters from the first one:

$ restic_v2 init --repository-version latest -r /Volumes/Fortress_L3/test/test_max5
$ restic_v2 init --repository-version latest -r /Volumes/Fortress_L3/test/test_max6 --copy-chunker-params --repo2 /Volumes/Fortress_L3/test/test_max5

Then try again with these repos using --compression max; it should give you almost exactly the same sizes.
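
For more than two test repositories, the same two commands can be wrapped in a loop. A rough sketch using only the flags shown above; init_matching_repos is a hypothetical helper name, and restic will still prompt for passwords as usual.

```shell
# Hypothetical helper: create the first repo normally, then create every
# further repo with the chunker parameters copied from the first one.
init_matching_repos() {
  first=$1
  shift
  restic_v2 init --repository-version latest -r "$first"
  for repo in "$@"; do
    restic_v2 init --repository-version latest -r "$repo" \
      --copy-chunker-params --repo2 "$first"
  done
}

# Usage (paths are placeholders):
#   init_matching_repos /path/to/test_off /path/to/test_auto /path/to/test_max
```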

FYI, I’ve just split this out as a new topic so it won’t clutter the announcement thread :slight_smile:

Ah, darn! Thought I was on to something :wink:

Good to know, though!

Okay, this time I initialized one repo, then just duplicated it five times.

I then ran:

restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_off --compression off
restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_auto --compression auto
restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_max --compression max
restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_auto2 --compression auto
restic_v2 backup /Users/akrabu/Library/Logs\ copy -r /Volumes/Fortress_L3/test/test_max2 --compression max

And got a much more expected result. :slight_smile:

Question - is it possible to do any fine-tuning of the chunker parameters? I did notice at one point a 17.5MB difference between “max2” and “max3”. Or is this something I’d need to change in the source code and build myself? And by doing so would I be decreasing the encryption security as well? Assuming that’s why it’s randomized…

The profile backup finally finished. Considering how long it was taking, I was pretty sure I had goofed the first time around and not specified --compression max for the backup command. I’m confirming that now, because it’s only 61.27GiB now. “Auto” still beat it, but that was yesterday and who knows how much my profile has grown since. I’ll continue playing and report back if I notice anything weird. Thanks!