Restic compression not compressing almost anything for my repo

jarm0 · August 27, 2022, 1:58pm

Hi!

Thank you for this wonderful tool. I’ve been using it for years by now and love it. After seeing that compression has been officially released I gave it a try.

I’m backing up a Linux server to Backblaze, which includes a lot of log files, database backups, jpg images etc. There are 270 snapshots and the total repo size before compression was 2924.8 GB.

After getting the new restic version I ran the following commands:

restic migrate upgrade_repo_v2 --compression max
restic prune --repack-uncompressed --compression max --max-repack-size=5G

Repo size after running these commands is 2920.5 GB. Basically after enabling maximum compression I’ve got an extra 4MB. This does not make sense to me.

What have I done wrong?

Of course, it’s possible that the log files are already compressed by logrotate, the JPEGs are already compressed well enough, the database backups are in psql binary format, and everything else just does not compress that much.

fd0 · August 27, 2022, 2:51pm

I’m not sure, but as far as I can see the repo afterwards is 4.3 GB smaller than before, so you got a reduction in size.
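The difference can be checked directly from the two sizes quoted in the first post. A minimal sketch, using awk because plain shell arithmetic has no floating point (the variable names are only for illustration):

```shell
# Sizes in GB, copied from the first post.
before=2924.8
after=2920.5

# awk does the floating-point subtraction.
saved=$(awk -v b="$before" -v a="$after" 'BEGIN { printf "%.1f", b - a }')
echo "saved ${saved} GB"
```

So the first prune run saved a few GB, not a few MB.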

I also have an idea why it isn’t even smaller: in order for restic to compress data in an existing repo, it first needs to download the data, compress it, and then reupload it again. That’s exactly the same process (called “repack”) which happens regularly during prune (just without the compression in older releases before 0.14.0).

In your case, you instructed restic to at most repack 5GB, so I suspect that’s the reason why you don’t see more reduction: you just told restic to re-upload at most 5GB. So restic downloaded roughly 5GB, compressed it, and uploaded less than 1GB.

You can try again with restic prune, with a larger --max-repack-size. Or you could run restic prune regularly, and over time all data will be compressed.

Please report back!

jarm0 · August 28, 2022, 8:17am

@fd0 thank you for your speedy reply!

Thanks for noticing that it was 5G and not 5M, as I had thought with my invalid calculations.

I guess I have misunderstood --max-repack-size - I assumed for some reason that it means basically something like “do not compress any files, which are larger than specified max-repack-size parameter”. Reading your reply I understand that it is a setting meant for gradual compression of data retroactively. Is that right?

How would I compress the whole repo in one prune? Should I just not specify --max-repack-size at all? Or should I just set it to something really large like 10T?

fd0 · August 28, 2022, 9:10am

Correct. During prune, the repo is locked for all other operations, so you cannot run a backup while prune is running. Our idea was to give users the option to split prune into several smaller runs, which is especially relevant for people who have low upstream bandwidth. They can use --max-repack-size to limit how much data is downloaded and then re-uploaded during a single prune run, so restic will still clean up, but over time instead of all in one go.

Just run restic prune --repack-uncompressed. This will mark all uncompressed data as “to be repacked”, and without a limit on that restic will compress everything.

MichaelEischer · August 28, 2022, 9:17am

Although for a multi-TB repository I’d recommend repacking it in a few steps. Otherwise, if prune is interrupted for any reason, you’d have to start from the beginning.
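Repacking in steps could be scripted roughly as follows. This is only a sketch: repack_in_steps and its two arguments are hypothetical helper names, not restic features, and the restic flags are the ones already used earlier in this thread. It assumes the repository location and password are already configured in the environment.

```shell
# Hypothetical helper: run several bounded prune passes in a row.
#   $1 = upper bound on repacked data per pass (e.g. 200G)
#   $2 = number of passes to attempt
repack_in_steps() {
    step_size=$1
    passes=$2
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "prune pass $i of $passes (repacking at most $step_size)"
        # Stop at the first failing pass instead of continuing blindly.
        restic prune --repack-uncompressed --compression max \
            --max-repack-size="$step_size" || return 1
        i=$((i + 1))
    done
}
```

A call such as `repack_in_steps 200G 5` would then run up to five passes. The repository is still locked during each pass, but an interruption should only cost the pass that was running, and backups can be scheduled between separate invocations.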

jarm0 · August 28, 2022, 9:28am

That is actually a really thoughtful feature. For my current repository I can wait however long it takes for compression to finish, so it is no problem that backups do not work in the meantime. However, for some of my more critical repositories I would definitely use this feature to allow backups to be done in the meantime.

A few questions though:

Is it okay to ctrl+c while prune is in progress or is there a chance of repository corruption?
If using --max-repack-size, how do I know when I can stop running prune because all the old data has already been compressed?
As I understand it, I need to pass --compression max to the prune command each time, and also to the backup command, to get the benefit of maximum compression. What about forget with --prune after everything has already been compressed by a prune without --max-repack-size? Are there any other commands I should use the --compression max flag with, since it is listed under global flags?
When running forget with --prune periodically, does that also compress data from the past, or does it just remove unused data so that after a while only newly compressed data is left because all the old data has been removed?

fd0 · August 28, 2022, 9:39am

What’s not explicitly mentioned (but maybe we should add that): for safety reasons, restic prune will first upload all newly compressed data, and only start removing the old data at the very end. So if you have a repo of 3TB and only 500GB left on the hard drive, it will likely fill up the drive and fail before all is compressed. If you split this into several runs with e.g. --max-repack-size 400GB then you can compress everything despite the space limitations.
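As a rough worked example of those numbers: the sketch below assumes the worst case, where the whole 3 TB is still uncompressed and barely shrinks, so each run may need up to the full step size in temporary extra space. The variable names are only for illustration.

```shell
repo_gb=3000   # 3 TB repository
free_gb=500    # free space on the drive holding the repo
step_gb=400    # value chosen for --max-repack-size

# Ceiling division: upper bound on the number of prune runs needed.
runs=$(( (repo_gb + step_gb - 1) / step_gb ))

# Each run has to fit its newly written data before old data is removed.
if [ "$step_gb" -le "$free_gb" ]; then
    echo "at most $runs runs, each needing up to ${step_gb} GB of extra space"
else
    echo "step size exceeds free space, pick a smaller step"
fi
```

In practice fewer runs may be needed, since every run also frees the space taken by the old uncompressed data it replaced.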

It should be safe, although your drive (if you use one) will fill up and you have to start prune at the beginning again.

It will print some stats during the run, one of them is the “repack size”. If that’s zero, then you’re done.

Good question: every command that uploads data will check that flag (backup and prune mostly), the others do not care. You can also set $RESTIC_COMPRESSION instead of passing the flag.
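For example, the variable can be exported once per shell session or at the top of a backup script. A minimal sketch (the echo line is only there to show that child processes inherit the value):

```shell
# Set once; every restic command started from this shell inherits it.
export RESTIC_COMPRESSION=max

# Quick check that a child process sees the setting.
sh -c 'echo "compression mode: $RESTIC_COMPRESSION"'
```

With this in place, later backup and prune invocations in the same session should no longer need an explicit --compression max.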

The latter: by default restic prefers to leave data that’s already stored in the repo as it is. Over time more and more data will be compressed, as uncompressed data is removed by forget and prune.

The idea is that you can migrate your repo and instantly benefit from compression (less new data is uploaded) without having to wait until all data is compressed in the whole repo.

jarm0 · August 28, 2022, 10:20am

By “hard drive” you mean the repository, not the local hard drive on the system where restic is run? Since I’m using Backblaze for the repo, I should not have this problem, correct?

Maybe it would make sense to move that flag under backup and prune then instead of having it listed as a global flag?

Makes sense. Basically, running prune with --repack-uncompressed only makes sense when forget with --prune is never run and data backed up in the past needs to be compressed. Or if you want to compress your whole repo without waiting until forget and --prune “catch up”.

Thank you for all the insightful answers. Hopefully they help someone else understand the compression feature a little better, too.

fd0 · August 28, 2022, 11:27am

Exactly, that should not be a problem

Ah, hm. That’d require restructuring

MichaelEischer · August 28, 2022, 3:48pm

As long as it is a global flag, you can just pass that flag to every command. If it were specific to some commands, you would need to know exactly which ones upload data and which don’t. For example, at least forget --prune, tag and recover also need the flag.

rawtaz · August 28, 2022, 6:06pm

@jarm0 Feel free to mark the reply in this thread that you feel is closest to answering the original question/problem as the solution

jarm0 · August 31, 2022, 4:59pm

Final result after compressing whole repo. Before: 2924.8 GB. After: 617.1 GB.
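Put as a ratio, using only the two sizes above (a small sketch, with awk for the floating-point arithmetic):

```shell
# Sizes in GB, copied from the line above.
before=2924.8
after=617.1

summary=$(awk -v b="$before" -v a="$after" 'BEGIN {
    printf "saved: %.1f GB (%.1f%%)\n", b - a, (b - a) / b * 100
    printf "ratio: %.2fx\n", b / a
}')
echo "$summary"
```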

The process took approximately 48 hours.

Good job!