How to check if files were compressed?

adamhl8 · August 30, 2022, 9:23pm

Is there a way to check if files were compressed at a given compression level other than comparing the size of the repo before/after enabling compression? I set compression to max via environment variable before running my backup, and I want to make sure it actually applied.

Thanks!

kwinsch · September 26, 2022, 9:58am

Yeah, I currently have exactly the same question. I migrated a v1 repo to v2, set the “RESTIC_COMPRESSION” environment variable to “max”, and re-compressed using the prune subcommand with the “--max-repack-size” option. I found no way to prove that restic really used the max compression setting, or that compression was used at all.

Would it be possible to extend “restic stats” to include information on compressed blocks? Method and compression level would be nice.

rawtaz · September 26, 2022, 11:03pm

Interesting question. I am not aware of a way to see this, and honestly I don’t think restic stores information about which level of compression was used in the snapshots, but I could be wrong!

If it’s not simple or possible to get that information out of a stats command, what do you peeps think about restic simply outputting which compression level is used, during the backup run?

On a note regarding use case - why are you pondering this in the first place? Has there been an actual problem of some sort?

alexweiss · September 27, 2022, 4:54am

AFAIK, the compression level is merely a parameter for how much “work” the zstd algorithm invests in producing the compressed data; you cannot determine from the result (or from a decompression algorithm) how much work was invested.

restic does not store any information about the compression level, so it simply cannot be shown or determined (except by taking the uncompressed data for each blob and trying to recompress it at several compression levels).
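The recompress-and-compare idea can be sketched as follows. This is only an illustration under stated assumptions: it uses raw deflate from Python's standard `zlib` module as a stand-in for zstd (restic's actual compressor is not used here), and the helper names `raw_deflate`, `raw_inflate` and `candidate_levels` as well as the sample blob are made up for the example.

```python
import zlib


def raw_deflate(data: bytes, level: int) -> bytes:
    """Compress with raw deflate (wbits=-15, so no zlib header is written)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def raw_inflate(data: bytes) -> bytes:
    """Decompress a raw deflate stream; no level parameter is needed."""
    return zlib.decompress(data, wbits=-15)


def candidate_levels(original: bytes, compressed: bytes) -> list[int]:
    """Return every level whose output is byte-identical to `compressed`.

    Several levels can produce the same bytes, so the result is a list of
    candidates rather than a single definite answer.
    """
    return [
        level
        for level in range(1, 10)
        if raw_deflate(original, level) == compressed
    ]


# Hypothetical blob content, chosen only so the levels have something to do.
blob = b"".join(b"line %d: restic stores data in blobs\n" % i for i in range(2000))

fast = raw_deflate(blob, 1)
best = raw_deflate(blob, 9)

# Both streams decompress to the same bytes, whatever level produced them.
assert raw_inflate(fast) == raw_inflate(best) == blob

print("sizes:", len(blob), len(fast), len(best))
print("levels matching the level-9 stream:", candidate_levels(blob, best))
```

The brute-force search needs the uncompressed data and one compression pass per level, which is why doing this for every blob in a repository would be expensive.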

However, the following information is available per blob:

whether the blob is compressed or not
the uncompressed size
the compressed size
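Given those three pieces of per-blob information, an overall compression summary is simple arithmetic. A minimal sketch, assuming the values have already been extracted somehow: the `BlobEntry` type and its field names are hypothetical and are not taken from restic's repository format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlobEntry:
    # Hypothetical record; field names are illustrative only.
    length: int                                # size as stored in the repository
    uncompressed_length: Optional[int] = None  # None means "stored uncompressed"


def summarize(blobs):
    """Aggregate stored vs. uncompressed sizes over a list of BlobEntry."""
    stored = sum(b.length for b in blobs)
    # For uncompressed blobs the stored size doubles as the original size.
    original = sum(
        b.uncompressed_length if b.uncompressed_length is not None else b.length
        for b in blobs
    )
    return {
        "blobs": len(blobs),
        "compressed_blobs": sum(
            1 for b in blobs if b.uncompressed_length is not None
        ),
        "stored_bytes": stored,
        "uncompressed_bytes": original,
        "ratio": original / stored if stored else None,
    }


# Made-up numbers: one compressed blob and one stored as-is.
print(summarize([BlobEntry(100, 400), BlobEntry(50)]))
```

This shows whether compression happened and how effective it was, but still says nothing about which level produced the result.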

akrabu · September 27, 2022, 4:29pm

I’d love to be able to do a “restic stats --mode raw-data” and “restic stats --raw-data-uncompressed” or something to that effect, to be able to see the compression’s effect.

I’ll often tell my users when restic dedupes, say, 500GB down to 250GB, so they know they need to run a duplicate finder. But now the results are a little exaggerated, unless --mode raw-data actually reports the uncompressed size? I’m not quite sure how it works anymore.

MichaelEischer · September 27, 2022, 5:37pm

Please have a look at Pull Request #3915 in restic/restic on GitHub, “restic stats: print uncompressed size in mode raw-data” by plumbeo. Does that PR add the information you’re interested in?

raw-data is the compressed size for a V2 repo.

akrabu · September 29, 2022, 5:41pm

Ahh perfect, I’ll be following that one.

Cool, I thought so, but I wasn’t entirely sure. Thanks!

kwinsch · October 3, 2022, 5:36pm

I think that for the stats sub-command, showing the uncompressed and compressed data sizes is enough. For the actual backup, re-packing, etc., where variables like “RESTIC_COMPRESSION” apply, it would still be good to be able to see whether a compression level got picked up by the command and which one, at least when a verbose option is used. The same problem might apply to all ENV variables.
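Until restic reports this itself, a small wrapper could at least log which setting it is about to hand over. The sketch below does not call restic at all; it only inspects the arguments and environment. It rests on assumptions that should be checked against the restic documentation: that the valid modes are “auto”, “off” and “max”, that “auto” is the default, and that a `--compression` flag takes precedence over “RESTIC_COMPRESSION”. The function name `effective_compression` is made up for the example.

```python
import os

# Assumption: these are the modes restic accepts.
VALID_MODES = {"auto", "off", "max"}


def effective_compression(argv, env=None):
    """Return (mode, source) for a planned restic invocation.

    Assumes: default is "auto", RESTIC_COMPRESSION overrides the default,
    and a --compression flag overrides the environment variable.
    """
    env = os.environ if env is None else env
    mode, source = "auto", "default"

    env_value = env.get("RESTIC_COMPRESSION")
    if env_value:
        mode, source = env_value, "RESTIC_COMPRESSION"

    for i, arg in enumerate(argv):
        if arg == "--compression" and i + 1 < len(argv):
            mode, source = argv[i + 1], "--compression"
        elif arg.startswith("--compression="):
            mode, source = arg.split("=", 1)[1], "--compression"

    if mode not in VALID_MODES:
        raise ValueError(f"unexpected compression mode {mode!r} from {source}")
    return mode, source


# Example: log the setting before starting a backup.
mode, source = effective_compression(
    ["backup", "/data"], {"RESTIC_COMPRESSION": "max"}
)
print(f"compression mode: {mode} (from {source})")
```

A typo in the variable name (for example a misspelled “RESTIC_COMPRESSION”) would show up here as the default being reported, which is exactly the situation that is otherwise hard to notice.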

The use cases are definitely related to migration. The user wants to see whether the selected compression level was correctly applied. I searched the source code a little to see what actually gets stored in the repo. It seems that the compression algorithm and level used are not stored. So if an archive was re-packed with “auto”, it cannot be automatically upgraded to “max” later. Likewise, if an additional compressor gets added in the future, it will be difficult to support without repo changes. In most cases a simple re-upload is easier anyway, so no big deal in the end.

Thanks for the work. The feature in general is really awesome and restic is now really the program of choice!

betatester77 · October 6, 2022, 11:50am

For me the output of a verbose run is okay:

Added to the repository: 7.960 GiB (2.877 GiB stored)
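That summary line is enough to work out the ratio for a run. A minimal sketch that parses it, assuming the line always has the shape shown above and uses binary units such as MiB or GiB; the function name `parse_added` and the unit table are part of the example, not of restic.

```python
import re

# Assumption: sizes are printed with these binary unit suffixes.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

PATTERN = re.compile(
    r"Added to the repository:\s+([\d.]+)\s+(\w+)\s+\(([\d.]+)\s+(\w+)\s+stored\)"
)


def parse_added(line):
    """Return (added_bytes, stored_bytes), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    added_unit = UNITS.get(match.group(2))
    stored_unit = UNITS.get(match.group(4))
    if added_unit is None or stored_unit is None:
        return None  # unknown unit suffix
    return float(match.group(1)) * added_unit, float(match.group(3)) * stored_unit


line = "Added to the repository: 7.960 GiB (2.877 GiB stored)"
added, stored = parse_added(line)
print(f"ratio {added / stored:.2f}, saved {100 * (1 - stored / added):.1f}%")
```

For the line quoted above this works out to a ratio of roughly 2.77, or about 64% saved, which confirms that compression was applied but not at which level.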

thedaveCA · October 9, 2022, 4:28am

I stumbled across this thread today as I am finally upgrading a couple more repositories to v2 and considered the type of data I store here.

On my mail servers I had two types of data:

The email itself, one message per file (roughly maildir); these tend to be quite compressible as email is often plain-text.
An archive of historical email, written in a binary form that is deduplicated more aggressively than restic does (operating at the MIME part level to deduplicate between different messages, and it understands that 7-bit encoded content can be written to disk using all 8 bits), compressed, and then encrypted at rest. The files are (mostly) immutable like blobs.

I was considering whether I should set the archive backup job to --compression off to save some CPU cycles during the backup, as this server is a bit CPU starved, while letting the email content get compressed. But I don’t think there is any point, because as soon as I prune a blob, the content will all get re-compressed with the same setting.

If restic recorded the compression originally selected it could re-use that same setting when re-compressing the blob later. But I suspect this would complicate the prune operation rather significantly to take into account what type of compression is desired for each blob file as a whole.

Or maybe I’ll just move my prune operations to a machine with more bandwidth and no CPU constraints, let it prune with --compression max and waste all the time it wants trying to compress the uncompressible without hurting anyone, all to squeeze a bit more cloud savings from the compressible email content.

I suspect I am over-thinking this vs the real-world savings for the amount of data, and splitting it into separate repositories might be the way to go.