Status of the compression feature

Hello restic contributors/community,

I would like to know what the current status of the compression feature is. I created this post so as not to add more comments to GitHub issue #21, since @fd0 asked us not to do that.

Unfortunately, the Status/Roadmap wiki has not been updated since 18 October 2018, and the comments on the issue are quite old (except for the last four). In the near future I would like to use restic as a replacement for an old Obnam setup, and it would be beneficial to use a version with the compression feature, or at least to know that it is under development. So it would be nice to get a little update on the status of this feature.

I’m not asking for an explicit deadline or anything of the sort, just for a status like, for example, “is currently being worked on” or “is coming soon”.

Thank you for your answers.

AFAIK, development of a compression feature hasn’t started yet (at least not in the official GitHub repository).

You should try restic out before you make compression a requirement. Not all data benefits from compression.

Some time ago I extensively tested and compared Borg Backup and restic. It turned out that the repository sizes of restic and Borg Backup were about the same for most of my data sets, even though compression was enabled for Borg Backup.

Thanks for your answer.
I already compared restic backups with the backups created via Obnam and BorgBackup. The restic repository was about 400 MB, while the ones from Obnam and BorgBackup with compression were about 115 MB. So in my use case compression would be beneficial.
But I will think about it anyway.

Interesting. Why would that be?

Many common file types (e.g. most image/audio/video files, MS Office documents, some PDFs…) are already compressed, and can hardly be compressed further.
Typically compression only makes a big difference if you have tons of pure text files.

Having tons of text (files) is part of my use case.

I also use borg and gain a lot with compression (mostly code projects).
I’d like to know what makes it so hard to add compression.

I have learned that changing the repository format is more problematic than implementing the compression itself.
And with this I fully understand why it takes some time to implement this change. Nobody wants their repositories and backups broken or incompatible.

For now dedup works pretty well.

For the record, I migrated my old BackupPC archives to restic (see https://github.com/renard/backuppc2restic for details). BackupPC uses both compression and deduplication. I have about 47 TB of raw data in 1.5 TB of effective data with BackupPC. It takes something like 1.7 TB with restic.

This is good enough for me.

Use case I care about/need to implement:
Having a dev environment with GitLab, Jenkins, and so on. While GitLab has its own backup feature, which uses compression, Jenkins and other dev tools/sites do not, so making some kind of tar.gz greatly reduces the size.
This is important for offsite backups, both for transfer speed (i.e. finishing the backup faster) and for the space used.

Has anyone tried creating a .tar.gz or similar of the entire directory being backed up, and then only backing up the compressed file instead of the original directory? It seems that gzip is actually deterministic in terms of hashes, and from my testing it doesn’t change the mtime of individual files either. From my understanding of how restic works, this might mean that restic is able to recognize the deterministic similarities between two compressed files with an identical name and avoid duplicating identical data?
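
As a side note on the determinism part of that question: Go’s compress/gzip writes an empty header (no file name, zero mtime) by default, so compressing the same bytes twice yields byte-identical output; the command-line gzip tool, by contrast, stores the input file’s mtime in the header unless you pass -n. The following is only a minimal sketch of that property, not anything restic itself does:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
)

// gzipBytes compresses data with Go's default gzip header (no file name,
// zero mtime), so identical input always produces identical output bytes.
func gzipBytes(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Bytes()
}

func main() {
	data := bytes.Repeat([]byte("some text that compresses well\n"), 1000)
	a := sha256.Sum256(gzipBytes(data))
	b := sha256.Sum256(gzipBytes(data))
	fmt.Printf("%x\n%x\nequal: %v\n", a, b, a == b)
}
```

Whether restic can then deduplicate two such archives is a separate question: a change near the start of the archive shifts all later compressed bytes, so in general only the unchanged prefix of the archive will dedup well.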

I’m very keen for this feature! Posting here to show my support and to not spam GitHub.

Related: does restic take donations? Would fd0 set up a GitHub Sponsors or Patreon account? Would a bounty for this feature be appreciated? I’m keen to put my money into great open source software!

Compression opens door to some attacks and may not be good for security.

Eli6, can you ELI5?

See chosen-plaintext attacks, such as CRIME.

The NSA apparently decrypts a lot of traffic using this attack on TLS.

Admittedly, it’s mostly relevant to interactive environments and less so to backups. But I have concerns, because with backups users also interact with remote servers in various ways.

I don’t see how this attack could apply to a compression feature in restic. For compression restic would probably split files into blobs, compress those and then assemble larger packs from them. In that case blobs from different files would be compressed independently, which prevents info-leaks across blobs. And as restic doesn’t run code controlled by an attacker, there’s no place where incrementally choosing a plaintext would happen.
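
To make that concrete, here is a rough sketch (not restic code, and any real design may differ) of what per-blob compression could look like: each chunk is compressed entirely on its own, so the compressed size of one blob cannot depend on the contents of any other blob or file.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
)

// compressBlob compresses a single blob (one file chunk) on its own.
// Because every blob is compressed independently, the compressed size of
// one blob never depends on the contents of any other blob or file.
func compressBlob(blob []byte) []byte {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	zw.Write(blob)
	zw.Close()
	return buf.Bytes()
}

func main() {
	// Hypothetical chunks, as produced by content-defined chunking of two files.
	blobs := [][]byte{
		bytes.Repeat([]byte("secret configuration data\n"), 100),
		bytes.Repeat([]byte("attacker supplied content\n"), 100),
	}
	for i, b := range blobs {
		fmt.Printf("blob %d: %d -> %d bytes compressed\n", i, len(b), len(compressBlob(b)))
	}
}
```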

Here is an example.

I have a Dropbox account and folder /home/Dropbox. I encrypt the whole /home directory and back up to Dropbox.

Dropbox puts a file, let’s say a copy of a Windows image, into my Dropbox directory. Dropbox then measures the size of the new blobs that are added on their servers. If the size has not changed, Dropbox infers that a copy of Windows exists on my computer; that is, Dropbox recovers one of my files. Repeat for other files, images, texts, sentences, messages, blobs, etc. The attacker needs to be able to add or drop plaintext (or otherwise have some control over the plaintext) and measure the size of the ciphertext.

It’s client-side scanning like the one proposed by Apple, with the crypto prepared by restic!

In this case, dedup works somewhat similarly. But I suppose deduplication does not replace compression, which is why compression actually further reduces repository size.

Now this is from some random Joe. Imagine what sophisticated attacks NSA could do.

Once you interact with a cunning adversary, the features you include in your software can be opportunities for the adversary.

That sounds like a circular dependency. Why would you encrypt data from Dropbox and add that to your encrypted /home backup?

The attack you’ve described works on the deduplication level. For it to work in restic, an attacker would have to guess a complete blob correctly, which has a size of about 1 MB, or the whole file if it’s smaller. That is, it won’t be possible to incrementally guess secret tokens as in the CRIME/BREACH attacks. If an attacker is able to add files to the folders you are backing up, then it’s not really possible to avoid the information-leak side channel, as it’s essentially inherent to deduplication. However, in that case you should start questioning how trustworthy the host creating the backup is. Whether or not individual blobs are compressed doesn’t matter at all here.
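
A tiny sketch of why this side channel belongs to deduplication rather than compression: in a content-addressed store, a planted file only fails to produce new uploads if it matches an already-stored blob byte for byte. The store type below is purely illustrative, not restic’s data model.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// store is a toy content-addressed store: a blob is only "uploaded" if its
// hash is not already present, which is exactly the behaviour the size-based
// side channel observes.
type store map[[32]byte][]byte

func (s store) add(blob []byte) (uploaded bool) {
	h := sha256.Sum256(blob)
	if _, ok := s[h]; ok {
		return false // deduplicated: nothing new leaves the machine
	}
	s[h] = blob
	return true
}

func main() {
	s := store{}
	s.add([]byte("a chunk that already exists in the backup"))

	// A planted file is only deduplicated if it matches an existing blob
	// exactly; with a single different byte it is uploaded as new data.
	fmt.Println("exact copy uploaded:", s.add([]byte("a chunk that already exists in the backup")))
	fmt.Println("near copy uploaded: ", s.add([]byte("a chunk that already exists in the backup!")))
}
```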

I am not sure if you understood the attack.
You don’t search for or guess anything.

Read up on the class of chosen-plaintext attacks and the role compression plays in them.

Think of it as client-side scanning. I have a database of child abuse images, and my task is to determine whether you have these images on your computer.

For client-side scanning there’s no need to use compression. Just provide a list of known “bad” file hashes and check whether a file matches one of those. (One can of course add lots of extra cryptography to improve privacy and reliability.) I’m aware that this is closely related to how the deduplication part of restic works (see below).
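
For illustration, hash-list scanning needs nothing more than comparing file digests against a known set; the hash list below is hypothetical (its single entry is just the digest of an empty file), and no compression is involved anywhere.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// knownBadHashes is a hypothetical set of SHA-256 digests the scanner is
// looking for (the entry below is just the digest of an empty file).
var knownBadHashes = map[string]bool{
	"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855": true,
}

func main() {
	for _, path := range os.Args[1:] {
		data, err := os.ReadFile(path)
		if err != nil {
			continue // skip unreadable files
		}
		sum := sha256.Sum256(data)
		if knownBadHashes[hex.EncodeToString(sum[:])] {
			fmt.Println("match:", path)
		}
	}
}
```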

The chosen-plaintext attacks mentioned initially (CRIME/BREACH) are actually adaptive chosen-plaintext attacks and rely on compressing both secret data and attacker-controlled data in one go. The latter part is what introduces the information leak, and the former part is necessary to iteratively leak larger and larger parts of the secret data.

What I don’t understand is how that attack would be relevant to restic. The suggestions for implementing compression in restic would work on the blob level (after deduplication has already happened), that is, on a chunk of a single file, which means there is no mix of secret and attacker-controlled data at that level, and thus no compression-based information leak. The adaptive part also won’t work, as any modification of a file chunk changes its hash and, unless it exactly matches some other existing blob, it is uploaded again.

What you obviously can do is store a file on a backed-up computer and then check whether it was already stored in the backup repository, based on how large the data uploaded for the new snapshot is. That’s essentially what you described before with the “Windows image in Dropbox” example. But this side channel relies exclusively on deduplication and has nothing to do with whether individual blobs are compressed or not. The difficulty for an attacker here is that they have to correctly guess a whole file (or a large chunk) for deduplication to happen.

So are we maybe using “compression” to mean slightly different things? In the context of the compression feature for restic, it is about compressing individual file chunks after chunking. If the deduplication step of restic also counts as “compression”, then yes, compression can introduce information leaks.