Status of the compression feature

Data compression maps a sequence of input characters to a sequence of output characters that is no longer than the input. It exploits the empirical probability distribution function (PDF) of the input sequence, storing characters that occur more frequently with fewer bits, and thereby mapping a non-uniform input PDF to a (roughly) uniform output PDF.
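To make the frequency argument concrete, here is a small sketch in Go (restic's language, but this is not restic code) using the standard compress/flate package: a byte stream with a heavily skewed distribution compresses far better than one with a roughly uniform distribution.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"math/rand"
)

// compressedSize deflates data at the default level and returns the output size.
func compressedSize(data []byte) int {
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.DefaultCompression)
	w.Write(data)
	w.Close()
	return buf.Len()
}

func main() {
	const n = 1 << 20 // 1 MiB of input

	// Skewed PDF: few symbols, some far more frequent than others.
	skewed := make([]byte, n)
	for i := range skewed {
		skewed[i] = "aaaaaabbbc"[rand.Intn(10)]
	}

	// Roughly uniform PDF: every byte value about equally likely.
	uniform := make([]byte, n)
	rand.Read(uniform)

	fmt.Printf("skewed:  %d -> %d bytes\n", n, compressedSize(skewed))
	fmt.Printf("uniform: %d -> %d bytes\n", n, compressedSize(uniform))
}
```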

Deduplication is a special case of compression where the alphabet consists of chunks (rather than characters or bits). It maps a sequence of input chunks to a sequence of output chunks with a uniform PDF (i.e., every stored chunk occurs with equal frequency 1/n; the stored chunks are unique).
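As an illustration of deduplication as "compression over chunks", here is a minimal Go sketch assuming fixed-size chunks keyed by SHA-256. Restic actually uses content-defined chunking, so this only shows the principle, not the implementation.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// dedup splits data into fixed-size chunks and stores each distinct chunk only
// once, keyed by its SHA-256 hash; the "backup" is then just the ordered list
// of chunk IDs.
func dedup(data []byte, chunkSize int) (store map[[32]byte][]byte, ids [][32]byte) {
	store = make(map[[32]byte][]byte)
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunk := data[off:end]
		id := sha256.Sum256(chunk)
		if _, ok := store[id]; !ok {
			store[id] = chunk // first occurrence: store the chunk
		}
		ids = append(ids, id) // every occurrence: just reference it
	}
	return store, ids
}

func main() {
	// Highly repetitive input: every 512-byte chunk is identical.
	block := bytes.Repeat([]byte("A"), 512)
	data := bytes.Repeat(block, 1000)

	store, ids := dedup(data, 512)
	fmt.Printf("%d chunk references, %d unique chunks stored\n", len(ids), len(store))
}
```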

Yes, there is information leakage with deduplication, under the threat model described above.

The relevance of this threat model is subjective (and in my view not applicable to most users). Restic, however, should account for a broad set of threat models.

I suggest adhering to the design guidelines set by restic's founder:

  • Security should be a first-class citizen, and it will be increasingly important as computing moves to the cloud

  • Restic does not have to account for all use cases. Do a few things and do them well, à la WireGuard

  • Restic should not provide the option of no encryption (or encryption with a dummy password)

  • Restic should be careful about software supply chain attacks (Go is a good choice here). Governments and other groups are increasingly turning to this approach

In the end, do we need compression at all, given cheap storage and falling prices? I can't tell.

How many lines of code, how much software complexity, and how much repository fragility will it introduce?

You might want to look at the threat model at References - restic 0.16.3 documentation. As the information leak we've discussed is already caused by deduplication, I don't see a reason not to compress blobs in the repository, as this won't add a new information leak.

Compression should help a lot with the size of the metadata; my guess would be a reduction by at least a factor of 2 or 3.

I think @fd0 has spoken pretty clearly about the project's conclusions in this area, after long previous discussions on GitHub: yes to optional compression, when resources are available.

This thread is very clearly defined (both by the OP and subsequent bumps) as a request for a status update, and an additional request for whether there are alternative ways that users can support the development of this feature.

Personally, I'd be interested to hear an authoritative answer to these questions when the time is right.

I therefore wonder why every compression-status question seems to get diverted into a discussion of attacks. How about those interested in that topic start a new thread on the issue, and leave this one for its intended purpose, or unanswered (if the devs have nothing to say right now)?

Thank you @Nev for the conclusion. On my side, there are new use cases that would greatly reduce cost if compression were available. We should keep in mind that this is an open source project and the developers are probably spending their free time on it, which is much appreciated! Since I would spend less on cloud storage if the feature were available, I am considering spending some money to get it done. Would it be possible to set up a bounty program for such things? I would start by adding $200 USD to the pot.

There are several things I would like to see. I think a lot can be learned from the ZFS approach, which is not just good, but great.

  • They can use different compression methods and levels per block.
  • They have an early-abort mechanism for blocks that do not compress well.

So text files and other uncompressed data always get compressed. If the filesystem encounters, say, an LZMA stream, it aborts early and just stores the block as it is. They are also in the process of implementing adaptive compression levels: if the system can handle higher levels without slowdown, it uses them; otherwise it reduces the level, unless the user forces one. That way every user can decide whether they want to trade time for size.
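For what it's worth, here is a minimal Go sketch of such an early-abort heuristic, assuming a simple "compress, then compare sizes" check with gzip and an arbitrary 10% threshold. ZFS's actual mechanism (and whatever restic ends up doing) works differently; this only illustrates the idea of skipping blocks that don't compress well.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"math/rand"
)

// maybeCompress compresses the block and keeps the result only if it saves at
// least minSaving of the original size; otherwise the block is stored as-is.
func maybeCompress(block []byte, minSaving float64) (out []byte, compressed bool) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(block)
	w.Close()

	if float64(buf.Len()) <= float64(len(block))*(1-minSaving) {
		return buf.Bytes(), true
	}
	return block, false // not worth it: keep the original bytes
}

func main() {
	text := bytes.Repeat([]byte("easily compressible text "), 200)

	// Stand-in for already-compressed data (e.g. an LZMA stream): random bytes
	// don't compress, so the heuristic stores them unchanged.
	already := make([]byte, len(text))
	rand.Read(already)

	for _, b := range [][]byte{text, already} {
		out, ok := maybeCompress(b, 0.10) // require at least a 10% saving
		fmt.Printf("in=%d out=%d stored compressed=%v\n", len(b), len(out), ok)
	}
}
```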

Anyway, great software.