Status of the compression feature

Data compression maps a sequence of input characters to a sequence of output characters that is no longer than the input. It exploits the empirical probability distribution function (PDF) of the input sequence, storing characters that occur more frequently with fewer bits, and thereby mapping a non-uniform input PDF to a (roughly) uniform output PDF.
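To make the frequency argument concrete, here is a small sketch in Go (restic's language, but this is not restic code) using the standard compress/flate package: a byte stream with a heavily skewed distribution compresses far better than one with a roughly uniform distribution.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"math/rand"
)

// compressedSize deflates data at the default level and returns the output size.
func compressedSize(data []byte) int {
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.DefaultCompression)
	w.Write(data)
	w.Close()
	return buf.Len()
}

func main() {
	const n = 1 << 20 // 1 MiB of input

	// Skewed PDF: few symbols, some far more frequent than others.
	skewed := make([]byte, n)
	for i := range skewed {
		skewed[i] = "aaaaaabbbc"[rand.Intn(10)]
	}

	// Roughly uniform PDF: every byte value about equally likely.
	uniform := make([]byte, n)
	rand.Read(uniform)

	fmt.Printf("skewed:  %d -> %d bytes\n", n, compressedSize(skewed))
	fmt.Printf("uniform: %d -> %d bytes\n", n, compressedSize(uniform))
}
```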

Deduplication is a special case of compression where the alphabet consists of chunks (rather than characters or bits). It maps a sequence of input chunks to a sequence of output chunks with a uniform PDF (i.e., every stored chunk occurs with equal frequency 1/n; the stored chunks are unique).
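As an illustration of deduplication as "compression over chunks", here is a minimal Go sketch assuming fixed-size chunks keyed by SHA-256. Restic actually uses content-defined chunking, so this only shows the principle, not the implementation.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// dedup splits data into fixed-size chunks and stores each distinct chunk only
// once, keyed by its SHA-256 hash; the "backup" is then just the ordered list
// of chunk IDs.
func dedup(data []byte, chunkSize int) (store map[[32]byte][]byte, ids [][32]byte) {
	store = make(map[[32]byte][]byte)
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunk := data[off:end]
		id := sha256.Sum256(chunk)
		if _, ok := store[id]; !ok {
			store[id] = chunk // first occurrence: store the chunk
		}
		ids = append(ids, id) // every occurrence: just reference it
	}
	return store, ids
}

func main() {
	// Highly repetitive input: every 512-byte chunk is identical.
	block := bytes.Repeat([]byte("A"), 512)
	data := bytes.Repeat(block, 1000)

	store, ids := dedup(data, 512)
	fmt.Printf("%d chunk references, %d unique chunks stored\n", len(ids), len(store))
}
```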

Yes, there is information leakage with deduplication, under the threat model described above.

The relevance of this threat model is subjective (and in my view not applicable to most users). Restic, however, should account for a broad set of threat models.

I suggest adhering to the design guidelines set by restic's founder:

  • Security should be a first-class citizen, and it will be increasingly important as computing moves to the cloud

  • Restic does not have to account for all use cases. Do a few things and do them well, à la WireGuard

  • Restic should not provide the option of no encryption (or encryption with a dummy password)

  • Restic should be careful about software supply chain attacks (Go is a good choice here). Governments and other groups are increasingly turning to this approach

In the end, do we need compression at all, given cheap storage and falling prices? I can't tell.

How many lines of code, how much software complexity, and how much repository fragility will it introduce?

You might want to look at the threat model at References - restic 0.16.3 documentation. As the information leak we've discussed is already caused by deduplication, I don't see a reason not to compress blobs in the repository, as this won't add a new information leak.

Compression should help a lot with the size of the metadata; my guess would be a reduction by at least a factor of 2 or 3.

I think @fd0 has spoken pretty clearly about the project's conclusions in this area, after long previous discussions on GitHub: yes to optional compression, when resources are available.

This thread is very clearly defined (both by the OP and subsequent bumps) as a request for a status update, and an additional request for whether there are alternative ways that users can support the development of this feature.

Personally, I'd be interested to hear an authoritative answer to these questions when the time is right.

I therefore wonder why every compression-status question seems to get diverted into a discussion of attacks. How about those interested in that topic start a new thread on the issue, and leave this one for its intended purpose, or unanswered (if the devs have nothing to say right now)?

Thank you @Nev for the conclusion. On my side, there are new use cases that would greatly reduce cost if compression were available. We should keep in mind that this is an open source project and the developers are probably spending their free time on it, which is much appreciated! Since I would spend less on cloud storage if the feature were available, I am considering spending some money to get it done. Would it be possible to set up a bounty program for such things? I would start by adding $200 USD to the pot.

There are several things I would like to see. I think a lot can be learned from the ZFS approach, which is not just good, but great.

  • They can use different compression methods and levels per block.
  • They have an early-abort mechanism for blocks that do not compress well.

So text files and other uncompressed data always get compressed. If the filesystem encounters, say, an LZMA stream, it aborts early and just stores the block as it is. They are also in the process of implementing adaptive compression levels: if the system can handle higher levels without slowdown, it uses them; otherwise it reduces the level, unless the user forces one. That way every user can decide whether they want to trade time for size.
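For what it's worth, here is a minimal Go sketch of such an early-abort heuristic, assuming a simple "compress, then compare sizes" check with gzip and an arbitrary 10% threshold. ZFS's actual mechanism (and whatever restic ends up doing) works differently; this only illustrates the idea of skipping blocks that don't compress well.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"math/rand"
)

// maybeCompress compresses the block and keeps the result only if it saves at
// least minSaving of the original size; otherwise the block is stored as-is.
func maybeCompress(block []byte, minSaving float64) (out []byte, compressed bool) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(block)
	w.Close()

	if float64(buf.Len()) <= float64(len(block))*(1-minSaving) {
		return buf.Bytes(), true
	}
	return block, false // not worth it: keep the original bytes
}

func main() {
	text := bytes.Repeat([]byte("easily compressible text "), 200)

	// Stand-in for already-compressed data (e.g. an LZMA stream): random bytes
	// don't compress, so the heuristic stores them unchanged.
	already := make([]byte, len(text))
	rand.Read(already)

	for _, b := range [][]byte{text, already} {
		out, ok := maybeCompress(b, 0.10) // require at least a 10% saving
		fmt.Printf("in=%d out=%d stored compressed=%v\n", len(b), len(out), ok)
	}
}
```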

Anyway, great software.