How are blobs deduplicated with encryption?

I’m confused about how blob deduplication works with encryption. Does the same plaintext chunk always produce the same encrypted blob and the same SHA-256 sum? The general encrypted data format is IV || ciphertext || MAC, so if the IV is unique per blob and not a function of the plaintext, wouldn’t the same plaintext chunk produce different blobs?

Deduplication obviously works, so I’m wondering where I’m misunderstanding here. Thanks!

Deduplication happens before the encryption. A blob which exists encrypted will not be encrypted again, but just referenced by its id (which is the sha256sum of the plaintext).
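To make the ordering concrete, here is a minimal Python sketch, not restic's actual code. `toy_encrypt` is a hypothetical stand-in (a SHA-256 keystream with a random IV, no MAC) for the real cipher, and `store_blob`, `known_ids` and `storage` are illustrative names. It only shows that the ID is computed from the plaintext, so the random IV never affects the duplicate check.

```python
import hashlib
import os

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy stand-in for real encryption: random IV || XOR-keystream ciphertext.
    Illustration only: not secure, no MAC, and not restic's cipher."""
    iv = os.urandom(16)
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    return iv + ciphertext

def store_blob(key, chunk, known_ids, storage):
    """Hash the plaintext first; encrypt and store only if the ID is new."""
    blob_id = hashlib.sha256(chunk).hexdigest()
    if blob_id in known_ids:
        return blob_id, False          # duplicate: nothing is encrypted or written
    storage.append(toy_encrypt(key, chunk))
    known_ids.add(blob_id)
    return blob_id, True

key = b"k" * 32                        # hypothetical key, for the demo only
known_ids, storage = set(), []
first = store_blob(key, b"same chunk", known_ids, storage)
second = store_blob(key, b"same chunk", known_ids, storage)
print(first[0] == second[0], len(storage))
```

Encrypting the same chunk twice with `toy_encrypt` gives different bytes because of the random IV, but the second `store_blob` call never gets that far.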

So is it always the case that the sha256sum filename of any file in a repository is the hash of the plaintext content, not the encrypted file itself? Just to make sure

I suggest you study these two links. restic is open source, so everything is documented (some parts better than others, but nothing is secret):

and

https://restic.readthedocs.io/en/latest/100_references.html#terminology


I’m still confused… here’s a quote from the design document:

For all other files stored in the repository, the name for the file is the lower case hexadecimal representation of the storage ID, which is the SHA-256 hash of the file’s contents. This allows for easy verification of files for accidental modifications, like disk read errors, by simply running the program sha256sum on the file and comparing its output to the file name.

So the storage ID / filename is the SHA-256 hash of the file’s literal contents, including the ciphertext. How can deduplication be performed before encryption if the hash is computed after encryption? Does deduplication use a method other than “produce blob, compute checksum, see if checksum is already in repo”?

Yes, but that has no connection with deduplication. It is only a convenient file naming scheme used for verification. Think of each file as an encrypted envelope, named by the hash of its encrypted contents, with the hashes of the unencrypted data inside.
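For reference, the verification that the naming scheme enables can be sketched in a few lines of Python. This is only an illustration of the "file name equals SHA-256 of the file's contents" check from the quoted design document; `verify_repo_file` is a hypothetical helper, and the demo uses a throwaway file with random bytes rather than a real repository file.

```python
import hashlib
import os
import tempfile

def verify_repo_file(path: str) -> bool:
    """Return True if the file's name equals the SHA-256 hex digest of its bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() == os.path.basename(path)

# Demo: a temporary file standing in for an encrypted repository file.
with tempfile.TemporaryDirectory() as d:
    data = os.urandom(64)                      # pretend this is encrypted data
    path = os.path.join(d, hashlib.sha256(data).hexdigest())
    with open(path, "wb") as f:
        f.write(data)
    ok_before = verify_repo_file(path)
    with open(path, "ab") as f:
        f.write(b"\x00")                       # simulate accidental modification
    ok_after = verify_repo_file(path)

print(ok_before, ok_after)
```

Note that this check never looks at plaintext, which is why it is useful for detecting corruption but plays no part in deduplication.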

Information about the hashes of the real content chunks (before any encryption) and their location in blobs is stored in index files. restic hashes each new data chunk and checks the index to see whether it is already stored. If it is, the stored data is reused by referencing it, and nothing is written again. This is how deduplication happens.
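A rough Python sketch of that index check, under stated assumptions: `ToyRepo`, `add_chunk` and `seal` are hypothetical names, `seal` is a no-op placeholder where real code would encrypt, and a single in-memory `bytearray` stands in for the container file (what the restic documentation calls a pack file).

```python
import hashlib

def seal(chunk: bytes) -> bytes:
    """Placeholder for encryption; returns the data unchanged in this sketch."""
    return chunk

class ToyRepo:
    def __init__(self):
        self.index = {}               # plaintext SHA-256 -> (offset, length)
        self.container = bytearray()  # stand-in for one container file

    def add_chunk(self, chunk: bytes) -> str:
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if chunk_id not in self.index:          # only unseen chunks are stored
            sealed = seal(chunk)
            self.index[chunk_id] = (len(self.container), len(sealed))
            self.container += sealed
        return chunk_id                         # files keep this ID as a reference

repo = ToyRepo()
file_a = [b"header", b"shared body"]
file_b = [b"shared body", b"footer"]
refs_a = [repo.add_chunk(c) for c in file_a]
refs_b = [repo.add_chunk(c) for c in file_b]
print(len(repo.index), len(repo.container))
```

Both files end up referencing the same ID for `b"shared body"`, and its bytes are stored only once.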

Also note that one blob can contain data from multiple chunks and from multiple files. This is why, in an actively used repository, you should run pruning to compact blobs by removing the data of unused chunks, effectively repacking blobs to recover unused space.

Thanks, this is definitely starting to make sense!

Also note that one blob can contain data from multiple chunks and from multiple files. This is why, in an actively used repository, you should run pruning to compact blobs by removing the data of unused chunks, effectively repacking blobs to recover unused space.

By “blobs and chunks”, do you mean “packs and blobs”? I thought that blobs are the units referenced between files, and packs are the storage containers for multiple blobs. Previously I was using “chunk” to basically mean a raw slice of a file before being packaged as a blob.

Also, since blobs are only referenced by their plaintext content hash and their location in a pack file, and this information is only available in pack file headers and index files, blobs are always stored in pack files and never in their own files, right? The latter would mean you’d need to look up the file by its encrypted file hash, so I assume that’s not possible.

Yeah, maybe my terminology is not the same as the one used in restic, but clearly you now get how deduplication works :)

The index files contain the necessary lookup from plaintext content hash of a blob to the pack file name and the corresponding file offsets. For a backup and other operations, restic just loads the whole index into memory and is then able to efficiently locate the storage location of a blob.
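As a sketch of that lookup (illustrative Python, not restic's implementation): `build_pack`, `load_blob`, `packs` and `index` are hypothetical names, encryption is left out for brevity, and everything lives in memory. Pack names are the hash of the pack's bytes, while index keys are hashes of the individual blobs' plaintext.

```python
import hashlib

def build_pack(blobs):
    """Concatenate blobs into one pack; return (pack_name, pack_bytes, entries)."""
    body = bytearray()
    entries = {}
    for blob in blobs:
        blob_id = hashlib.sha256(blob).hexdigest()   # hash of the plaintext
        entries[blob_id] = (len(body), len(blob))    # offset and length in the pack
        body += blob   # a real pack would hold the encrypted blob here
    pack = bytes(body)
    # The pack's name is the hash of its stored bytes, not of any single blob.
    return hashlib.sha256(pack).hexdigest(), pack, entries

packs = {}   # pack name -> pack bytes (stand-in for files in the repository)
index = {}   # blob ID -> (pack name, offset, length), held fully in memory
for blobs in ([b"alpha", b"beta"], [b"gamma"]):
    name, data, entries = build_pack(blobs)
    packs[name] = data
    for blob_id, (offset, length) in entries.items():
        index[blob_id] = (name, offset, length)

def load_blob(blob_id: str) -> bytes:
    """Locate a blob via the index and slice it out of its pack."""
    pack_name, offset, length = index[blob_id]
    return packs[pack_name][offset:offset + length]

wanted = hashlib.sha256(b"beta").hexdigest()
print(load_blob(wanted))
```

The lookup never needs the pack's own hash as a starting point; it is reached only through the index entry for the blob.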