Backing up data supplied by an attacker

Eli6 · April 6, 2022, 7:29pm

Does restic’s security hold up if the client backs up some data supplied by an attacker, mixed up with the client’s data?

Update

To learn about this type of attack, see the TODO section here on attacker-supplied data:

(A decade ago, the NSA was decrypting VPN traffic using variations of such techniques.)

rawtaz · April 6, 2022, 9:30pm

I’m sorry to be blunt, but this question doesn’t make much sense. Please rephrase it so it’s more specific, if you are genuinely wondering about some security related aspect of what data is backed up. Generally speaking restic doesn’t give a foo what data you back up, hence the question being rather moot.

akrabu · April 7, 2022, 4:26pm

I mean, Restic just backs up data to a repository. It doesn’t go around launching random executables. You could back up a virus if you wanted to. It would store it in the repository with everything else. Viruses are only a problem if you execute them. So you’d literally have to restore the virus, and purposefully run it, before anything could happen.

In short, Restic doesn’t care what you back up. And it’s not going to randomly launch programs, so you’re as safe going into it as you were before you backed up. Should someone then actually launch the virus from the computer, and it screws things up, you’d be able to use Restic to restore after wiping the machine and it would restore everything. Including the unlaunched virus which, hopefully by then you’d recognize and just delete. Again, viruses can’t do anything if you don’t run them. It’s just a file like any other.

Eli6 · April 7, 2022, 4:34pm

Hello!

I think you are not familiar with security, and didn’t get the question. Look up chosen plaintext and similar types of attacks (it’s quite well known and serious). I also added an update. The question is already clear.

Let’s see if Michael has something to add.

rawtaz · April 7, 2022, 5:16pm

Your question is still moot though. You are asking if restic’s security will hold up, which is the same as “is it secure?”, since your question is a yes/no type of question.

The answer is obviously not “no”, because if it was, then there would be a known vulnerability in restic, and that is simply not the case.

At the same time, no one person in this world can with 100% certainty tell you “yes” either, because no software in this world can be 100% known to be completely bug-free and “secure” in the sense that one can answer “yes” to that question. Giving you that answer would be equivalent to saying that there is no possibility of a vulnerability or bug in restic (and the same would be true for any other software as well - this is not restic specific). So, not a possible answer.

The closest is “as far as we know, there are no known attacks that would break restic’s security”. But you didn’t ask a question that lends itself to such an answer. Hence your question comes across as more of a trolling question than anything else, to be honest with you. If on the other hand you were to be more specific about what you were thinking of and wondering about, and not asking such a definite question, it would be different.

It’s just a matter of how we read each other’s text, I suppose. It’s clearer now what you meant, when you were more specific.

By the way, I suppose you already saw the documented threat model: References — restic 0.13.0 documentation. It should cover the general take on how secure restic is against various types of attacks.

alexweiss · April 8, 2022, 4:58am

Sorry, your question is far from clear to me. Can you please explicitly specify the attack you are thinking of and which aspect of security you think this attack could compromise?

Thanks!

Eli6 · April 8, 2022, 7:47am

Sure!

Let me provide a simple example of how a chosen-plaintext attack works. Microsoft sees you have a restic repository in OneDrive that is updated frequently on their servers. They wonder what you have in there. They suspect you might illegally hold a copy of their latest version of Windows. The next time you search the internet using their search engine or browser, they let a signed copy of Windows download silently in the background. It goes into your download directory, and the next day you run restic backup ~\home and it gets backed up. They then analyze the changes in the repository; surprise, only a few 5MB pack files are added. Microsoft sends a letter to law enforcement stating that Alex provably has had a copy of their product, since their input got deduplicated.

Well, how about we extend this and use it as a client-side search tool? Say the user has connections with foreign states, or is suspected of possessing CSAM: how about we search their computer for known images (for example, a Chinese user storing the Tank Man or other ideological photos can be red-flagged)? How about we search for the existence of known messages, or test passwords on disk, etc.? Or it could be just a sequence of PDF files added to the source while observing the ciphertext over time (of course, new DEKs are chosen for pack files, but now you have both plaintext and ciphertext and can attack the master key, or cross-correlate with other pack files).

You get the idea: the attacker has some control over the plaintext, for example they may add, remove or change data, even though they don’t know the part of the data that comes from the user.

And this is why Filippo complains that restic should state its threat model more carefully on its website. It’s not true that the server can be untrusted and confidentiality is still preserved. It depends on other conditions. Filippo notes fingerprint watermarks and verification of the polynomial chunker algorithm, and suggests a TODO. I wonder if the restic developers followed up on the yellow flags noted in the above review.

By the way, this is explained on the Wikipedia page I mentioned and is easy to find by searching for keywords from the article I linked. The title is also rather clear: backing up attacker-supplied data!

alexweiss · April 8, 2022, 8:37am

About your concrete example:

This is in theory possible but does not depend on any concrete implementation detail of restic, but instead on the mere fact that we have a deduplicated backup. So, you can change crypto or the chunker in any way you want and won’t get rid of this theoretical problem. The only remedy is to not deduplicate.

Here are some reasons why your “attack” wouldn’t practically work - given that Microsoft is not able to break AES:

Where do they get the information that your backup comes from your host and really did back up /home? This information is completely encrypted and could even be faked…
What if /home is backed up but /home/Downloads was excluded? Again, this is all encrypted, and maybe not even saved at all.

About the yellow flags, I think there are two:

The crypto seems to be OK, but somewhat non-standard. (This is mainly because of the uncommon combination of AES-CTR with Poly1305-AES; all components are in fact standard.) Changing this would involve adding the possibility to choose the crypto and also rebuilding the repos whose crypto you want to change.
The chunking, which might theoretically allow chosen-plaintext attacks to get information about the repo. The threat here is basically not your example above, but the question of whether it is possible to decrypt (parts of) the data only by knowing chunking details, without breaking AES.

Here I also clearly agree with the yellow flag. Indeed, I think it might well be (theoretically) possible to get information about, or even uncover, the chunking parameter if you are able to do arbitrary chosen-plaintext attacks. That in turn could theoretically reveal information about the data saved in the repo.

But I think we must also put the attack vector into perspective: to perform such an attack, the attacker must be able to control the whole process from entering data to observing the changes in the repo, and be able to do this often. In my opinion this is only feasible if you control the host which runs the backup or know the key to the repo. In both cases I would assume that a theoretical attack on the chunking parameters is not the main problem…

Eli6 · April 8, 2022, 7:37pm

Thank you very much for your helpful clarification!

The chosen-plaintext attack could become practical in situations where the client and the untrusted server interact with each other over time.

Consider this concrete example. The user has a Dropbox folder in the home directory. Dropbox could thus add arbitrary data to the user’s home. The user also backs up their home directory using restic to a cloud provider, say AWS.

The government gets interested in the user and issues subpoenas to the cloud providers. The case is referred to an analyst at the FBI who knows about the chosen-plaintext attack. The analyst injects arbitrary data into the host via Dropbox and observes the output ciphertext in restic’s repository on AWS. The analyst could repeat this attack over time as needed.

Could you comment on whether, in this example, the analyst would succeed in unlocking restic’s repository?

The answer to the original question becomes clear in the context of this example.

MichaelEischer · April 8, 2022, 9:09pm

If you mean “decrypt arbitrary parts of the repository” by “unlocking” then the answer is clearly no. At least unless your imaginary analyst has the capability to break AES-CTR. What could be possible is to use deduplication as a side channel to learn, in a very limited fashion, whether some data is part of the repository or not.

However, that is severely limited: an attacker would have to correctly guess complete file chunks (about 1MB in size) or files. A chunk is only deduplicated if it exactly matches, in full, what is stored in the repository. There’s also an additional problem: restic packs together multiple chunks into a single pack file, such that it is usually not possible to just correlate blob sizes with pack sizes. (An attacker can calculate how many blobs are stored in a pack file, however there is no indication of the size of each individual blob. Well, at least as long as more than one blob is contained in the pack file.)

There is also another problem: the backup size would have to increase by a sufficient amount to provide a clear signal. Just checking whether a file with a few MBs exists probably drowns in noise. In addition, such an attack would attract attention, as it causes the backup size to drastically increase (unless someone already knows exactly what they are looking for, in which case there’s not much point in searching for it).
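To make the side channel concrete, here is a minimal sketch of a content-addressed store. It is a toy under stated assumptions, not restic’s repository format: `ToyDedupStore` is a hypothetical name, nothing is encrypted, there are no pack files, and a 1024-byte string stands in for a roughly 1MB chunk. It only shows that an exact guess leaves the store size unchanged while a guess that is off by a single byte is stored in full.

```python
import hashlib


class ToyDedupStore:
    """Toy content-addressed store (hypothetical; no encryption, no pack files)."""

    def __init__(self):
        self.blobs = {}  # SHA-256 hex digest -> blob length in bytes

    def add(self, chunk: bytes) -> int:
        """Store a chunk and return how many bytes the store grew by."""
        blob_id = hashlib.sha256(chunk).hexdigest()
        if blob_id in self.blobs:
            return 0  # identical chunk already present: deduplicated
        self.blobs[blob_id] = len(chunk)
        return len(chunk)


store = ToyDedupStore()
secret_chunk = b"A" * 1024           # stands in for a chunk of the user's data
store.add(secret_chunk)

exact_guess = b"A" * 1024            # attacker guessed the whole chunk correctly
near_guess = b"A" * 1023 + b"B"      # attacker got a single byte wrong

growth_exact = store.add(exact_guess)
growth_near = store.add(near_guess)
print(growth_exact, growth_near)
```

In a real repository the observer sees only encrypted pack files that bundle several blobs, so the growth signal is much noisier than in this toy.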

alexweiss · April 9, 2022, 8:13am

The ideal chosen plain text attack adapted to restic would go like this:

Add exactly 1 known file with a size between 512 KiB and 8 MiB. Then back up and check whether this added 1 or 2 chunks to the backup.
Repeat this very often (millions of times) with different input.
This would then allow you to “break” the unknown chunker polynomial.
(In fact I don’t know enough about the Rabin fingerprint to estimate how hard it is to brute-force this, but I’m very sure that we are talking about millions of backups to be performed.)
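The observable in the steps above, the number of chunks a known file is split into, can be sketched as follows. This is an illustration under loud assumptions: it uses a simple polynomial rolling hash as a stand-in for the Rabin fingerprint, it is not restic’s chunker, and `chunk_lengths`, `secret_base` and `mask` are hypothetical names and parameters. Sizes are scaled down from KiB/MiB to bytes so that it runs quickly.

```python
import random


def chunk_lengths(data, min_size, max_size, secret_base,
                  window=64, mask=(1 << 10) - 1, mod=(1 << 61) - 1):
    """Split data at content-defined boundaries and return the chunk lengths."""
    top = pow(secret_base, window - 1, mod)  # weight of the byte leaving the window
    lengths = []
    start = 0
    while start < len(data):
        end = min(start + max_size, len(data))
        cut = end  # forced cut at max_size or at the end of the data
        h = 0
        for i in range(start, end):
            if i - start >= window:
                h = (h - data[i - window] * top) % mod  # drop the oldest byte
            h = (h * secret_base + data[i]) % mod       # shift in the new byte
            size = i - start + 1
            # Below min_size the hash value is ignored entirely.
            if size >= min_size and (h & mask) == 0:
                cut = i + 1
                break
        lengths.append(cut - start)
        start = cut
    return lengths


rng = random.Random(0)
known_file = bytes(rng.getrandbits(8) for _ in range(1500))
lengths = chunk_lengths(known_file, min_size=512, max_size=8192, secret_base=257)
print(len(lengths), lengths)
```

Whether the same file splits into one chunk or more depends on `secret_base`, which plays the role of the unknown polynomial here. Each backup therefore yields only a small amount of information about it, which is why so many repetitions would be needed.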

Now you know the chunker polynomial. This gives you extra information about chunk contents for chunks greater than 512 KiB, such as: this byte can’t be xx, because if it were, the chunker would have split the data here.

And this, in theory, could allow you to build a better brute-force algorithm against the AES-CTR encryption. Again, I’m no expert on AES, but my guess would be that this could ease brute-forcing while still leaving it pretty hard to “break” AES.

Now, some comments:

As already mentioned, this “ideal” attack would require either access to the key (which would already allow you to simply decrypt the repository) or access to the machine doing the backup (where the usual next step would be to get the key or read the data on the system, not to try breaking a chunker algorithm…). In general these “chosen plaintext attacks” are much more feasible on network protocols and the like, and much harder on backup systems due to the nature of the process.
In any case, even if you have “broken” the chunker algorithm and are able to use this extra information to break AES, you are then only able to decrypt the contents of each chunk starting from 512 KiB. This is because the first 512 KiB of each chunk are always added, and no Rabin fingerprint is taken into account there. In other words: all files smaller than 512 KiB (and, for larger files, some “512 KiB blocks”…) are still safe.
Your examples were all not about decrypting contents of the repo but about checking whether some known content is present in the repo. This is of course much easier. And of course, the information gained can give you a much better answer to such a question. However, if you are able to do this kind of attack, it is much easier to simply add the known content, back up, and check whether nothing was added to the repo. And as already mentioned, this procedure does not depend at all on the chunking or the encryption, but only on the mere fact that we are deduplicating.

MichaelEischer · April 9, 2022, 8:57am

As the chunker algorithm splits chunks when the value of the fingerprint is below a certain value, it is somewhat harder to tell how much information is gained each time a chunk is split. The fact that the 64-bit fingerprint is calculated over a 64-byte block doesn’t help either.

It will perhaps make it a bit easier to correctly guess which ciphertext corresponds to one of your chosen plaintexts. Without that knowledge it won’t even be possible to run a brute-force attack (which for AES-256 is pointless anyway). But AFAIK AES is resistant to chosen-plaintext attacks, such that this won’t help in any way with breaking the encryption keys.

As mentioned before, the Rabin fingerprint compresses 64 bytes into 64 bits, for which the chunker checks that a certain number of leading bits are zero. So the information leakage is fairly limited. Essentially the only thing you can tell is that the last 64 bytes of a chunk result in a small fingerprint value (and that the other parts didn’t, but that will only reduce the possible plaintexts by a fairly small percentage). However, that knowledge only reduces the possible plaintexts from 2^512 to something above 2^448 (that number is a serious underestimate, as it assumes that the whole fingerprint has a known value). Without breaking the AES key there’s no chance to tell which of these possible plaintexts is correct. So knowing the chunker polynomial won’t give your data away to an attacker.
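The arithmetic behind those two numbers can be written out as a short sketch. The value `k = 20` below is purely an assumption for illustration, namely that the split condition pins down about 20 bits of the fingerprint; it is not taken from this thread.

```python
# A 64-byte window holds 512 bits of plaintext.
window_bits = 64 * 8

# The fingerprint is 64 bits wide, so at most 64 bits can leak per split point.
fingerprint_bits = 64

# Lower bound from the post: assumes the WHOLE fingerprint value were known.
remaining_bits_lower_bound = window_bits - fingerprint_bits

# Hypothetical: if the split condition only fixes k bits of the fingerprint,
# far more candidate plaintexts remain.
k = 20
remaining_bits_if_k_known = window_bits - k

print(remaining_bits_lower_bound, remaining_bits_if_k_known)
```

Either way, the remaining search space stays far beyond anything that could be enumerated.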

Eli6 · March 22, 2025, 11:48am

There you go!

Chunking attack on backup software including restic:

That’s similar to the Chosen Plaintext Attack I mentioned here.

It would be great if this could be patched and explained.