Detect hash collisions

I know, we have had the topic here and there:

What happens if a blob of a newly added backup results in the same checksum as a blob that already exists in the repo? Restic would incorrectly de-duplicate the new data, and on restore, incorrect data would be written (unnoticed, in the middle of some file).

It all boils down to extensive discussions about probability. But to really validate a backup, it is currently necessary to restore it after completion and compare it byte-by-byte.

I would like to take this up again. Would it make sense to add a validation parameter which automates this with a dry-run restore-and-compare after completion of a backup?

In a further development step, maybe the affected file could be read again and its blobs modified in some way (maybe simply divided) to fix the problem.

What do you think?

It is indeed a discussion about probability. Have you seen how low the probability of a collision actually is? Do you realize that the widely used Git has the same type of “issue”? If yes, are you proposing a workaround for Git too? If not, what makes it acceptable there but not here?

My personal opinion is that this is a non-issue in all practical aspects. Do you have a use case where this is an actual problem?

Finally, I’m not trying to be snarky, even if it may sound like it :slight_smile:

1 Like

No problem, I just want to discuss it :smiley:

Even if you disregard the possibility of duplicate checksums, it can make sense for various reasons to check a backup completely after it has been created. So why not also byte-by-byte? Especially backup software with the aim of “doing it right” (so everyone can trust it completely) should have the “paranoia option”, I think.

I want to use restic for de-duplicated archiving, so I want to make sure it really contains what it claims to contain. Of course, I can do the byte-by-byte check manually or with a script. But the best method - I think - would be for restic to support checking byte-by-byte itself.
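Just to illustrate what I mean by “a script”: here is a minimal sketch in Python. The paths are purely made up, and it assumes the repository and password are provided via the usual RESTIC_REPOSITORY / RESTIC_PASSWORD environment variables.

```python
# Minimal "paranoia check" sketch: restore the latest snapshot into a scratch
# directory, then compare the restored tree byte-by-byte against the source.
# Paths are placeholders; restic reads the repository and password from the
# RESTIC_REPOSITORY / RESTIC_PASSWORD environment variables.
import filecmp
import subprocess
import sys

SOURCE = "/data/important"        # hypothetical backup source
TARGET = "/tmp/restic-verify"     # hypothetical scratch directory

subprocess.run(["restic", "restore", "latest", "--target", TARGET], check=True)

def compare(a: str, b: str) -> bool:
    """Recursively compare two directories, forcing a full content comparison."""
    cmp = filecmp.dircmp(a, b)
    ok = not (cmp.left_only or cmp.right_only)
    if not ok:
        print("entries only on one side:", cmp.left_only, cmp.right_only)
    # shallow=False reads the file contents instead of trusting size/mtime alone
    _, mismatch, errors = filecmp.cmpfiles(a, b, cmp.common_files, shallow=False)
    if mismatch or errors:
        print("mismatched:", mismatch, "unreadable:", errors)
        ok = False
    for sub in cmp.common_dirs:
        ok = compare(f"{a}/{sub}", f"{b}/{sub}") and ok
    return ok

# restic usually recreates the absolute source path below the target directory;
# adjust the comparison path if your layout differs.
if not compare(SOURCE, f"{TARGET}{SOURCE}"):
    sys.exit("byte-by-byte verification FAILED")
print("byte-by-byte verification passed")
```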

No, you don’t need to.

The chance of that happening with SHA-256 is similar to the chance of someone guessing a 40-character password (with special characters).

You could fill the Earth with data, back it all up with restic, and still not see it happen.
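To put rough numbers on that comparison (assuming a password alphabet of roughly 95 printable ASCII characters):

```python
import math

# Rough comparison of the two search spaces: SHA-256 outputs vs. 40-character
# passwords drawn from the ~95 printable ASCII characters.
sha256_space = 2 ** 256      # ~1.2 * 10^77
password_space = 95 ** 40    # ~1.3 * 10^79

print(f"SHA-256 outputs:   10^{math.log10(sha256_space):.1f}")
print(f"40-char passwords: 10^{math.log10(password_space):.1f}")
```

Both land in the 10^77 to 10^79 range, so the analogy is about right.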

I do need to, if I really want to be sure without having to trust something I cannot verify myself.

What about data errors on the backup disk (e.g. a flash drive), months or years after starting with a repo? If I can re-check all blob checksums for that, why should it not be possible to compare them with the original data too?

The second sentence is where your fallacy lies. To “really” validate a backup, it does not suffice to restore it and compare byte-by-byte.

The point is that there are many things that could go wrong with your restore (backend, remote connection, local storage where it will be saved, … - all kinds of hardware issues) and many things that could go wrong with the comparison (source storage, restored storage, CPU, memory, … - all kinds of hardware issues), which could give you a false positive validation.

This might occur with a very low probability, and you could take measures to lower this probability even further (ECC RAM, multiple runs on different CPUs, using several hard drives, etc.), but as hardware can fail, you can never be 100% sure about your validation.

Now, with SHA-256 you can never be 100% sure that there is no hash collision, but it is very, very unlikely.

So, you might run a validation which is able to detect a hash collision if it runs without errors. But if the probability of an erroneous validation run is much higher than that of an actual hash collision, there is not much sense in it.
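Put symbolically: let $p_{\mathrm{err}}$ be the probability that a validation run itself silently produces a false OK (hardware issues anywhere in the chain), and $p_{\mathrm{coll}}$ the probability of an actual hash collision in the repository. Both are purely symbolic here, but very loosely,

$$ P(\text{wrong data slips through anyway}) \;\gtrsim\; p_{\mathrm{err}} \;\gg\; p_{\mathrm{coll}}, $$

so the term the byte-by-byte check is meant to eliminate barely moves the overall risk.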

So, in fact - as you have mentioned - your question really boils down to an extensive discussion about probabilities :wink:

1 Like

First of all, thank you very much for all contributions to this discussion so far. :slightly_smiling_face:

Yes, it may well be that errors in data processing occur statistically more often than hash collisions. I just find it a little difficult to leave one out on purpose just because the other is more likely. Especially if the solution were relatively simple (a dry-run restore and compare with the source). It feels like not adding a lightning rod because I live in an earthquake area.

We only trust the pseudo-uniqueness of hash values because the one-way nature of the function makes reverse testing difficult for us. That different contents can generate the same hash is undisputed, but since we cannot calculate which ones they would be, it is also difficult for us to find examples, and so we live with probabilities.

If the probability of an incorrect, recurring (!) validation run that outputs a false OK is so much bigger than the chance of a hash collision (even though content and amount of data are completely left out of this calculation), we could design the validation in such a way that the chance of a false OK becomes very small (e.g. two separate procedures with different CPU and memory usage, reading method, etc.).
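Just to spell out the arithmetic behind that idea: if the two procedures really do fail independently, with false-OK probabilities $p_1$ and $p_2$, then

$$ P(\text{both give a false OK}) = p_1 \cdot p_2 \;\ll\; \min(p_1, p_2). $$

The hard part in practice is of course making the failure modes genuinely independent.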

You can of course increase the effort immeasurably, but I didn’t want to go there at all. I just wanted to encourage the integration of byte-by-byte verification after backup creation.

I believe I understand your concern, you simply want to get rid of the extremely unlikely chance that there’s a hash collision. I dare say that the absolute majority of users think the risk is acceptable (just like the entire world who uses Git and similar software that has the same “issue”), but your wish is of course still valid.

Surely your suggestion could be implemented one way or the other. It would protect against the extremely unlikely hash collision, but not have much effect on verifying or knowing about other types of “corruption”, which are much, much more likely to happen than a hash collision is.

What can I say - restic doesn’t have that as it is now (nor does e.g. Git), and I’m not seeing anyone writing it.

Hey there, I just saw this topic and wanna post my thoughts on it too:

As others have mentioned before, the chance of a hash collision is not very high. But from a probabilistic perspective it grows with the count of blocks in a randomly generated set of data, so it grows with the repo size. It has been stated here by @Eli6 that it is sha256, which means a collision probability of 1 in 36^64. The probability of at least one collision among x blobs is therefore roughly 1-(1-36^(-64))^x. We see that unless x is very high, this is almost zero. (Note that Wolframalpha does not want to compute the x for 50% probability.) To give a sense of scale: it is on the order of the probability that one mole of atoms all decide to jump up and hover for a very, very short time. (Which violates the second law of thermodynamics, but that is also a statistical law.)
But let us consider the case where this improbability becomes reality: then somebody is not very happy.

I believe there is an easy solution (please correct me if I’m wrong) to shift this probability even further towards zero. What if each blob had a second checksum to double-check whether such a collision happened or not? I think that, for this purpose, a lightweight algorithm which is not very strong (like MD5) would suffice. But as long as those algorithms share some similar math, it could also happen that both of them collide at once…

But it has to be noted that there is no such thing as absolutely “sure”. Even a byte-wise comparison is not certain, due to various extra circumstances such as transistor errors, bit flips in hardware and so on. I do not want to calculate that (maybe I will do it later), but under some circumstances a hardware error could be more probable than a hash collision.

Thanks for doing some calculations. My feeling is that if there were a real fear of collisions with sha256, it would be more straightforward to just go to sha512 rather than set up a safety hash. My intuition tells me that the probability of getting a double collision with sha256 (say, with some different salt or something) would simply be the same as having a single collision with sha512. Since md5 is only 128 bits, the chance of a file pair colliding under both sha256 and md5 would actually be orders of magnitude greater than under sha512 alone. And I believe sha512 is supposed to be relatively efficient on modern processors.
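To put rough numbers on that intuition (in the idealized model where each hash behaves like an independent random function, which of course ignores MD5’s real-world weaknesses):

```python
# Idealized per-pair collision probabilities, treating each hash as a random
# function over its output space (structural weaknesses of MD5 are ignored).
p_sha256 = 2.0 ** -256
p_md5 = 2.0 ** -128
p_sha512 = 2.0 ** -512

p_combined = p_sha256 * p_md5            # = 2**-384
print(f"sha256 + md5 together: {p_combined:.1e}")               # ~2.5e-116
print(f"sha512 alone:          {p_sha512:.1e}")                 # ~7.5e-155
print(f"ratio (combined/sha512): {p_combined / p_sha512:.1e}")  # ~3.4e+38 = 2**128
```

So the double-hash scheme buys about 384 bits of effective collision resistance versus 512 bits for sha512, which is where the “orders of magnitude” come from.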

But since I do not have a real fear of collisions with sha256, I just happily use the current version of restic, and worry about other stuff instead :slight_smile:

Hey there, I just realised that I made an error in my calculation: every 36 should actually be a 16.
But that does not change much in my argument; the comparison is now a bit less than one mole, but still not far off. (Huge numbers and so on.)
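For what it’s worth, here is a quick sketch of the corrected numbers, using the standard birthday-problem approximation for the chance of at least one collision among x blobs (which counts all pairs, so it is a bit more pessimistic than the formula in my earlier post):

```python
# Birthday-bound estimate for the chance of at least one collision among x
# blobs in a space of N = 16**64 = 2**256 possible SHA-256 values.
# For small probabilities, 1 - exp(-x*(x-1)/(2*N)) is roughly x**2 / (2*N).
N = 16 ** 64

def p_collision(x: int) -> float:
    return x * (x - 1) / (2 * N)

# Assuming very roughly one blob per MiB of data (so 10**9 blobs is already a
# petabyte-scale repo):
for blobs in (10**9, 10**12, 10**15, 10**18):
    print(f"{blobs:>20} blobs: p = {p_collision(blobs):.1e}")
```

Even the last line comes out around 4e-42, still astronomically far from anything one could ever observe.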


I’ve looked up some figures (assuming no ECC) for bit flips in RAM:
According to statistics - Cosmic Rays: what is the probability they will affect a program? - Stack Overflow, the chance of a bit flip is 1 error per 256 MB per month.
I believe it is quite obvious that the chance of a bit flip that produces an error in hardware is considerably bigger than the chance of a hash collision.
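Putting that figure next to the collision estimate from above (machine size, uptime and repo size are of course just illustrative assumptions):

```python
# Very rough comparison: cosmic-ray bit flips in non-ECC RAM (figure from the
# Stack Overflow answer above) vs. the birthday-bound collision estimate.
ram_mb = 16 * 1024                       # a hypothetical 16 GB machine
months = 12                              # one year of uptime
expected_flips = ram_mb / 256 * months   # ~768 bit flips per year

N = 16 ** 64                             # SHA-256 output space
blobs = 10 ** 12                         # a generously large repo
p_collision = blobs * (blobs - 1) / (2 * N)

print(f"expected bit flips per year: {expected_flips:.0f}")   # 768
print(f"collision probability:       {p_collision:.1e}")      # ~4.3e-54
```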


@torfason The reason why I suggested a control hash is that, in my view, two hash calculations are less likely to both produce a collision than one algorithm alone. (But maybe I should take a look at the math…)


But after this math, I come to the conclusion that we’re all unsafe and we should dismiss the idea of reliable computers.


2 Likes

I am honestly surprised by the participation in this topic. Thanks for that!

At this point I would like to make it clear that my aim was not to make the existing hashing more secure, but to create an option so that I do not have to trust the process itself - not only the reliability of SHA-256 but also the implementation (!). So I would like an option that compares the stored data with the original source by reading-back-and-comparing.

That would mean that there would be no need to discuss probabilities any further…

I’m not quite sure where you expected this conversation to go. Every decision in life is based on probabilities. Integrity-validation, especially so.

Restore the last snapshot and compare, in whatever way you desire. When you discover a mismatch, and can verify that it’s due to a hash collision, write a paper on it. :slight_smile:

1 Like