By default it loads the metadata for all snapshots and checks that all data referenced by the snapshots is available. When --read-data is passed, it also downloads all data and verifies that it is correct. Since downloading everything may be too expensive, we’ve added --read-data-subset, which allows reading a specified percentage of the data.
No, mainly because they don’t support computing the hash of the data with the hash function restic uses (SHA-256).
I’m pushing 40 TB to BackBlaze B2 using restic. Uploading that will take about 9 months. If I understand correctly, verifying it using restic check just checks for the presence of the data on B2, but if I want to confirm the data is what I expect it to be, I’d have to restic check --read-data and endure a 40 TB download and hash computation…
In my humble opinion, restic should leverage the hashing functionality cloud providers include in their cloud storage solutions. These exist because, at scale, downloading all your data to verify its integrity is far too slow, far too expensive, and really doesn’t make sense.
I appreciate this means restic would need to compute hashes using the same function as the cloud storage back-end, but downloading everything to verify integrity is excessively wasteful and impractical.
The nice thing about the way restic does its checking is that the repository is totally portable. (That’s really valuable, so you can back up to multiple locations.)
I believe a SHA-1 is already used (and basically required?) when uploading data to B2, to ensure the upload is successful and correct. Beyond this, are you expecting the files to be modified after you upload them?
If restic were calculating the SHA-1 prior to sending, then sending, and confirming the SHA-1 post-transfer, I’d be willing to trust that the data is safely in B2. When I run a check, I’d expect restic to request the SHA-1 and compare it with what restic expects.
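The upload flow described here — hash locally, send, let the server verify — can be sketched as follows. This is an illustrative sketch, not restic’s actual code; `prepare_b2_upload` is a hypothetical helper, though the `X-Bz-Content-Sha1` header name is the one the B2 native upload API uses.

```python
import hashlib

def prepare_b2_upload(data: bytes) -> dict:
    """Compute the SHA-1 locally before upload so the server can
    verify the transfer arrived intact. (Hypothetical helper; the
    header name matches the B2 native upload API.)"""
    return {
        # B2 rejects the upload if the received bytes don't match this hash
        "X-Bz-Content-Sha1": hashlib.sha1(data).hexdigest(),
        "Content-Length": str(len(data)),
    }

headers = prepare_b2_upload(b"example blob")
```

A later check could then ask the service for the stored hash and compare it with the value computed here, without re-downloading the data.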
I’ve been looking through the library restic uses to push data to Backblaze, and from what I can see it does not appear to calculate or even include the optional SHA-1 header in its POST to B2.
This leaves me concerned that:
1 - We don’t know that data sent to B2 was in fact received correctly
2 - The only way to check would be to download the entire backup and calculate/compare its hashes
What I meant to say in my post above was: since restic uses SHA-256 for the file names in the repo, it has no knowledge of what the SHA-1 hash of the data looks like.
That’s not the case: the library we’re using (https://github.com/kurin/blazer) computes the SHA-1 hash of all data it sends to B2. As far as I’ve understood their API, including the SHA-1 hash of the content in the HTTP request header is even mandatory. I suppose they compare the hash with what the server received.
That is the only way to ensure the data is accurate. For now, using restic you have to decide whether or not you trust the service (B2 in this case) to store your data. If you trust it, then just verifying that the files are there (this is what restic check does) is sufficient. If you don’t trust it, you can use check --read-data and actually retrieve the data to compute the hash locally.
What you’re proposing is something in between: Don’t download the data from the service, but trust it enough to compute a hash over the data. That’s not implemented right now. As far as I know there’s no way to tell B2 to compute a fresh hash, and not use a hash cached in a database from somewhere.
One practical aspect that would need to be solved is: Where is the SHA-1 hash of the data stored? We don’t have a place for that right now.
If we were to add such a feature, it’d need to accommodate all backends which may support a server-side hash. Likely this means computing not only SHA-1 but also MD5 and maybe others.
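Computing several digests need not mean several passes over the data. A sketch under the assumption that restic would hash each file once while uploading; `multi_digest` is a hypothetical helper, not part of restic.

```python
import hashlib
import io

def multi_digest(stream, algorithms=("sha1", "md5", "sha256"), chunk_size=1 << 20):
    """Compute several digests in a single pass over a stream, so
    supporting different per-backend hashes wouldn't require
    re-reading the files. (Sketch only; restic does not do this.)"""
    hashers = {name: hashlib.new(name) for name in algorithms}
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for h in hashers.values():
            h.update(chunk)  # feed the same chunk to every hasher
    return {name: h.hexdigest() for name, h in hashers.items()}

digests = multi_digest(io.BytesIO(b"hello"))
```

The remaining cost is CPU time for the extra hash functions, plus the open question from above of where to persist the results.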
My apologies for reviving this 3-year-old topic, but I believe this is still relevant and would be a desirable feature for restic to have.
Obviously the ideal situation would be to have full access to the remote storage server and have restic installed there directly, which would allow running restic check --read-data on the data directly. However, with most (not to say all) cloud storage providers that is simply not possible, and thus I believe there is some value in the “in between” solution where restic would offer to leverage the hash functions provided by the various cloud providers to do integrity checks of the data. I agree there are some technical questions to be sorted out (e.g. as @fd0 mentioned: where to store the additional hash?), but those could be resolved if this were something the developers were to support in general.
rclone maintains a list of the hash functions provided by the various cloud storage providers: Overview of cloud storage systems
In most cases, this comes down to either SHA1 or MD5.
Well, another “in between” solution would be to choose a provider that supports (in some way) the hash restic is using (or ask it to support it). If you get the SHA256 of the repo files, all you have to do is compare them with the file names.
A remote shell with sha256sum or something equivalent will also work - no need to run restic directly. Another example would be to use AWS S3 and calculate the SHA256 cheaply using some lambda function or a virtual machine.
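The server-side verification suggested here is straightforward, because restic names the files under data/ after the SHA-256 of their (encrypted) contents. A minimal sketch of what such a remote check could do; `verify_pack_names` is a hypothetical name.

```python
import hashlib
from pathlib import Path

def verify_pack_names(repo_dir: str) -> list:
    """Hash every file under the repo's data/ directory and compare
    the digest against the file name, which restic sets to the
    SHA-256 of the file's encrypted contents. Returns the names of
    files whose contents don't match. (Sketch, not a restic tool.)"""
    bad = []
    for path in Path(repo_dir, "data").rglob("*"):
        if not path.is_file():
            continue
        h = hashlib.sha256()
        with path.open("rb") as f:
            # hash in chunks so large pack files aren't loaded whole
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != path.name:
            bad.append(path.name)
    return bad
```

Run next to the storage (remote shell, VM, or lambda), this detects bit rot in pack files without downloading the repository - though, as noted below, it cannot replace the decryption and consistency checks of check --read-data.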
But note that check --read-data performs even more checks (decryption, consistency checks), which you never get by only comparing hashes.
Thanks for your reply and further insight into this @alexweiss. That is indeed a quite viable “in between” solution and one I have been looking into. Aside from the AWS S3 option you mentioned, I found that pCloud supports SHA256 checksums for data in their European data region/API (for details see https://docs.pcloud.com/methods/file/checksumfile.html).
The issue, as @fd0 pointed out, is that this is extremely unlikely to even do anything useful:
Backblaze does not disclose where the hash they return comes from. Considering that you can store gigantic files in B2 and fetch the metadata nearly instantly, it is extremely unlikely that they are hashing the contents of the file every time you request metadata. It is much more likely that the hash has been stored with the rest of the file’s metadata and blindly returned when you list the file.
This does not guarantee in any way that if you fetch the file, you will receive data with the same hash that B2 claims. The file could be damaged on B2 – or the hash could get damaged, while the data is fine.
When uploading to B2, it is mandatory to provide a SHA-1 hash of the file/piece being uploaded and the server will reject the upload if the data does not match. The B2 client restic uses already does this. That the upload is successful is proof that the server has hashed the data on its end and verified the hash.
At this point, the hash returned by B2 in the metadata is entirely useless, for the reasons I mentioned above. Consider that the filename itself is a hash, but the name is not a guarantee that the file actually contains data with that hash. The only way to be sure the file is not damaged is to hash the contents and see if it matches – and this is exactly the same scenario as with the hash provided by B2; if you don’t hash it yourself, you can’t be sure there is no damage.
I think it would be useful, if this discussion is to get anywhere, to give an example of exactly the kind of damage that this additional type of check would be expected to detect. I would submit to you that there is no damage that this check would detect, except corruption of the hash itself as recorded in server-side metadata.