What does "restic check" actually do?

I have my data backed up to Backblaze B2.

What does restic check actually do?

Backblaze and other cloud storage providers can return a hash of a stored object via their API. Does restic compare these provider-side hashes against the expected hashes stored in the local cache?

What I’m hoping it does is rely on the cloud storage provider’s built-in hashing to confirm my backup is sound.

Thanks

By default, restic check loads the metadata for all snapshots and checks that all data referenced by the snapshots is available. When --read-data is passed, it also downloads all data and verifies that it is correct. A full download may be too expensive, so we’ve added --read-data-subset, which allows reading only a specified percentage of the data.

As for comparing against the provider’s hashes: no, mainly because the providers don’t support computing the hash of the data with the hash function restic uses (SHA-256).

I’m pushing 40 TB to Backblaze B2 using restic. Uploading that will take about 9 months. If I understand correctly, restic check only verifies the presence of the data on B2. If I want to confirm the data is what I expect it to be, I’d have to run restic check --read-data and endure a 40 TB download and hash computation…

In my humble opinion, restic should leverage the hashing functionality cloud providers include in their cloud storage solutions. These exist because, at scale, downloading all your data to verify its integrity is far too slow, far too expensive, and really doesn’t make sense.

I appreciate this means restic would need to compute hashes using the same function as the cloud storage back-end, but downloading everything to verify integrity is excessively wasteful and impractical.

Is restic calculating a SHA-1 checksum when uploading to B2? Backblaze recommends it as a best practice here: https://help.backblaze.com/hc/en-us/articles/218020298-Does-B2-require-a-SHA-1-hash-to-be-provided-with-an-upload-

Thanks

The nice thing about the way restic does its checking is that the repository is totally portable. (That’s really valuable, so you can back up to multiple locations.)

I believe a SHA-1 is already used (and basically required?) when uploading data to B2, to ensure the upload is successful and correct. Beyond this, are you expecting the files to be modified after you upload them?

If restic calculated the SHA-1 before sending, sent the data, and confirmed the SHA-1 after the transfer, I’d be willing to trust that the data is safely in B2. When I run a check, I’d expect restic to request the SHA-1 from B2 and compare it with what restic expects.

I’ve been looking through the library restic uses to push data to Backblaze, and from what I can see it does not appear to calculate or even include the optional SHA-1 header in its POST to B2.

This leaves me concerned that:
1. We don’t know that data sent to B2 was in fact received correctly
2. The only way to check would be to download the entire backup and calculate and compare its hashes

I haven’t looked into restic’s integration of B2, but I know that rclone (https://rclone.org) uses the SHA-1 when uploading, so you could use restic’s rclone backend to upload to B2.

What I meant to say in my post above was: since restic uses SHA-256 for the file names in the repo, it has no knowledge of what the SHA-1 hash of the data looks like.

That’s not the case: the library we’re using (https://github.com/kurin/blazer) computes the SHA-1 hash of all data it sends to B2. As far as I’ve understood their API, including the SHA-1 hash of the content in the HTTP request header is even mandatory. I suppose they compare the hash with what the server received.
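To illustrate what sending a SHA-1 along with an upload looks like, here is a minimal Python sketch. It is not what blazer or restic actually does internally; the function name `b2_upload_headers` is made up for this example, and the header names reflect my reading of Backblaze's `b2_upload_file` documentation, so treat them as an assumption and check the current API docs before relying on them.

```python
import hashlib
from urllib.parse import quote


def b2_upload_headers(upload_token: str, file_name: str, payload: bytes) -> dict:
    """Build request headers for a single upload (hypothetical helper).

    The SHA-1 is computed locally over the exact bytes that will be sent,
    so the server can reject the upload if what it received differs.
    """
    sha1_hex = hashlib.sha1(payload).hexdigest()
    return {
        "Authorization": upload_token,          # assumed: per-upload auth token
        "X-Bz-File-Name": quote(file_name),     # assumed: percent-encoded name
        "Content-Type": "application/octet-stream",
        "Content-Length": str(len(payload)),
        "X-Bz-Content-Sha1": sha1_hex,          # assumed: hex SHA-1 of the body
    }
```

Note that this only protects the transfer itself. It says nothing about whether the data is still intact months later.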

Downloading the data and comparing hashes is the only way to ensure the data is accurate. For now, using restic you have to decide whether or not you trust the service (B2 in this case) to store your data. If you trust it, then just verifying that the files are there (this is what restic check does) is sufficient. If you don’t trust it, you can use check --read-data and actually retrieve the data to compute the hash locally.
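Because restic uses the SHA-256 hash as the file name in the repo, the core idea of the local check can be sketched in a few lines of Python. This is a simplified illustration only: the function name `file_matches_name` is made up, it assumes the file has already been downloaded to local disk, and restic's real check does more than compare a file's hash with its name.

```python
import hashlib
from pathlib import Path


def file_matches_name(path: Path, chunk_size: int = 1 << 20) -> bool:
    """Return True if the SHA-256 of the file's content equals its file name.

    The file is read in chunks so large files do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == path.name
```

The expensive part is not the hashing but getting the bytes to where the hash is computed, which is why this requires a full download.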

What you’re proposing is something in between: don’t download the data from the service, but trust it enough to compute a hash over the data. That’s not implemented right now. As far as I know there’s no way to tell B2 to compute a fresh hash, rather than return a hash cached in a database somewhere.

One practical aspect that would need to be solved is: Where is the SHA-1 hash of the data stored? We don’t have a place for that right now.

If we were to add such a feature, it’d need to accommodate all backends which may support a server-side hash. Likely this means computing not only SHA-1 but also MD5 and maybe others.
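For a sense of what computing several hashes would involve, here is a rough Python sketch that feeds the same data through multiple hash functions in a single pass. It is not part of restic; the function name `multi_digest` and the choice of algorithms are assumptions for illustration, and the question of where to store the extra hashes remains open.

```python
import hashlib
from typing import BinaryIO, Dict, Iterable


def multi_digest(
    stream: BinaryIO,
    algorithms: Iterable[str] = ("sha256", "sha1", "md5"),
    chunk_size: int = 1 << 20,
) -> Dict[str, str]:
    """Read the stream once and return a hex digest per algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        # Each chunk is read once and fed to every hasher.
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

The data is only read once, so the extra cost is CPU time for the additional hash functions, not extra I/O.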

I hope this answers your questions.
