Read-data, read-data-subset use cases

Hello.

Apart from a couple of posts that touch on the subject, I could not really determine whether --read-data is useful or recommended against a cloud storage backend.

What is considered best practice here? I understand that this command could be useful to keep bits fresh on disk and to exercise the drive so bad sectors don’t become a problem (the drive would reallocate sectors via its SMART functionality).

However, cloud storage tends to be redundant, so I’m not sure what to think.

I’m also confused about --read-data-subset and its recommended usage.

Let’s say I have a huge archive, and to limit download costs (and time), I run the following command daily:

restic check --read-data-subset $(date +%j)/365 && \
  restic backup

As long as restic snapshots latest shows my backups are up to date, I know the prerequisite restic check command has lowered the odds of a corrupted restic archive.

How useful would this strategy be? In reality, adding content to the archive means I won’t have checked 100% of its content even after a year has passed. In other words, I could still have a corrupted archive without realizing it.

What is the recommended approach here?

In my experience, restic checks are too problematic to run outside the main backup script. As the repo grows, it stays locked for longer and longer, and check prevents backups from actually running; or vice versa, the backup script prevents restic check from running because of the lock, and so on.

Thanks for any feedback.

I don’t think it’s useful in that context; cloud storage providers will scrub on their own.

It’s also very unlikely to be useful if your repository is stored locally on a redundant RAID volume and your environment automatically scrubs redundant RAIDs. (Debian will automatically scrub all redundant software RAIDs on Sunday morning, for example.) In this case, the scrub process would be the first to notice any corruption, and it doesn’t require an exclusive lock on your repository.

Thanks for your reply.

Any other feedback? @fd0 I’d appreciate your input 🙂

I personally think it’s useful even in a cloud context. You’re verifying that the cloud service gives you back exactly the data you sent to it, which is a good thing to verify from time to time. After all, the data could have been corrupted at the service; you’ll only discover that if you check it regularly.

We’ve seen bugs in backends (and their respective client implementations) which were only discovered because people tried to read back the data. We’ve also seen middleboxes mangling data before passing it on, destroying it in the process. None of this would have been discovered were it not for --read-data.

By the way, the second parameter of --read-data-subset n/m (the m) only makes sense for m <= 256. We should probably add that to the manual and print an error if m > 256. Here’s the code (pack[0] is a byte):
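A simplified sketch of that selection logic (illustrative, not the verbatim source; bucket and totalBuckets come from parsing --read-data-subset n/m):

packs := restic.IDSet{}
for pack := range chkr.GetPacks() {
    // pack is a 32-byte SHA-256 ID, so uint(pack[0]) is 0..255; since only
    // this first byte is inspected, any m > 256 leaves buckets that can
    // never match a pack
    if uint(pack[0])%totalBuckets == bucket-1 {
        packs.Insert(pack)
    }
}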

Just be careful when using this in a cloud context, as you have to pay the transfer fees to download essentially your entire repository. (If you are using S3, it may be cheaper to spin up a short-lived EC2 instance in the same region and perform the check there. An m5.large is $0.096 per hour, which is just a hair more than the cost to transfer 1 GB of data out to the Internet. You could probably check TBs of data on an m5.large in an hour.)

Thanks @cdhowie, costs could indeed be a concern.

@fd0 If I understand correctly, at that level the restic repo is represented by a set of pack IDs, each a slice of bytes (GetPacks and IDSet)? If that’s correct, and if it’s not possible to check by reading only, say, 1/1000th of the archive volume with the restic binary, then I still don’t understand this feature: why was it created, and how should it be properly used?

Compared to a plain --read-data, where the trade-off between reading and not reading is much clearer to me.

Also, from a practical point of view, does reading 1/10th of the data mean I would be able to recover at least 1/10th of the repo’s data, or could a minute portion of bad data (say a hundredth of a percent) corrupt the whole archive? If that’s the case, I would understand this feature even less.

Thanks for your time, and keep up the good work! 🙂

A “pack” is a file in the repo whose name is the hexadecimal representation of 32 bytes (the SHA-256 hash of its content). So uint(pack[0]) is the first byte, which can have values between 0 and 255. GetPacks() returns a list of all “pack” files in the repo, and the loop in the snippet above adds to the list packs those packs for which the first byte has some property based on the parameters passed to --read-data-subset. All files in that list are then downloaded and checked, in contrast to --read-data, which just reads all files.
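To make that concrete with a made-up pack name (a runnable sketch; the name is hypothetical):

package main

import (
    "encoding/hex"
    "fmt"
)

func main() {
    name := "2f64a83b"                 // first hex digits of a hypothetical pack file name
    b, _ := hex.DecodeString(name[:2]) // the first byte: 0x2f = 47
    const m = 10                       // as in --read-data-subset n/10
    // 47 % 10 = 7, so this pack is read on the run where n = 8
    fmt.Printf("first byte %d -> checked when n = %d of %d\n", b[0], uint(b[0])%m+1, m)
}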

In order to support a finer granularity than 1/256 we would need to change this code a bit.
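Purely as an illustration (this variant does not exist in restic), inspecting the first two bytes instead of one would allow up to 256*256 = 65536 buckets:

// hypothetical variant of the condition above, supporting m up to 65536
prefix := uint(pack[0])<<8 | uint(pack[1])
if prefix%totalBuckets == bucket-1 {
    packs.Insert(pack)
}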

The idea for adding this feature was that, on the one hand, reading all data in the whole repository (via --read-data) is expensive (time, money) for cloud-based backends. On the other hand, not reading back the data at all before you really need it during a restore isn’t great either. So we added --read-data-subset, which allows reading, and therefore checking, some of the files in the repo. You can say “today, please read one tenth of all files in the repo” and tomorrow “read the next tenth of all files in the repo”, and after ten days you’ve read and checked almost all files. The way the files to be checked are chosen guarantees that over time (almost) all files are read, modulo some minor details. Let me know if you’d like to know them 😛
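One hypothetical way to drive such a cycle from a daily job, deriving n from a running day counter (the flag is real; the wrapper is just a sketch):

package main

import (
    "fmt"
    "time"
)

func main() {
    const m = 10                     // read one tenth of the packs per day
    day := time.Now().Unix() / 86400 // days since the Unix epoch
    n := day%m + 1                   // cycles through 1..m, visiting every bucket once per m days
    fmt.Printf("restic check --read-data-subset=%d/%d\n", n, m)
}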

Does this answer your question?

Thanks for the detailed explanation @fd0, the situation is much clearer to me now. I can only assume the first byte is chosen as an “anchor” so that the selection stays stable across later read checks, rather than using an increasing global counter (i%total).

The only remaining confusion for me is what repo changes over time imply for the validity of the read checks.

For simplicity, let’s say I would check + read 1/256th of the repo once a day, for the next 256 days (the first byte of a pack name can take 256 values):

  • day 1: check + read packs whose SHA-256 name has first byte = 0
  • day 2: check + read packs whose SHA-256 name has first byte = 1
  • day 3: check + read packs whose SHA-256 name has first byte = 2
  • etc.

Meanwhile, the repo changes: on day 30, 3 files get inserted (or changed) and their SHA-256 names become 0x0AAABBBCCC*, 0x1DDDEEEFFF*, 0x2ABABCCC*.

Those files then won’t get read until the next “backup loop” comes around, up to 256 days later. If a file’s content keeps changing, it may never be read, and the higher the value of m in read-data-subset n/m, the lower the chance it has of being read if it keeps changing. Therefore, the lower m is, the “better”. Is this correct?

Lastly, let’s say I have some kind of corruption in those 3 files (a client bug, a server bug, or whatever): what are the implications of bad data getting into the repo that way?

If my repo contains 1 million files after 256 days, and 3 of those files are corrupted/unreadable and won’t have been checked + read, can I still recover the other 999,997 files? Or are there other side effects that come into play and make a potential restore operation problematic?

Thanks for your feedback. I’m going into all these details because I want to make my backups as rock solid as possible, and I want to understand all the positives and negatives of this feature.

To be clear, files in a restic repository never change, but they can be rewritten/consolidated by a prune operation to discard unused objects.

That’s correct. I believe this is the case that @fd0 meant when he said “over time (almost) all files are read.” (Emphasis mine.)

“Better” in the sense of more complete validation of the repository contents, yes. It does mean that more data is fetched from the repository (which is possibly not free depending on your backend) and more CPU time is spent verifying the data, which further translates to more time the repository is exclusively locked. This could be a problem for busy repositories.

The implications vary depending on what exactly is damaged. Data packs contain two kinds of objects: data blobs, which are pieces of a file’s raw contents, and tree blobs, which encode information about the contents of a directory.

If a data blob is damaged, the implications are simple: any file that contains that blob cannot be fully restored.

If a tree blob is damaged, the implications are a bit more complex: if that tree contains only other trees, then restic recover might be able to build new snapshots using each orphaned tree as the root. It’s a bit of a hassle to recover from these snapshots, but the data wouldn’t be lost.

However, in both cases, if the affected data is still present on some backup client, deleting the damaged packs and running restic rebuild-index (to drop the corrupted objects from the index that clients use for deduplication) can allow future backups to “heal” the repository by reintroducing those objects.

@cdhowie is right, that’s what I meant. Pack files added with a first byte lower than the one being checked on the current day get checked the next time around. So you’ll almost always have “unchecked” files in the repo, but the algorithm guarantees that it takes at most one full cycle to get them checked. By tuning the length of the cycle (the second parameter to --read-data-subset) you can control how long that takes. Restic only ever adds and removes files; files in the repo are never changed (otherwise we would have to rename them).

If you want to make sure all files in the repo are valid at a certain point in time, you need to load and hash all files via check --read-data.

Thanks very much for your detailed feedback @cdhowie and @fd0, much appreciated.

As a general rule, I have already decided to use many smaller repos rather than a few big ones, as I found that big repos tend to become more and more painful to use. Although restic does a fantastic job, at some point size or file count becomes a problem.

With that in mind, I’ll probably avoid read-data-subset as much as possible, favoring the simpler read-data coupled with a file organization that keeps me from reading the same, mostly unchanged, data again and again.
