Restic check subset selection

Hello,

To verify the integrity of my backups, I’m using the check command with the --read-data-subset=10% flag.
I know this selects a random subset of all packs to check, and that there’s no guarantee every pack will eventually be checked, however many times I run it.

Now, I also know about the n/t syntax, but even though the documentation doesn’t say so, I have the feeling it could miss packs in the following situation:

  • Invocation 1, 1000 packs in the repository, called with 1/10 so the first 100 are checked
  • New backup, there are now 1200 packs in the repository
  • Invocation 2, called with 2/10, so packs 121 to 240 are checked, missing packs 101 to 120

Am I right in my assumption here?

To avoid this situation, would it be possible to amend the x% method with a directed random selection? By that I mean that each pack gets assigned a check counter and that only those with the lowest value are considered for random selection on each invocation. This would give this situation:

  • Invocation 1, 1000 packs in the repository, called with 10%, the selected packs get their counter set to 1.
  • New backup, 1200 packs are now in the repository, the newly added packs get a counter set to 0.
  • Invocation 2, called again with 10%, only those with their counter set at 0 are considered for inclusion in the random draw.

If there are not enough packs with their counter set at the lowest value, then those with their value set to the lowest value + 1 are considered and so on until enough packs are selected (or all remaining packs are).

With this, I would get the assurance that all packs are checked at least once before they are checked again.
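
To make the idea concrete, here is a minimal sketch of the selection I have in mind. It is not restic code: the `counters` mapping, the `directed_selection` function and the `pack-NNNN` names are all hypothetical, and it assumes the counters could be stored somewhere between invocations.

```python
import random

def directed_selection(counters, percent, rng=random):
    """Pick percent% of packs, preferring those checked the fewest times.

    counters maps a pack ID to how often it was checked (hypothetical state;
    restic stores nothing like this). Chosen packs get their counter bumped.
    """
    want = len(counters) * percent // 100

    # Group packs by their current check counter.
    by_level = {}
    for pack, count in counters.items():
        by_level.setdefault(count, []).append(pack)

    chosen = []
    for level in sorted(by_level):
        need = want - len(chosen)
        if need <= 0:
            break
        candidates = by_level[level]
        if len(candidates) <= need:
            # Not enough packs at this level: take them all, then move up.
            chosen.extend(candidates)
        else:
            chosen.extend(rng.sample(candidates, need))

    for pack in chosen:
        counters[pack] += 1
    return chosen


rng = random.Random(0)

# Invocation 1: 1000 packs, none checked yet.
counters = {f"pack-{i:04d}": 0 for i in range(1000)}
first = directed_selection(counters, 10, rng)

# New backup: 200 more packs, never checked.
counters.update({f"pack-{i:04d}": 0 for i in range(1000, 1200)})

# Invocation 2: only packs still at counter 0 are drawn.
second = directed_selection(counters, 10, rng)
```

With this scheme the counters of any two packs never differ by more than one, which is the "checked at least once before being checked again" property.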

Please do not hesitate to comment on this; I’m not sure I have covered every corner case here.

No. With the n/t syntax, the packs to check are determined by the pack ID, which does not change. If you run a backup and then t checks using 1/t through t/t, all packs from that backup are always checked, independent of other backup runs. The situation changes if you run a prune in between.
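
As a rough illustration of why the grouping is stable, here is a small sketch. It assumes the group is derived from the first byte of the pack ID modulo t; treat that rule as an assumption for illustration, not as restic's exact implementation. The helper names and the generated IDs are made up.

```python
import hashlib

def bucket_of(pack_id: str, total: int) -> int:
    # Assumed rule: first byte of the hex pack ID, modulo total, made 1-based.
    return int(pack_id[:2], 16) % total + 1

def select_packs(pack_ids, n, total):
    # Packs that a check with n/total would read under the assumed rule.
    return {p for p in pack_ids if bucket_of(p, total) == n}

def fake_pack_ids(count, label):
    # Stand-ins for real pack IDs, which are SHA-256 hashes in hex.
    return {hashlib.sha256(f"{label}-{i}".encode()).hexdigest() for i in range(count)}

old = fake_pack_ids(1000, "backup-1")
before = select_packs(old, 2, 10)

new = fake_pack_ids(200, "backup-2")
after = select_packs(old | new, 2, 10)

# Old packs keep their group; the new backup only adds members to it.
stable = (after & old) == before

# Running 1/10 through 10/10 covers every pack present during the run.
covered = set().union(*(select_packs(old | new, n, 10) for n in range(1, 11)))
```

Because the group depends only on the pack ID, adding packs never shifts existing packs into another group, unlike a positional split such as "the first 100 packs".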

About your proposal: there is no information saved about whether or when a given pack was checked. I think this will never be added as check is designed to be a read-only command, i.e. it should not modify the repository.

Ah, that’s exactly what I do, always prune after a backup.

That’s too bad but I understand.

If you run a backup and then t checks using 1/t through t/t, all packs from that backup are always checked, independent of other backup runs.

In a staggered cycle (e.g., the 1/10 check has already run), if a new backup is created before the 2/10 check, the new packs that happen to fall into the already-checked group 1/10 will remain unchecked until 1/10 is run again.

Is this understanding of the statelessness of check --read-data-subset=n/t correct?

Yes, this is correct.
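
For anyone who wants to see the effect, a small simulation of that staggered cycle follows. It assumes the group is the first byte of the pack ID modulo t (an assumption for illustration, not necessarily restic's exact rule) and uses generated stand-in IDs rather than a real repository.

```python
import hashlib

TOTAL = 10

def bucket_of(pack_id, total):
    # Assumed rule: first byte of the hex pack ID, modulo total, made 1-based.
    return int(pack_id[:2], 16) % total + 1

def fake_ids(count, label):
    # Stand-ins for real pack IDs (SHA-256 hashes in hex).
    return {hashlib.sha256(f"{label}-{i}".encode()).hexdigest() for i in range(count)}

def check(repo, n):
    # Packs read by a stateless check of group n/TOTAL at this moment.
    return {p for p in repo if bucket_of(p, TOTAL) == n}

repo = fake_ids(1000, "first-backup")
checked = set()

# Cycle step 1: check group 1/10.
checked |= check(repo, 1)

# A new backup adds packs before step 2.
new_packs = fake_ids(200, "second-backup")
repo |= new_packs

# Steps 2..10 of the same cycle.
for n in range(2, TOTAL + 1):
    checked |= check(repo, n)

# New packs that landed in group 1 were never read in this cycle.
missed = repo - checked

# They are only picked up when 1/10 runs again in the next cycle.
checked_next = checked | check(repo, 1)
```

Here `missed` contains only packs from the second backup that fall into group 1, and it is emptied by the next 1/10 run.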