Understanding `--read-data-subset=n/t`

First of all, thank you all for the information you shared in this topic.

I’ve read the documentation here: Checking integrity and consistency

However, it is unclear to me how this option works:

--read-data-subset=n/t

I’m interested in it because I’d like to do a deep check (that is, using --read-data) on my backup, but a complete one takes too many resources to run after every execution. So it would be great to split it over multiple days. I wonder if this option guarantees that by doing:

--read-data-subset=1/7 on Monday

--read-data-subset=2/7 on Tuesday

….

--read-data-subset=7/7 on Sunday

all files in the backup are covered. I mean, e.g., on Tuesday, how can Restic know what files have already been checked on Monday?

You are right that the set of --read-data-subset options you wrote will cover and check the entire repository over time, yes.

What happens is that restic takes the entire repository, “mentally” splits it up into seven pieces (t), and then checks only the one piece out of those seven pieces, that you mentioned in the number before the / (n). On Monday the first piece, on Tuesday the second piece, and so on. At the end of the week, you will have checked all seven pieces of the repository.

Restic doesn’t know what you checked in previous runs. All it knows is that this time, you want to check piece n out of t pieces, and then it does that. Because it likes you and wants to make you happy :slight_smile:

Can you tell us which parts of the documentation about the --read-data-subset syntax you felt were hard to understand, and whether you have any ideas about how to improve it?

In addition, please note that this is deterministic. The files to check are chosen based on their hashes (which happen to also be their names in hex). So, for example, --read-data-subset=11/16 will read all files starting with a, as the 16 possible values are 0…f (hexadecimal). As these hashes are random, each piece is a very good approximation of an even share of the files.

It means that if you run [1..7]/7, you have a 100% guarantee that by Sunday you will have checked all files that were present on Monday, plus some (but probably not all) files created during that week.
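
To illustrate why no state is needed between runs, here is a minimal sketch of a stateless selection rule. It assumes a file's piece is derived from the first byte of its ID modulo t; that rule is an assumption made for illustration, not something taken from the restic documentation, and `piece_of` and the shortened IDs are hypothetical.

```shell
#!/usr/bin/env bash
# Assumed rule (for illustration only): piece = (first byte of the ID mod t) + 1.
piece_of() {
  # $1 = hex ID, $2 = total number of pieces (t)
  first_byte=$(printf '%d' "0x$(printf '%s' "$1" | cut -c1-2)")
  echo $(( first_byte % $2 + 1 ))
}

# Hypothetical IDs, shortened for readability.
for id in 00a1f3 3fc210 7b09de a4e671 ffd012; do
  printf '%s -> piece %s/7\n' "$id" "$(piece_of "$id" 7)"
done
```

Because the piece depends only on the ID and on t, every file falls into exactly one of the t pieces, and the same file lands in the same piece on every run.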


This is a very good answer, thanks kapitainsky


BTW - it also means that to make the checked parts as equal as possible, it is a good idea to use a power of 2 for t (the total number of parts). But I do not think it is really critical.

This is what I use for checks. It stores the latest part number in a file, which avoids the missed checks you can get when relying on simple day counting alone:

#!/usr/bin/env bash
set -o errexit

# file to remember data subset part to check
memo_file="/path/to/restic_check_part_memo_file.txt"

# number of parts used for check
m=32

# data subset part to check
[ -f "${memo_file}" ] || printf "1" > "${memo_file}"
n=$(cat "${memo_file}")

restic check --read-data-subset "${n}"/"${m}"

# data subset part to check management
n=$((n+1))
if [ "${n}" -gt "${m}" ]; then
  # start again from the beginning
  printf "1" > "${memo_file}"
else
  printf '%s' "${n}" > "${memo_file}"
fi

Actually, restic currently only uses the first byte of the ID. So for a small number of parts this doesn’t really matter. For a large number, some selections may be up to 2x larger than others. And more than 256 parts is currently not possible.
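
As a rough illustration of the size difference, the sketch below counts how many of the 256 possible first-byte values would land in a given part. It assumes parts are assigned as (first byte mod t) + 1, which is an assumption for illustration rather than a documented rule, and `bucket_size` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Number of first-byte values (0..255) that land in part n of t,
# under the assumed rule part = (byte mod t) + 1.
bucket_size() {
  # $1 = part number n, $2 = total number of parts t
  base=$(( 256 / $2 ))
  extra=$(( 256 % $2 ))
  if [ $(( $1 - 1 )) -lt "$extra" ]; then
    echo $(( base + 1 ))
  else
    echo "$base"
  fi
}

for t in 7 32 200; do
  echo "t=$t: part 1 covers $(bucket_size 1 "$t") byte values, part $t covers $(bucket_size "$t" "$t")"
done
```

Under this assumption, t=32 divides 256 evenly, t=7 gives parts that differ by a single byte value, and t=200 gives some parts twice as many byte values as others.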


@Denis Just for housekeeping purposes, please pick one of the answers and select it as being the Solution, so that we can mark this thread as resolved :slight_smile: Thanks!

Sorry, @kapitainsky, now I think I’m missing something… It seems to me that your last post contradicts your previous one. After the previous one, I was convinced I could simply run check with:

--read-data-subset=1/7 on Monday

--read-data-subset=2/7 on Tuesday

--read-data-subset=7/7 on Sunday

to obtain the guarantee you described, and moreover that the files created during week W0 will be completely checked in week W0+1.

So I thought I would not have to do anything more than that.

In your last post, you say you are using more complex logic with an ad-hoc file (“memo_file”)… So do I need this additional logic to be sure of obtaining that guarantee?


@Denis Your intuition was right – you do not need anything more complicated than what you wrote:

--read-data-subset=1/7 on Monday

--read-data-subset=2/7 on Tuesday

--read-data-subset=7/7 on Sunday

The script @kapitainsky showed is simply a helper script that keeps track of which n (piece) was last checked, and makes sure that the next time the script is run, the next n (piece) is the one that will be checked.

In other words, it changes what you have to do from running one out of seven commands, into running just one single command. But at the end of the day, if you are dividing the checks into seven week days, there’s no problem to solve here and no need for the script, since you can just write one single scheduled command to use “the current day number of the week” as n anyway.
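
A minimal sketch of such a single command, assuming a seven-part split and a scheduler (cron, a systemd timer, or similar) that runs it once a day. It relies on `date +%u`, which prints the ISO weekday number (1 for Monday through 7 for Sunday); `subset_for_today` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Build the n/t argument from the ISO weekday: 1 = Monday ... 7 = Sunday.
subset_for_today() {
  printf '%s/7' "$(date +%u)"
}

# Print the command instead of running it, so the sketch has no side effects.
echo "restic check --read-data-subset=$(subset_for_today)"
```

To use it for real, replace the final `echo` line with the restic call itself. Note that if a day is skipped, that day's part is not checked until the following week, which is the case the memo-file script is meant to handle.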

How did you implement the running of those seven commands so that the correct n was used on each day?