However, it is unclear to me how this command works:
--read-data-subset=n/t
I’m interested in it because I’d like to do a deep check (so, using --read-data) on my backup, but a complete check after every run takes too many resources. So, it would be great to split it over multiple days. I wonder if this option guarantees that by doing:
--read-data-subset=1/7 on Monday
--read-data-subset=2/7 on Tuesday
….
--read-data-subset=7/7 on Sunday
all files in the backup are covered. I mean, e.g., on Tuesday, how can Restic know what files have already been checked on Monday?
You are right that the set of --read-data-subset options you wrote will cover and check the entire repository over time, yes.
What happens is that restic takes the entire repository, “mentally” splits it up into seven pieces (t), and then checks only the one piece you selected with the number before the / (n). On Monday the first piece, on Tuesday the second piece, and so on. At the end of the week, you will have checked all seven pieces of the repository.
Restic doesn’t know what you checked in previous runs. All it knows is that this time, you want to check piece n out of t pieces, and then it does that. Because it likes you and wants to make you happy.
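To make the Monday-to-Sunday schedule concrete, here is a minimal sketch that derives n from the current weekday. The helper name subset_for_weekday is made up for this example, and the script only prints the restic command instead of running it; drop the echo once it looks right for your setup.

```shell
#!/usr/bin/env bash
# Build the --read-data-subset value for an ISO weekday
# (1 = Monday ... 7 = Sunday). The function name is hypothetical.
subset_for_weekday() {
    local day=$1
    printf '%s/7\n' "${day}"
}

# date +%u prints the ISO weekday number, 1 (Monday) through 7 (Sunday)
today=$(date +%u)

# Only print the command here; remove "echo" to actually run the check.
echo restic check --read-data-subset="$(subset_for_weekday "${today}")"
```

Note that this is exactly the "simple day counting" approach: if the job does not run on some day, that piece is skipped until the following week.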
Can you tell us which parts of the documentation about the --read-data-subset syntax you felt were hard to understand, and whether you have any ideas about how to improve it?
In addition, please note that this is deterministic. Files to check are selected based on their hashes (which happen to also be their names in hex). So, for example, --read-data-subset=11/16 will read all files starting with a, as the 16 possible values are 0…f (hexadecimal). As these hashes are random, each part is a very good approximation of an equal share of the files.
It means that if you run [1..7]/7 you have a 100% guarantee that by Sunday you will have checked all files that were present on Monday + some (but probably not all) files created during that week.
BTW - it also means that to make the checked parts as equal as possible, it is a good idea to use a power of 2 for t (the number of parts). But I do not think it is really critical.
This is what I am using for checks. It stores the latest part number in a file, which avoids the problem of missed checks that you get when only simple day counting is used:
#!/usr/bin/env bash
set -o errexit
# file to remember data subset part to check
memo_file="/path/to/restic_check_part_memo_file.txt"
# number of parts used for check
m=32
# data subset part to check
[ -f "${memo_file}" ] || printf "1" > "${memo_file}"
n=$(cat "${memo_file}")
restic check --read-data-subset "${n}"/"${m}"
# data subset part to check management
n=$((n+1))
if [ "${n}" -gt "${m}" ]; then
    # start again from the beginning
    printf "1" > "${memo_file}"
else
    printf '%s' "${n}" > "${memo_file}"
fi
Actually, restic currently only uses the first byte of the id. So, for a small number of parts this doesn’t really matter. For a large number, you may have some small and some larger (up to 2x) selections. And more than 256 parts is currently not possible.
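To see where the "up to 2x" comes from: a single byte has only 256 possible values, so t parts can each get either floor(256/t) or ceil(256/t) of them. The sketch below counts this for a few values of t. It assumes byte value v goes to part (v % t) + 1, which is only an illustration and not confirmed here as restic's actual assignment; the function name piece_sizes is made up too.

```shell
#!/usr/bin/env bash
# Count how many of the 256 possible first-byte values land in each part
# when they are split into t parts (1 <= t <= 256).
# ASSUMPTION (illustrative only): byte value v is assigned to part (v % t) + 1.
piece_sizes() {
    local t=$1 v i count min=256 max=0
    local -a sizes=()
    for ((v = 0; v < 256; v++)); do
        i=$((v % t))
        sizes[i]=$(( ${sizes[i]:-0} + 1 ))
    done
    # find the smallest and largest part
    for count in "${sizes[@]}"; do
        if ((count < min)); then min=${count}; fi
        if ((count > max)); then max=${count}; fi
    done
    printf 'min=%d max=%d\n' "${min}" "${max}"
}

piece_sizes 7     # min=36 max=37
piece_sizes 32    # min=8 max=8
piece_sizes 200   # min=1 max=2
```

With t=7 the parts differ by at most one byte value out of about 36, with a power of 2 such as 32 they are exactly equal, and with t=200 some parts cover twice as many byte values as others.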