There is already a pull request #3425 for “Make restore resumable” and issue #407, but nothing for check.
Even more urgent, from my perspective, is making restic check --read-data resumable, so that it does not have to start again from the beginning after an interruption: a check surely happens more often than a restore, and checking a whole repository takes longer than restoring a single snapshot.
What is your opinion?
Is “resumable check” already planned?
Should I create a feature request (enhancement)?
After about 40 hours of “read all data” (43%), I got the message:
restic : subprocess ssh: client_loop: send disconnect: Connection reset
unable to refresh lock: ssh command exited: exit status 255
as well as several variants of:
Load(<data/…>, …, 0) returned error, retrying …: ReadFull: connection lost
pack … failed to download: StreamPack: connection lost
and at the end:
[40:12:53] 42.90% 31460 / 73325 packs
error while unlocking: ssh command exited: exit status 255
Fatal: repository contains errors
Yes. Many users do this, and the point of this subset option is to check smaller parts of the repository more often, in order to cover it all without having to do it all in one go.
What values are you using for that option? Did you just split it into five total pieces? I would suggest many more pieces, so that each subset check run completes faster.
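As a concrete (made-up) schedule: instead of five pieces you could, for example, check one fiftieth per night, so that each run finishes well within a maintenance window:

restic check --read-data-subset=1/50   # night 1
restic check --read-data-subset=2/50   # night 2
…
restic check --read-data-subset=50/50  # night 50

If your restic version is recent enough (see restic check --help), the option also accepts a percentage or an absolute size (e.g. --read-data-subset=2% or --read-data-subset=50G), which makes it even easier to bound the duration of each run.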
Hope it’s not too much off-topic, but you can run rustic check in parallel with restic backup runs on a repository. It performs the same checks as restic check and does not lock the repository, as rustic is generally designed for lock-free operation.
You might, however, get some “pack {id} not referenced in index. Can be a parallel backup job.” warnings.
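For reference, a minimal sketch of that setup (repository path and flags are placeholders; check rustic check --help for the exact options of your version):

# terminal 1: regular backup with restic
restic -r /srv/backup/repo backup /home

# terminal 2, running at the same time: lock-free check with rustic
rustic -r /srv/backup/repo check

Since rustic does not take a repository lock, the two commands do not block each other.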
Before talking about “100% checked”, you should try to define which kinds of errors you would like check to detect.
The point is that the repository format is quite robust. A restic backup only adds new files to a repository and therefore cannot change data that has already been backed up. There are lots of checksums and integrity checks (also when running backup) which make it possible to detect changed data, and some self-healing can even be achieved by the backup command. When backup or other commands save repository files, they do so in a way that preserves integrity constraints, so even an aborted command does not leave the repository corrupted.
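To make the checksum point a bit more concrete: by the repository format, every file in the repository is named after the SHA-256 hash of its content, so silent corruption of a pack file is detectable even without the encryption key (path and ID below are only illustrative):

# the name of a file under data/ must equal the SHA-256 of its content
sha256sum /path/to/repo/data/4a/4ab1f0…    # printed hash must match the file name

restic check --read-data does this kind of verification for you, and additionally checks every blob inside each pack against its ID.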
So, the main question is which error scenarios you would like to check:
Errors in the storage (like bit flips)?
Errors when transmitting the data to the storage (which may, for example, lead to truncated files)?
Errors in CPU/memory which might create invalid data?
Software errors in restic which might make it misbehave in some (untested) cases?
Note that some of the above errors might even occur just after your check run reported “100% checked”…
The --read-data flag tells Restic to verify the integrity of the pack files in the repository.
It may even happen that a pack file from the first subset becomes corrupted while the second subset is being verified. In that case, too, “100% checked” would be reported.
The error would only be reported in a later check run, after all test results have been reset.
Discussion 1:
Does it generally make sense to verify the integrity of pack files?
Discussion 2:
Do we need an option to check 100%, even if new pack files are generated during the check process?
In my opinion, Restic should be able to find errors not only by chance, but should be able to say at the end: 100% checked!
Ideally, the storage system uses some form of erasure coding to ensure that the stored files are not corrupted. Then check primarily has to verify that the host that created a backup did not upload corrupted data. In that combination it would indeed be possible to check each pack file only once (although that would need some way to ensure that the pack file is actually read from storage and not from some cache somewhere in the system).
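For a repository on a local disk, one way to approximate “really read from storage, not from a cache” is to drop the operating-system page cache right before the check (Linux-specific sketch; as far as I know, restic check already avoids the regular restic cache by default and uses a temporary one):

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
restic -r /path/to/repo check --read-data

For remote backends the question shifts to whatever caching layers sit in the storage provider's stack, which the client cannot influence.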