Make check resumable

There is already a pull request #3425 for “Make restore resumable” and issue #407, but nothing for check.

Even more urgent from my perspective: restic check --read-data should not have to start from the beginning after an interruption, because restic check surely happens more often than restic restore, and checking a repository takes longer than restoring a snapshot.

What is your opinion?

  • Is “resumable check” already planned?
  • Should I create a feature request (enhancement)?

After about 40 hours of “read all data” (43%), I got this message:

restic : subprocess ssh: client_loop: send disconnect: Connection reset
unable to refresh lock: ssh command exited: exit status 255

as well as several variants of:

Load(<data/…>, …, 0) returned error, retrying …: ReadFull: connection lost
pack … failed to download: StreamPack: connection lost

and at the end:

[40:12:53] 42.90% 31460 / 73325 packs
error while unlocking: ssh command exited: exit status 255
Fatal: repository contains errors

Sounds like you need to fix the connectivity issues you have there…

Meanwhile, making use of the --read-data-subset option to check might be useful!

Backups are important, even in an imperfect world with connectivity issues.
Please ignore the 2nd part (“After about 40 hours …”) of my original post.

Does it make sense to use the --read-data-subset option to check the next part after each new backup?

I’m doing this right now. About 20% is checked in 24 hours. Normally a backup is made every day.

If restic stored the check results in its cache, the check could be resumed after the next backup.


  • restic check stores check results in its local cache, and a resumed run checks everything that was not checked before.
  • A command to reset the test results in the local cache would be required to start over.

Yes. Many users do this, and the point of the subset option is to check smaller parts of the repository more often, covering all of it without having to do it in one go.

What values are you using for that option? Did you just split it into five total pieces? I would suggest many more pieces, so each subset check run completes faster.
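For illustration, here is one way to cycle through the pieces automatically, one per run (e.g. triggered after each daily backup). The counter file and the piece count are made up for this sketch; only the `--read-data-subset=n/t` syntax comes from restic itself:

```shell
#!/bin/sh
# Cycle through "pieces" subsets, one per run.
# The state file is a hypothetical helper, not something restic maintains.

next_subset() {
    state="$1"; pieces="$2"
    n=0
    [ -f "$state" ] && n=$(cat "$state")
    [ -n "$n" ] || n=0
    n=$(( n % pieces + 1 ))      # advance, wrapping back to 1 after "pieces"
    echo "$n" > "$state"
    echo "$n/$pieces"
}

# Example: split the repository into 30 pieces instead of 5.
# (In real use, point the state file at a persistent path,
# e.g. ~/.cache/check-subset, and replace the echo with the real call.)
state=$(mktemp)
subset=$(next_subset "$state" 30)
echo "restic check --read-data-subset=$subset"
# → restic check --read-data-subset=1/30
```

After 30 daily runs, every subset has been read once; a smaller piece count means fewer runs but longer individual runs.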

@TorstenC Could it be that you are basically looking for a check which would be able to run parallel to backup runs?

Yes, that’s right; now we’ve got to the point. Thanks for the clarification.

Hope it’s not too much off-topic, but you can run rustic check in parallel to restic backup runs on a repository. It performs the same checks as restic check and does not lock the repository, as rustic is generally designed for lock-free operation.
You might, however, get some pack {id} not referenced in index. Can be a parallel backup job. warnings.

Yes, it is possible to check random subsets.

It would be more reassuring and reliable if restic did not find errors only by chance, but could say at the end: 100% checked!

To that end, the following idea or proposal:

  • restic check stores check results in its local cache and the next subset is determined from the parts for which there are no check results yet.

Before talking about “100% checked”, you should try to define which kinds of errors you would like check to detect.

The point is that the repository format is quite robust. A restic backup only adds new files to a repository and therefore cannot change already-backed-up data. There are lots of checksums and integrity checks (also when running backup) which allow restic to detect changed data, and even some self-healing can be achieved by the backup command. When backup or other commands save repo files, they are saved in a way that preserves integrity constraints, so even an aborted command will not lead to a corrupted repo.
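One concrete consequence of those checksums: every file in a restic repository is named after the SHA-256 hash of its content, so corruption of a stored file can in principle be spotted with plain tools. A rough sketch of that idea (an illustration only; restic check performs this verification, and more, itself):

```shell
#!/bin/sh
# Verify that a repository file still matches the SHA-256 it is named after.

verify_name() {
    f="$1"
    want=$(basename "$f")
    got=$(sha256sum "$f" | cut -d' ' -f1)
    [ "$got" = "$want" ] && echo "OK $f" || echo "CORRUPT $f"
}

# Demo with a throwaway file named after its own hash:
dir=$(mktemp -d)
printf 'example pack content' > "$dir/tmp"
mv "$dir/tmp" "$dir/$(sha256sum "$dir/tmp" | cut -d' ' -f1)"
verify_name "$dir"/*              # prints "OK <path>"
rm -r "$dir"
```

Note this only detects corruption at rest; it says nothing about whether the data was already bad when it was uploaded.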

So, the main question is which error scenarios you would like to check:

  • Errors in the storage? (like bitflips)?
  • Errors when transmitting the data to the storage (which may e.g. lead to truncated files)?
  • Errors in CPU/memory which might create invalid data?
  • Software errors in restic which might make restic work wrong in some (untested) cases?

Note that some of the above errors might even occur just after your check run reported “100% checked”…

Thanks for the examples on the error scenarios.

The --read-data flag tells Restic to verify the integrity of the pack files in the repository.

It may even happen that a pack file in the first subset is corrupted during the verification of the second subset. In that case, too, “100% checked” would be reported.
The error would only be reported in a later check run, after a reset of all test results.

Discussion 1:
Does it generally make sense to verify the integrity of pack files?

Discussion 2:
Do we need an option to check 100%, even if new pack files are generated during the check process?

In my opinion, Restic should not only be able to find errors by chance, but should be able to say at the end: 100% checked!

Does it generally make sense to verify the integrity of pack files?

Definitely. Without checking the pack files you may not have a backup.

Do we need an option to check 100%, even if new pack files are generated during the check process?

Restic creates an exclusive lock when a check is in progress. This prevents the repository being modified until the check has finished.

Ideally, the storage system uses some form of erasure coding to ensure that the stored files are not corrupted. Then check primarily has to verify that the host that created a backup did not upload corrupted data. In that combination it would indeed be possible to check each pack file only once (although that would need some way to ensure that the pack file is actually read from storage and not from some cache somewhere in the system).