Make check resumable

There is already a pull request #3425 for “Make restore resumable” and issue #407, but nothing for check.

Even more urgent from my perspective: restic check --read-data should not have to start from the beginning after an interruption, because restic check surely happens more often than restic restore, and checking a repository takes longer than restoring a snapshot.

What is your opinion?

  • Is “resumable check” already planned?
  • Should I create a feature request (enhancement)?

After about 40 hours of “read all data” (43%), I got this message:

restic : subprocess ssh: client_loop: send disconnect: Connection reset
unable to refresh lock: ssh command exited: exit status 255

as well as several variants of:

Load(<data/…>, …, 0) returned error, retrying …: ReadFull: connection lost
pack … failed to download: StreamPack: connection lost

and at the end:

[40:12:53] 42.90% 31460 / 73325 packs
error while unlocking: ssh command exited: exit status 255
Fatal: repository contains errors

Sounds like you need to fix the connectivity issues you have there…

Meanwhile, making use of the --read-data-subset option to check might be useful!

Backups are important, even in an imperfect world with connectivity issues.
Please ignore the 2nd part (“After about 40 hours …”) of my original post.

Does it make sense to use the --read-data-subset option to check the next part after each new backup?

I’m doing this right now. About 20% is checked in 24 hours. Normally a backup is made every day.

If restic stored the check results in its cache, the check could be resumed after the next backup.


  • restic check stores check results in its local cache, and a resumed run checks everything that was not checked before.
  • A command to reset the test results in the local cache would be required to start over.

Yes. Many users do this, and the point of the subset option is to check smaller parts of the repository more often, covering all of it without having to do it in one go.

What values are you using for that option? Did you just split it into five total pieces? I would suggest many more pieces, so each subset check run completes faster.
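For illustration, here is one way to cycle through the pieces automatically, one per run (e.g. triggered after each daily backup). The counter file and the piece count are made up for this sketch; only the `--read-data-subset=n/t` syntax comes from restic itself:

```shell
#!/bin/sh
# Cycle through "pieces" subsets, one per run.
# The state file is a hypothetical helper, not something restic maintains.

next_subset() {
    state="$1"; pieces="$2"
    n=0
    [ -f "$state" ] && n=$(cat "$state")
    [ -n "$n" ] || n=0
    n=$(( n % pieces + 1 ))      # advance, wrapping back to 1 after "pieces"
    echo "$n" > "$state"
    echo "$n/$pieces"
}

# Example: split the repository into 30 pieces instead of 5.
# (In real use, point the state file at a persistent path,
# e.g. ~/.cache/check-subset, and replace the echo with the real call.)
state=$(mktemp)
subset=$(next_subset "$state" 30)
echo "restic check --read-data-subset=$subset"
# → restic check --read-data-subset=1/30
```

After 30 daily runs, every subset has been read once; a smaller piece count means fewer runs but longer individual runs.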

@TorstenC Could it be that you are basically looking for a check which would be able to run parallel to backup runs?

Yes, that’s right; now we’ve got to the point. Thanks for the clarification.

Hope it’s not too much off-topic, but you can run rustic check in parallel to restic backup runs on a repository. It performs the same checks as restic check and does not lock the repository, as rustic is generally designed for lock-free operation.
You might, however, get some pack {id} not referenced in index. Can be a parallel backup job. warnings.

Yes, it is possible to check random subsets.

It would be more reassuring and reliable if restic did not find errors only by chance, but could say at the end: 100% checked!

To that end, the following idea or proposal:

  • restic check stores check results in its local cache and the next subset is determined from the parts for which there are no check results yet.

Before talking about “100% checked”, you should try to define which kinds of errors you would like check to detect.

The point is that the repository format is quite robust. A restic backup only adds new files to a repository and therefore cannot change already-backed-up data. There are lots of checksums and integrity checks (also when running backup) which allow restic to detect changed data, and even some self-healing can be achieved by the backup command. When backup or other commands save repo files, they are saved in a way that preserves integrity constraints, so even an aborted command will not lead to a corrupted repo.
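One concrete consequence of those checksums: every file in a restic repository is named after the SHA-256 hash of its content, so corruption of a stored file can in principle be spotted with plain tools. A rough sketch of that idea (an illustration only; restic check performs this verification, and more, itself):

```shell
#!/bin/sh
# Verify that a repository file still matches the SHA-256 it is named after.

verify_name() {
    f="$1"
    want=$(basename "$f")
    got=$(sha256sum "$f" | cut -d' ' -f1)
    [ "$got" = "$want" ] && echo "OK $f" || echo "CORRUPT $f"
}

# Demo with a throwaway file named after its own hash:
dir=$(mktemp -d)
printf 'example pack content' > "$dir/tmp"
mv "$dir/tmp" "$dir/$(sha256sum "$dir/tmp" | cut -d' ' -f1)"
verify_name "$dir"/*              # prints "OK <path>"
rm -r "$dir"
```

Note this only detects corruption at rest; it says nothing about whether the data was already bad when it was uploaded.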

So, the main question is which error scenarios you would like to check:

  • Errors in the storage? (like bitflips)?
  • Errors when transmitting the data to the storage (which may e.g. lead to truncated files)?
  • Errors in CPU/memory which might create invalid data?
  • Software errors in restic which might make restic work wrong in some (untested) cases?

Note that some of the above errors might even occur just after your check run reported “100% checked”…

Thanks for the examples on the error scenarios.

The --read-data flag tells Restic to verify the integrity of the pack files in the repository.

It may even happen that a pack file in the first subset is corrupted during the verification of the second subset. In that case, too, “100% checked” would be reported.
The error would only be reported in a later check run, after a reset of all test results.

Discussion 1:
Does it generally make sense to verify the integrity of pack files?

Discussion 2:
Do we need an option to check 100%, even if new pack files are generated during the check process?

In my opinion, Restic should not only be able to find errors by chance, but should be able to say at the end: 100% checked!

Does it generally make sense to verify the integrity of pack files?

Definitely. Without checking the pack files you may not have a backup.

Do we need an option to check 100%, even if new pack files are generated during the check process?

Restic creates an exclusive lock when a check is in progress. This prevents the repository being modified until the check has finished.

Ideally, the storage system uses some form of erasure coding to ensure that the stored files are not corrupted. Then check primarily has to verify that the host that created a backup did not upload corrupted data. In that combination it would indeed be possible to check each pack file only once (although that would need some way to ensure that the pack file is actually read from storage and not from some cache somewhere in the system).