Fatal: repository contains errors => how best to respond?

TomCloyd · November 3, 2021, 12:10am

Weekly, I run “prune” and “check --read-data” on all my repositories, which are merely backups of my /home account on my Kubuntu Linux OS. On the last run I got this error:

create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
[0:04] 100.00%  11 / 11 snapshots
read all data
pack 9302e12c contains 1 errors: [blob 0: ciphertext verification failed]
[31:28] 100.00%  7127 / 7127 packs
Fatal: repository contains errors

Is there any recovery possible here? I realize I could just delete and recreate the repository, but I’m wondering if there might be a less drastic solution.

UPDATE as of Nov. 3, 12:42 PM PDT (Seattle) =====

I’m glad to have stimulated a discussion here. One overarching impression I get is that the error is actually a group of errors AND that the error message isn’t telling us forever-amateurs much. THAT can be fixed. @rawtaz speaks of the corruption being outside of restic. Well, obviously, but WHAT is the corruption? Restic isn’t telling all it knows, I suspect. And most importantly, what is an appropriate response?

So far the only one I can grasp is to recreate the repository. That, at least, restarts my twice-daily backup cycle, though not without cost.
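
For anyone who wants to pull the damaged pack IDs out of a check log like the one above, here is a minimal Python sketch. It assumes only the two message shapes quoted in this thread ("pack ... contains N errors: [...]" and "Pack ID does not match, want ..., got ..."); other restic versions may word them differently, and the function name `damaged_packs` is made up for illustration.

```python
import re

# Patterns for the two message shapes quoted in this thread.
# Assumption: other restic versions may phrase these lines differently.
PACK_ERRORS = re.compile(r"^pack ([0-9a-f]+) contains \d+ errors?: \[(.*)\]$")
ID_MISMATCH = re.compile(r"^Pack ID does not match, want ([0-9a-f]+), got ([0-9a-f]+)$")


def damaged_packs(check_output: str) -> dict[str, str]:
    """Map each damaged pack's short ID to the reason reported for it."""
    found = {}
    for line in check_output.splitlines():
        line = line.strip()
        m = PACK_ERRORS.match(line)
        if m:
            found[m.group(1)] = m.group(2)
            continue
        m = ID_MISMATCH.match(line)
        if m:
            found[m.group(1)] = f"pack ID mismatch, got {m.group(2)}"
    return found


sample = """\
read all data
pack 9302e12c contains 1 errors: [blob 0: ciphertext verification failed]
[31:28] 100.00%  7127 / 7127 packs
Fatal: repository contains errors
"""
print(damaged_packs(sample))
# → {'9302e12c': 'blob 0: ciphertext verification failed'}
```

The short IDs it returns are the ones to look up with `restic find --pack`, as in the GitHub issue discussed further down.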

rawtaz · November 3, 2021, 1:11am

I wouldn’t expect that you need to start over; better to see if you can fix this one up. I was about to write some steps for that but realized that @MichaelEischer might have some investigation suggestions of interest first.

Eli6 · November 3, 2021, 4:45pm

These errors are quite common and are bothering me. I get tens of “pack ID does not match” and “blob ID does not match” errors.

Could an option be added to fix this?

For pack ID errors, if you still have the data source, you can remove the damaged packs from the repository, run rebuild-index, and back up again.

How about blob IDs? I suppose you can find their packs and remove those.

See the response of @MichaelEischer in issue 2191 on GitHub below. It would take only a few lines of code to automate this process, which is painstaking and error-prone to do manually when there are tens of damaged packs.

I don’t know why a repair option has not been added to restic. It may not solve all types of errors, but pack and blob ID errors can be fixed.

github.com/restic/restic

Unclear how to recover from "pack ID does not match" errors from "restic check"

opened 04:27PM - 27 Feb 19 UTC

adsbarratt

state: need feedback category: prune

Output of `restic version`
--------------------------

restic 0.9.2 compiled with go1.10.3 on windows/amd64 (for initial backup and check)
restic 0.9.4 compiled with go1.11.4 on windows/amd64 (for later diagnosis)

How did you run restic exactly?
-------------------------------

AWS_ACCESS_KEY_ID=foo
AWS_SECRET_ACCESS_KEY=foo
RESTIC_REPOSITORY=s3:https://s3.amazonaws.com/myrepo
RESTIC_PASSWORD=foo:bar

Ran `restic check --check-unused --read-data` against the repository after generating 3 snapshots. Output included:

> Pack ID does not match, want 7691f738, got e3b0c442

What backend/server/service did you use to store the repository?
----------------------------------------------------------------

S3

Expected behavior
-----------------

A simple means of resolving the error, given that I still have the original data available locally.

Actual behavior
---------------

`restic find --pack 7691f738` got me as far as

> Found blob 1b0ccb7af24b6221798ee900d9f5943e56a186f3f30707e5a7366033415a3e50
>  ... in file /f/2018-04/foo-Sun.7z
>      (tree 3229cbefa8f2e1b09726ef36f701516d01ae0d8c80339125508ca62e23ab7479)
>  ... in snapshot 5c8529d9 (2019-02-06 15:58:21)
>
> Found blob 11e01454070d44043df7536c6b9a2cde366d7aa97a066daf10d1049d4c7d1c55
>  ... in file /f/2018-04/foo-web-Sun.7z
>      (tree 3229cbefa8f2e1b09726ef36f701516d01ae0d8c80339125508ca62e23ab7479)
>  ... in snapshot 5c8529d9 (2019-02-06 15:58:21)

but it's unclear how I can resolve the issues. Will re-uploading the affected files and then running a forget on the repo simply replace the broken pack with the new version? (Admittedly in a new snapshot, but that's OK.)

Steps to reproduce the behavior
-------------------------------

Upload ~1.5TB of data from a USB-attached SSD from a Windows server to an S3-hosted repo. If it makes any difference, the upload was rate-limited and interrupted / restarted at a couple of points in order to adjust the rate limit. Unfortunately I'm not sure whether any of the affected files were being uploaded at the time.

Do you have any idea what may have caused this?
-----------------------------------------------

Possible I/O issue, as the drives are connected to a Windows server via a USB to SATA cable and at least once (although not while restic was running) the event log shows that Windows believes the drive was disconnected and reconnected (it wasn't).

Do you have an idea how to solve the issue?
-------------------------------------------

Better worked example of recovering from issues raised by `restic check`, particularly when --read-data was used.

Did restic help you or made you happy in any way?
-------------------------------------------------

Generally I've been very impressed with restic so far.
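
The manual procedure described in this post (remove the damaged packs, rebuild the index, back up again) could be partly scripted for a repository on a local disk. The following is only a sketch under stated assumptions: it assumes the repository stores pack files as `data/<first two hex characters>/<full pack ID>`, it moves files into a quarantine directory instead of deleting them, and `quarantine_packs` is a hypothetical helper, not part of restic.

```python
import shutil
from pathlib import Path


def quarantine_packs(repo: Path, short_ids: list[str], quarantine: Path) -> list[Path]:
    """Move pack files whose names start with one of short_ids out of repo/data.

    Assumption: a local repository laid out as data/<2 hex chars>/<pack ID>.
    Files are moved, not deleted, so the step can be undone by hand.
    """
    quarantine.mkdir(parents=True, exist_ok=True)
    moved = []
    for short_id in short_ids:
        subdir = repo / "data" / short_id[:2]
        matches = sorted(subdir.glob(short_id + "*")) if subdir.is_dir() else []
        # Refuse to guess when the short ID is missing or ambiguous.
        if len(matches) != 1:
            raise ValueError(f"{short_id}: expected exactly one pack, found {len(matches)}")
        target = quarantine / matches[0].name
        shutil.move(str(matches[0]), str(target))
        moved.append(target)
    return moved
```

After the damaged packs are out of the way, the remaining steps named in this thread are `restic rebuild-index`, then `restic backup` of the original source data, then `restic check` again. This only helps if the source data still exists, and it is worth trying on a copy of the repository first.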

rawtaz · November 3, 2021, 5:45pm

What is this even supposed to mean? You are getting those messages because something is wrong in your infrastructure. How do you propose that we “add an option” to fix problems with your infrastructure? Seriously. The only “fix” to the problem is to find the root cause of those errors and fix that.

Eli6 · November 3, 2021, 6:10pm

The source of the problem doesn’t matter. You can see that the OP, as well as many others, also report the same issue. An error could arise from anywhere.

And the procedure to fix that is clear (see issue 2191 above). I understand you may not be able to do it. Thus, I hope that the restic developers @fd0 or @MichaelEischer would chime in about the possibility of adding a repair option that would try to fix some of the errors.

Such options are sometimes part of backup software; see the repair option in Borg.

rawtaz · November 3, 2021, 6:24pm

You can take measures to fix a corrupted repository after the fact, but your suggestion to add an option to prevent the corruption in the first place, when that corruption is outside of restic, is obviously not possible in the real world.

TomCloyd · November 5, 2021, 7:11pm

There seems to be no more discussion on this issue. Reviewing GitHub issue #2191 (“Unclear how to recover from "pack ID does not match" errors from "restic check"”), I see a lot of information there, much of which is over my head.

I do see potentially useful ideas in several of the comments on that issue.

I will make note of all this for use the next time this problem or one like it occurs with “check --read-data”.

My thanks to all of you.

MichaelEischer · November 5, 2021, 8:52pm

The most up-to-date guide on how to recover from a damaged repository is GitHub issue #828, “Recover from broken pack file” (restic/restic).

Providing a repair option as suggested would require a change of the repository format. It is currently not possible to mark blobs as missing/corrupted, which would be necessary to implement this feature without causing additional problems later on. The alternatives would be to rewrite snapshots to remove the missing blobs or just delete all affected snapshots altogether.

The “Blob ID doesn’t match” errors indicate that the system creating the backup corrupts data while doing so. If that error is reported consistently by check, then that is de facto the only possible explanation.

T-6 · November 6, 2021, 7:39am

What could be the reason for this? Faulty RAM? Something else?

eric · November 6, 2021, 4:56pm

So, there is actually a repair option mentioned in issue 828. I was not aware of this. It’s just experimental.

I assume that, for blob ID mismatch errors, one repair procedure can be similar to that for pack ID errors: we can find the packs in which the damaged blobs are located, remove those packs entirely, run restic rebuild-index, then restic backup, and then restic check.

I removed all pack ID errors easily using the above procedure. This process can be automated, and doesn’t need a change of repository format. Errors are found by running restic check.

This approach works as long as the source data exists. Otherwise, I am afraid forgetting snapshots or, better, living with the errors are the only options.
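
A "pack ID does not match" error is a hash mismatch. As far as I understand restic's repository design documentation, a pack file is named after the SHA-256 of its own content, so the same comparison can be sketched in a few lines of Python for a pack file on a local disk. The function name `pack_id_matches` is made up for illustration and this is not how restic itself is implemented.

```python
import hashlib
from pathlib import Path


def pack_id_matches(pack_file: Path) -> tuple[bool, str]:
    """Return (ok, actual): whether the file name equals the SHA-256 of its content."""
    digest = hashlib.sha256()
    with pack_file.open("rb") as fh:
        # Read in 1 MiB chunks so large packs do not need to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    return actual == pack_file.name, actual


# SHA-256 of empty input, truncated to the 8 characters restic prints.
print(hashlib.sha256(b"").hexdigest()[:8])
# → e3b0c442
```

The printed value is the same as the "got e3b0c442" in the issue report quoted earlier in this thread, which suggests (but does not prove) that check received zero bytes for that pack rather than altered data.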

MichaelEischer · November 6, 2021, 6:02pm

Faulty RAM is one possibility. However, it is also possible that some part of, let’s say, the CPU miscomputes something. I’ve seen that recently on a CPU which didn’t get enough power to work correctly while using turbo boost.

Eli6 · November 6, 2021, 10:22pm

Yet, I ran RAM tests with memtest86 and hard disk tests with smartctl for half a day and didn’t find any errors. I recall these run some CPU tests as well.

So I am puzzled where these errors come from (mostly pack ID errors, sometimes blob ID or ciphertext verification). Perhaps the type of flaw that produces restic errors doesn’t show up in the above tests.

I begin to suspect that the Ubuntu OS with the latest kernel that I am running doesn’t match well with my laptop hardware.

MichaelEischer · November 7, 2021, 4:37pm

Did you try prime95? I’ve had some successes with it to detect CPU problems.

Eli6 · November 7, 2021, 5:15pm

I will try this.

As an update, today I backed up the same source data using both Restic and Borg, running simultaneously, with the repositories stored locally on the client machine.

There are integrity errors in both Restic and Borg, in roughly the same numbers. This means the problem is not in the restic code or the server (I had thought that restic, because it takes more RAM, might be prone to more integrity errors).

The number of integrity errors depends on whether the laptop is under load and if it’s on battery.

This means the hardware is most likely problematic somehow. Laptops don’t have ECC, are sometimes dropped and damaged, have components that are less durable due to size limitations, may be underpowered when on battery, etc.

I am surprised that the laptop (encrypted with LUKS) actually functions with ~10 integrity errors per 30 minutes.
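
One cheap way to look for this kind of silent corruption without any backup tool is to hash the same file repeatedly and compare the digests, since healthy hardware gives the same result every time. The Python sketch below is an illustration only: `digest_rounds` is a hypothetical helper, and because repeated reads are usually served from the page cache, it mostly exercises RAM and CPU rather than the disk.

```python
import hashlib
import os
import tempfile


def digest_rounds(path: str, rounds: int = 5) -> set[str]:
    """Hash the same file several times and return the distinct digests seen."""
    seen = set()
    for _ in range(rounds):
        h = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                h.update(chunk)
        seen.add(h.hexdigest())
    return seen


if __name__ == "__main__":
    # Hypothetical test data: 8 MiB of random bytes written once.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        digests = digest_rounds(tmp.name)
        if len(digests) == 1:
            print("consistent")
        else:
            print(f"MISMATCH: {len(digests)} distinct digests")
    finally:
        os.unlink(tmp.name)
```

Running it under load and on battery, the conditions under which the error count changed here, would be the interesting comparison. A clean run proves little, but a single mismatch points at the machine rather than at restic or Borg.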

fede · June 21, 2024, 2:11pm

Thanks a lot for the update. Very valuable information, because I am having a similar problem.