Fatal: repository contains errors => how best to respond?

Every week I run “prune” and “check --read-data” on all my repositories, which are simply backups of my /home account on my Kubuntu Linux OS. On the last run I got this error:

create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
[0:04] 100.00%  11 / 11 snapshots
read all data
pack 9302e12c contains 1 errors: [blob 0: ciphertext verification failed]
[31:28] 100.00%  7127 / 7127 packs
Fatal: repository contains errors

Is there any recovery possible here? I realize I could just delete and recreate the repository, but I’m wondering if there might be a less drastic solution.

UPDATE as of Nov. 3, 12:42 PM PDT (Seattle) =====

I’m glad to have stimulated a discussion here. One overarching impression I get is that the error is actually a group of errors AND that the error message isn’t telling us forever-amateurs much. THAT can be fixed. @rawtaz speaks of the corruption being outside of restic. Well, obviously, but WHAT is the corruption? Restic isn’t telling all it knows, I suspect. And most importantly, what is an appropriate response?

So far the only one I can grasp is to recreate the repository. That, at least, restarts my twice-daily backup cycle, though not without cost.

I wouldn’t expect that you need to start over, better see if you can fix this one up. I was about to write some steps for that but realized that @MichaelEischer might have some investigation suggestions of interest first.

These errors are quite common and are bothering me. I get tens of “pack ID does not match” and “blob ID does not match” errors.

Could an option be added to fix this?

For pack ID errors, if you still have the source data, you can remove the damaged packs from the repository, run rebuild-index, and back up again.
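
To make the idea of a "pack ID error" concrete: restic names each pack file after the SHA-256 hash of its contents, so a pack whose contents no longer hash to its file name is damaged. The following is only a minimal sketch of that comparison, not restic's own code; the function name `pack_id_matches` is made up for illustration, and it assumes you can read the pack file directly from a local path.

```python
import hashlib
from pathlib import Path


def pack_id_matches(pack_path, chunk_size=1024 * 1024):
    """Return True if the SHA-256 of the file's contents equals its file name.

    Hypothetical helper for illustration; assumes the file name is the
    full lower-case hex pack ID.
    """
    digest = hashlib.sha256()
    with open(pack_path, "rb") as f:
        # Read in chunks so large pack files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == Path(pack_path).name
```

A result of False only says the file differs from what was originally written; it does not say whether the damage happened on disk, in RAM, or during transfer.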

How about blob IDs? I suppose you can find their packs and remove those.

See the response of @MichaelEischer in issue 2191 in GitHub below. It’s a few lines of code to automate this process, which is painstaking and error prone to do manually when there are tens of damaged packs.
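
As a rough illustration of the first step of such automation, the sketch below collects the damaged pack IDs from saved “check --read-data” output. It is not the code from issue 2191. The regular expression is modelled only on the single error line quoted at the top of this thread, so other error messages may need other patterns; `damaged_packs` and `PACK_ERROR_RE` are hypothetical names.

```python
import re

# Pattern modelled on the line:
#   pack 9302e12c contains 1 errors: [blob 0: ciphertext verification failed]
PACK_ERROR_RE = re.compile(
    r"^pack (?P<id>[0-9a-f]+) contains (?P<count>\d+) errors?: \[(?P<detail>.*)\]$"
)


def damaged_packs(check_output):
    """Map each reported pack ID (as printed, possibly shortened) to its error detail."""
    result = {}
    for line in check_output.splitlines():
        match = PACK_ERROR_RE.match(line.strip())
        if match:
            result[match.group("id")] = match.group("detail")
    return result


if __name__ == "__main__":
    sample = (
        "read all data\n"
        "pack 9302e12c contains 1 errors: [blob 0: ciphertext verification failed]\n"
        "[31:28] 100.00%  7127 / 7127 packs\n"
        "Fatal: repository contains errors\n"
    )
    for pack_id, detail in damaged_packs(sample).items():
        print(pack_id, "->", detail)
```

The output of a real run can be captured to a file first and then passed to `damaged_packs` as a string.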

I don’t know why a repair option has not been added to restic. It may not solve all types of errors, but pack and blob ID errors can be fixed.

What is this even supposed to mean? You are getting those messages because something is wrong in your infrastructure. How do you propose that we “add an option” to fix problems with your infrastructure? Seriously. The only “fix” to the problem is to find the root cause of those errors and fix that.

The source of the problem doesn’t matter. You can see that the OP, as well as many others, also report the same issue. An error could arise from anywhere.

And the procedure to fix that is clear (see issue 2191 above). I understand you may not be able to do it. Thus, I hope that the restic developers @fd0 or @MichaelEischer would chime in about the possibility of adding a repair option that would try to fix some of the errors.

Such options are sometimes part of backup software; see the repair option in Borg.

You can take measures to fix a corrupted repository after the fact, but your suggestion to add an option to prevent the corruption in the first place, when that corruption is outside of restic, is obviously not possible in the real world.

There seems to be no more discussion on this issue. Reviewing GitHub issue #2191 in restic/restic, titled Unclear how to recover from "pack ID does not match" errors from "restic check", I see a lot of information there, much of which is over my head.

I do see potentially useful ideas here:

and possibly here:

I will make note of all this for use the next time this problem or one like it occurs with “check --read-data”.

My thanks to all of you.

The most up-to-date guide on how to recover from a damaged repository is GitHub issue #828 in restic/restic, titled Recover from broken pack file.

Providing a repair option as suggested would require a change of the repository format. It is currently not possible to mark blobs as missing/corrupted, which would be necessary to implement this feature without causing additional problems later on. The alternatives would be to rewrite snapshots to remove the missing blobs or just delete all affected snapshots altogether.

The “Blob ID doesn’t match” errors indicate that the system creating the backup corrupts data while doing so. If that error is reported consistently by check, then that is de facto the only possible explanation.

What could be the reason for this? Faulty RAM? Something else?

  • So, there is actually a repair option mentioned in issue 828. I was not aware of this. It’s just experimental.

  • I assume that, for blob ID mismatch errors, a repair procedure can be similar to that for pack ID errors: we can find the packs in which the damaged blobs are located, remove those packs entirely, run restic rebuild-index, then restic backup, and then restic check.

I removed all pack ID errors easily using the above procedure. This process can be automated, and doesn’t need a change of repository format. Errors are found by running restic check.
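
The "remove those packs" step could be sketched as below for a repository on a local disk. This is an illustration under stated assumptions, not an official procedure: it assumes the usual local layout in which pack files live under data/<first two hex characters>/<full pack ID>, and it moves the file to a separate directory instead of deleting it so the step can be undone. The names `find_pack_files` and `quarantine_pack` are hypothetical.

```python
import shutil
from pathlib import Path


def find_pack_files(repo_dir, pack_id_prefix):
    """List pack files whose name starts with the given (possibly short) ID."""
    if len(pack_id_prefix) < 2:
        raise ValueError("need at least two hex characters of the pack ID")
    # Assumed layout: <repo>/data/<first two hex chars>/<full pack ID>
    data_dir = Path(repo_dir) / "data" / pack_id_prefix[:2]
    if not data_dir.is_dir():
        return []
    return sorted(p for p in data_dir.iterdir() if p.name.startswith(pack_id_prefix))


def quarantine_pack(repo_dir, pack_id_prefix, quarantine_dir):
    """Move exactly one matching pack file out of the repository and return its new path."""
    matches = find_pack_files(repo_dir, pack_id_prefix)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    target_dir = Path(quarantine_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / matches[0].name
    shutil.move(str(matches[0]), str(target))
    return target
```

After moving the damaged packs aside, the remaining steps are the ones described above: restic rebuild-index, then restic backup of the source data, then restic check. Refusing to act on zero or multiple matches is a deliberate choice, since a short ID could in principle be ambiguous.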

This approach works as long as the source data still exists. Otherwise, I am afraid forgetting the affected snapshots or, better, living with the errors are the only options.

Faulty RAM is one possibility. However, it is also possible that some part of, let’s say, the CPU miscomputes something. I’ve seen that recently on a CPU which didn’t get enough power to work correctly while using turbo boost.

Yet, I ran RAM tests with memtest86 and hard disk tests with smartctl for half a day and didn’t find any errors. I recall these run some CPU tests as well.

So I am puzzled as to where these errors come from (mostly pack ID errors, sometimes blob ID or ciphertext verification). Perhaps the type of flaw that produces restic errors doesn’t show up in the above tests.
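
One cheap extra check, closer to what a backup program actually does, is to hash the same large file several times and see whether the digests agree. This is only a sketch of that idea and not a replacement for memtest86 or prime95; the names `sha256_of` and `repeated_hashes_agree` are made up for illustration. Note that after the first pass the file is usually served from the page cache, so this mainly exercises RAM and CPU rather than the disk.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_hashes_agree(path, rounds=5):
    """Hash the same file several times; return (all_equal, set_of_digests)."""
    digests = {sha256_of(path) for _ in range(rounds)}
    return len(digests) == 1, digests
```

If the digests ever differ for a file that is not being modified, something between storage and CPU is corrupting data. Agreement proves nothing, since intermittent faults may simply not have occurred during the run.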

I begin to suspect that the Ubuntu OS with latest kernel that I am running doesn’t match well with my laptop hardware.

Did you try prime95? I’ve had some successes with it to detect CPU problems.

I will try this.

As an update, today I backed up the same source data using both Restic and Borg, running simultaneously, with repository stored locally in client machine.

There are integrity errors in both Restic and Borg, in roughly the same numbers. This means the problem is not in the restic code or the server (I had thought that restic, because it uses more RAM, might be more prone to integrity errors).

The number of integrity errors depends on whether the laptop is under load and if it’s on battery.

This means the hardware is most likely problematic somehow. Laptops don’t have ECC, are sometimes dropped and damaged, have components that are less durable due to size limitations, may be underpowered when on battery, etc.

I am surprised how the laptop (encrypted with LUKS) actually functions with ~ 10 integrity errors/30 mins.
