Recovery options for damaged repositories

We’ve seen a lot of questions recently about damaged repositories, and I think they highlight two areas where we could improve the project:

Documentation for fixing broken repositories

There doesn’t appear to be any official documentation (that I can find) that deals with how to fix a repository that has become broken. I might do a write-up on these methods and submit a PR to include it in the documentation. There are examples all over the forum, but no central “help me fix my problem” document.

Features for fixing broken repositories

Right now, there are only a few ways that damage can be corrected. In all cases, running rebuild-index ASAP is of paramount importance, because it prevents future backups from creating broken snapshots: if missing objects are still listed in the index, backup clients will deduplicate against them instead of re-uploading the data.

I would go so far as to say that any “object is missing” error message in restic should be accompanied by this advice: run rebuild-index NOW or future backups may be immediately broken.
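As a rough sketch (the repository path is just a placeholder), the immediate response to such an error might look like this:

```
# See which objects the repository is missing
restic -r /srv/restic-repo check

# Rebuild the index so missing objects are no longer listed in it; otherwise
# future backups may deduplicate against data that no longer exists
restic -r /srv/restic-repo rebuild-index

# Re-check to confirm the index now matches what is actually stored
restic -r /srv/restic-repo check
```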

The simplest fix is to drop snapshots that refer to missing objects, but this is taking a sledgehammer to the problem when tweezers might be more appropriate. This is particularly unhelpful when the missing objects appear in every snapshot – we can’t feasibly recommend that people delete everything.
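For completeness, the sledgehammer looks roughly like this (the blob and snapshot IDs below are made up, and restic find --blob needs a reasonably recent restic):

```
# Find out which snapshots reference a blob that check reported as missing
restic -r /srv/restic-repo find --blob 1234abcd

# Drop each snapshot that references it
restic -r /srv/restic-repo forget 79766175
restic -r /srv/restic-repo forget bdbd3439

# Rebuild the index, then prune the now-unreferenced data
restic -r /srv/restic-repo rebuild-index
restic -r /srv/restic-repo prune
```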

Another option is to let the problem be if the missing data is still present on at least one backup client, because a future backup operation could re-add the object. This is the ideal resolution, but it becomes problematic if the damage is not corrected quickly, because prune will refuse to operate on the repository.

If the repository has been mirrored off-site, it’s possible to find the pack containing the missing object on the mirror, copy it into the broken repository, and run rebuild-index to heal it.
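Sketched out, assuming the mirror is a plain directory copy of the repository (pack ID and paths are made up and truncated):

```
# Suppose check on the broken repository complains about pack 6a57cb85…
# With the local/SFTP backend layout, pack files live under
# data/<first two hex characters of the pack ID>/<full pack ID>,
# so copy that file back in from the mirror:
cp /mnt/mirror/restic-repo/data/6a/6a57cb85… /srv/restic-repo/data/6a/

# Then rebuild the index so restic notices the pack again
restic -r /srv/restic-repo rebuild-index
restic -r /srv/restic-repo check
```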

If the object exists in another repository that does not share the master key, it can still be used to fix the damage, though not directly: one has to restore the affected files/directories from the other repository and then back them up into the broken repository, hoping that any missing tree object is recreated with exactly the same hash.
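A rough sketch of that workaround (paths and snapshot selection are placeholders; whether a missing tree really comes back with the same hash depends on the file metadata and paths matching exactly):

```
# Restore the affected files from the healthy repository to their original path
restic -r /srv/other-repo restore latest --target / --include /home/user/docs

# Back that path up into the broken repository; if the recreated blobs and
# trees hash to the same IDs as the missing objects, the damage is repaired
restic -r /srv/restic-repo backup /home/user/docs

# Verify the result
restic -r /srv/restic-repo check
```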

Finally, if none of this is possible, we need a way to do one of two things:

  1. Permanently acknowledge to restic that we know that a specific set of objects is missing, such that restic will not consider their absence to be an error.
    • check and prune should not abort due to the missing object, though check might display a non-fatal diagnostic confirming the absence of the objects.
    • If any acknowledged-missing object is re-added to the repository by a future backup operation, it should be removed from the acknowledged-missing list.
    • If a dump operation needs a missing object, it should still fail because it cannot complete.
    • If a restore operation needs a missing object, it should emit a warning and skip over the affected tree/file. (Perhaps an optional flag could instruct restore to fail anyway, if this behavior is desired.)
  2. Cut the missing objects out of all snapshots by rebuilding them into new snapshots, omitting the files/directories that need the missing objects. The old snapshots would be automatically forgotten. (This would be similar to a git rebase or git commit --amend operation; the ID of the snapshots would be permanently changed, but there is already precedent in restic for doing this. See restic tag.)

Each approach has pros and cons.

#1 pros: there is the possibility for future backups to fix the damage; snapshot IDs don’t change.

#1 cons: this change would probably be pretty invasive and touch a lot of the code that reads from a repository; it would also need a new repository concept to record acknowledged-missing objects.

#2 pros: this approach is very straightforward and requires no changes to existing restic code; it would only add a new command.

#2 cons: there is no possibility of a future backup healing the damage, because the original snapshots are forgotten and the rewritten ones no longer reference the missing objects; snapshot IDs would also change, which could confuse systems external to restic that track restic snapshots.

I welcome feedback on all of the above, particularly from @fd0 as he’ll have to approve the PRs. :slight_smile:


Great ideas here, thanks for posting them! I have something else to add: We need to find out why pack files go missing. It feels to me that with the number of reports, there must be a bug. But I don’t have any idea where to start debugging, usually people come to us when it’s already too late. So I’m thinking about adding a “log file” feature, something similar to what git reflog does, which records e.g. the result of a backup or prune operation (which packs were added and removed). Maybe this will help us to confirm that there is a bug…


I’m not totally convinced of that yet, as many of the damage reports have been accompanied by acknowledgements that the reporter accidentally deleted pack files manually, or noticed filesystem corruption, disk failure, or some other kind of hardware failure that could have contributed to a corrupt pack.

(Note that I’m not opposed to adding such a log; it would be a helpful diagnostic feature. I’m just not sure that the missing packs point to a restic bug yet. Other explanations seem more likely to me.)

@fd0 I guess I need to walk back my statement… we’ve just been hit by this. We’re missing a tree and a few blobs. Thankfully we caught it quickly.

I should point out that I pull from the main repository with rclone copy --immutable from an offsite computer daily. Nothing is ever deleted on the offsite computer, and a restic check on that system still found that a tree and some objects were missing. This would very strongly imply that the problem lies in some way with restic backup and not with restic prune, as we never prune the offsite system.
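For reference, the mirroring setup is roughly the following (the remote name and paths are placeholders):

```
# Daily pull from the primary repository; --immutable refuses to modify files
# that already exist on the destination, and copy never deletes anything
rclone copy --immutable primary:/srv/restic-repo /srv/restic-mirror

# The mirror is then checked with restic itself
restic -r /srv/restic-mirror check
```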

That at least gives us a direction to look.

In a word: ouch. In more words: after fighting a 2-month uphill battle and finally succeeding in having restic back up my entire 30TB local backup to the cloud, I’m very interested in seeing a diagnosis and then a resolution to this ASAP… :-/

Cheers,
– Durval.

Does check --read-data confirm that the repository is damage-free?

It’s frightening to read reports like these.

As long as the snapshot list is what you expect to see, yes. This operation verifies the linkage of all files and directories, and also reads every blob in the repository and verifies that its hash is correct.
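In other words, something along these lines (the repository path is a placeholder):

```
# First make sure the snapshot list itself is what you expect
restic -r /srv/restic-repo snapshots

# Then verify the structure and read back every pack, re-checking data hashes;
# this reads the entire repository once, so it can take a long time
restic -r /srv/restic-repo check --read-data
```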

Hi guys,

first of all: thank you so much for what appears to be a great tool! I finished my first backup of 250GB and it worked out very smoothly. Thank you!

However, the information in this thread really makes me worry. Do I understand correctly that at any time I could destroy my whole backup purely by running “restic backup”? Does this mean I should always keep a copy of the whole backup before running a backup?

If I want to ensure I don’t lose anything, would it be enough to make a copy of the config, index, keys, and snapshot dirs, or do I additionally need a copy of the whole data dir?

@matt How do you deal with this in Relica? I would like to use Relica to backup the IT in my brother’s pastry shop but he would not be able to manually repair anything on the command line.

I think there should be a GitHub Issue for this thing but I cannot find one. Did I miss it?

Thank you for your support and for all your efforts!
Pascal

Is there any indication suggesting that these problems are not due to hardware issues? E.g. in @cdhowie’s case, can you verify with certainty that it’s not a hardware problem?

No, I cannot prove a negative.

So for all we know, every one of these could still be a hardware issue :slight_smile: Unless someone is running a damaged repository on a filesystem with enough redundancy that it is practically certain the files on disk haven’t been corrupted.

The damaged repository in my case is on a RAID10.

Fine, but that doesn’t say much IMO. I’ve had expensive RAID controllers screw up. What I’d consider more trustworthy storage would be ZFS with a nice redundancy configuration that can really verify the blocks.

FWIW it’s dmraid. I’m not ruling out hardware, but at this point I consider it extremely unlikely, as this would be the only symptom we’ve observed, and it happened over a month ago without a recurrence.

I’d also point out that I don’t think disk failure would cause this problem, as the pack checksums match. It’s way more reasonable for it to be bad RAM/CPU on the machine running restic backup, IMO.


Do you mean NOT opposed? I think it would be a great thing to have, sooner rather than later.

I just checked, and we have a TODO in the code for repairs after we find out how to efficiently and safely perform them in an automated fashion.

Oops, yes, that’s what I meant. I’ve corrected my post.

@cdhowie
I proposed a PR which implements your solution #2:

Additionally, your #1 solution is already partly implemented: most commands can deal (more or less) with missing data and will not fail in general; moreover, the backup command provides self-healing if a missing blob is backed up again.
