Recovery options for damaged repositories

#1

We’ve seen a lot of questions recently regarding damaged repositories and I think this calls out two areas where we could improve the project:

Documentation for fixing broken repositories

There doesn’t appear to be any official documentation (that I can find) that deals with how to fix a repository that has become broken. I might do a write-up on these methods and submit a PR to include it in the documentation. There are examples all over the forum, but no central “help me fix my problem” document.

Features for fixing broken repositories

Right now, there are only a few ways that damage can be corrected. In all cases, running rebuild-index ASAP is of paramount importance. Until the index is rebuilt, it still lists the missing objects, so backup clients deduplicate against them and future backups create broken snapshots.

I would go so far as to say that any “object is missing” error message in restic should be accompanied by this advice: run rebuild-index NOW or future backups may be immediately broken.
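To illustrate why a stale index is dangerous, here is a toy model in Python. It is not restic's actual code, and `ToyRepo` and its methods are hypothetical names. The backup step decides what to upload by consulting the index only, so a blob that is listed but no longer stored ends up referenced by new snapshots until the index is rebuilt from what is really on disk.

```python
import hashlib


class ToyRepo:
    """Toy model of a deduplicating repository (hypothetical, not restic code)."""

    def __init__(self):
        self.storage = {}   # blob ID -> data actually present on disk
        self.index = set()  # blob IDs the index claims are present

    def backup(self, chunks):
        """Store chunks, skipping any the index already lists; return a snapshot."""
        snapshot = []
        for data in chunks:
            blob_id = hashlib.sha256(data).hexdigest()
            if blob_id not in self.index:  # dedup decision uses the index only
                self.storage[blob_id] = data
                self.index.add(blob_id)
            snapshot.append(blob_id)
        return snapshot

    def rebuild_index(self):
        """Recreate the index from what is really stored."""
        self.index = set(self.storage)

    def missing(self, snapshot):
        """Return blob IDs referenced by the snapshot but absent from storage."""
        return [b for b in snapshot if b not in self.storage]


repo = ToyRepo()
first = repo.backup([b"alpha", b"beta"])
del repo.storage[first[1]]                 # simulate a lost pack

stale = repo.backup([b"alpha", b"beta"])   # index still lists the lost blob
broken_before = repo.missing(stale)        # new snapshot is broken on arrival

repo.rebuild_index()
fresh = repo.backup([b"alpha", b"beta"])   # the blob is uploaded again
broken_after = repo.missing(fresh)
```

In this model the backup taken after the rebuild also heals the first snapshot, because the lost blob is stored again under the same ID.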

The simplest fix is to drop snapshots that refer to missing objects, but this is taking a sledgehammer to the problem when tweezers might be more appropriate. This is particularly unhelpful when the missing objects appear in every snapshot – we can’t feasibly recommend that people delete everything.

Another option is to let the problem be if the missing data is still present on at least one backup client, because a future backup operation could restore the object. This is the ideal resolution, but becomes problematic if the damage is not quickly corrected, because prune will refuse to operate on the repository.

If the repository has been mirrored off-site, then it’s possible to find the pack containing the missing object on the mirror, copy it to the broken repository, and then run rebuild-index to heal it.
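A minimal sketch of the copy step, assuming both the mirror and the damaged repository are plain directories using the local layout `data/<first two hex characters>/<pack ID>`. The function name and arguments are made up for illustration; for remote backends the copy would go through whatever tool manages that storage.

```python
import shutil
from pathlib import Path


def copy_pack_from_mirror(mirror_root, repo_root, pack_id):
    """Copy one pack file from a mirror into the damaged repository.

    Assumes the local layout data/<first two hex chars>/<pack ID>.
    Returns the destination path, or None if the mirror lacks the pack.
    """
    relative = Path("data") / pack_id[:2] / pack_id
    source = Path(mirror_root) / relative
    target = Path(repo_root) / relative
    if not source.is_file():
        return None
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)  # keeps the file contents byte-for-byte
    return target
```

After the copy, rebuild-index still has to be run on the damaged repository so that the index matches the packs on disk.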

If the object exists in another repository that does not share the master key, that repository can still be used to fix the damage, but not directly. One has to restore the files/directories that reference the same objects from the other repository and then back them up into the broken repository. If a tree is missing, this relies on the tree object being recreated exactly, so that it has the same hash.
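The catch with trees can be shown with a simplified model of content addressing. The node fields and the JSON serialization below are assumptions for illustration, not restic's exact format. The point is only that an object's ID is a hash of its serialized content, so a tree comes back under the same ID only if every recorded field matches.

```python
import hashlib
import json


def object_id(payload: bytes) -> str:
    """Content-addressed ID: SHA-256 of the object's bytes."""
    return hashlib.sha256(payload).hexdigest()


def tree_id(nodes):
    """ID of a simplified tree: hash of its canonical JSON serialization."""
    serialized = json.dumps(nodes, sort_keys=True, separators=(",", ":"))
    return object_id(serialized.encode("utf-8"))


original = [
    {"name": "notes.txt", "mtime": 1500000000, "content": [object_id(b"hello")]}
]
restored_same = [dict(original[0])]                       # metadata preserved
restored_touched = [dict(original[0], mtime=1500000001)]  # mtime changed by one second

same_tree = tree_id(restored_same) == tree_id(original)
changed_tree = tree_id(restored_touched) == tree_id(original)
```

Data blobs are less fragile in this respect, since their ID depends only on the file content, while a tree also depends on the metadata recorded for each entry.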

Finally, if none of this is possible, we need a way to do one of two things:

  1. Permanently acknowledge to restic that we know that a specific set of objects is missing, such that restic will not consider their absence to be an error.
    • check and prune should not abort due to the missing object, though check might display a non-fatal diagnostic confirming the absence of the objects.
    • If any acknowledged-missing object is re-added to the repository by a future backup operation, it should be removed from the acknowledged-missing list.
    • If a dump operation needs a missing object, it should still fail because it cannot complete.
    • If a restore operation needs a missing object, it should emit a warning and skip over the affected tree/file. (Perhaps an optional flag could instruct restore to fail anyway, if this behavior is desired.)
  2. Cut the missing objects out of all snapshots by rebuilding them into new snapshots, omitting the files/directories that need the missing objects. The old snapshots would be automatically forgotten. (This would be similar to a git rebase or git commit --amend operation; the ID of the snapshots would be permanently changed, but there is already precedent in restic for doing this. See restic tag.)
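The behavior proposed for the first approach can be sketched as follows. This is a rough model only: `AcknowledgedMissing`, `check`, and `restore` are hypothetical names, and files are reduced to lists of blob IDs.

```python
class AcknowledgedMissing:
    """Hypothetical record of blob IDs the user has accepted as lost."""

    def __init__(self):
        self.ids = set()

    def acknowledge(self, blob_id):
        self.ids.add(blob_id)

    def on_blob_added(self, blob_id):
        # A later backup re-added the blob, so it is no longer missing.
        self.ids.discard(blob_id)


def check(referenced, stored, acked):
    """Return (fatal, warnings): only unacknowledged gaps are fatal."""
    absent = set(referenced) - set(stored)
    fatal = sorted(absent - acked.ids)
    warnings = sorted(absent & acked.ids)
    return fatal, warnings


def restore(files, stored, acked, strict=False):
    """Restore complete files; skip files that lack only acknowledged blobs.

    With strict=True, any missing blob aborts the restore.
    """
    restored, skipped = [], []
    for name, blobs in files.items():
        absent = [b for b in blobs if b not in stored]
        if not absent:
            restored.append(name)
        elif strict or any(b not in acked.ids for b in absent):
            raise RuntimeError(f"cannot restore {name}: missing {absent}")
        else:
            skipped.append(name)
    return restored, skipped


stored = {"a", "b"}
files = {"ok.txt": ["a"], "damaged.txt": ["b", "c"]}
acked = AcknowledgedMissing()
acked.acknowledge("c")

fatal, warnings = check(["a", "b", "c"], stored, acked)
restored, skipped = restore(files, stored, acked)
```

A dump operation would behave like `restore` with `strict=True`, since it cannot produce partial output.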

Each approach has pros and cons.

#1 pros: there is the possibility for future backups to fix the damage; snapshot IDs don’t change.

#1 cons: this change would probably be pretty invasive and touch a lot of the code that reads from a repository; we need a new repository concept to record acknowledged-missing objects.

#2 pros: this approach is very straightforward and requires no change to existing restic code because it would only need to add a new command.

#2 cons: there is no possibility of a future backup healing the damage because the original snapshots are forgotten and their replacements permanently omit the affected files; snapshot IDs would change, which could confuse systems external to restic that track restic snapshots.
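For comparison, the second approach amounts to a single pass over the snapshots. The sketch below uses made-up data shapes (a snapshot is a mapping from path to blob IDs, and its ID is a hash of that mapping) to show both the rewrite and why the IDs change.

```python
import hashlib
import json


def snapshot_id(files):
    """Toy snapshot ID: hash of the snapshot's serialized contents."""
    serialized = json.dumps(files, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def rewrite_snapshots(snapshots, stored):
    """Drop files that need missing blobs.

    Returns (new_snapshots, forgotten_ids). Undamaged snapshots keep their
    ID; damaged ones are replaced by a rewritten copy under a new ID.
    """
    result, forgotten = {}, []
    for old_id, files in snapshots.items():
        kept = {
            path: blobs
            for path, blobs in files.items()
            if all(blob in stored for blob in blobs)
        }
        if kept == files:
            result[old_id] = files
        else:
            result[snapshot_id(kept)] = kept
            forgotten.append(old_id)
    return result, forgotten
```

Once the rewritten copies replace the originals, nothing references the missing blobs any more, which is why a later backup cannot bring the dropped files back into those snapshots.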

I welcome feedback on all of the above, particularly from @fd0 as he’ll have to approve the PRs. :slight_smile:


#2

Great ideas here, thanks for posting them! I have something else to add: We need to find out why pack files go missing. It feels to me that with the number of reports, there must be a bug. But I don’t have any idea where to start debugging, because people usually come to us when it’s already too late. So I’m thinking about adding a “log file” feature, something similar to what git reflog does, which records e.g. the result of a backup or prune operation (which packs were added and removed). Maybe this will help us to confirm that there is a bug…
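One possible shape for such a log, purely as a sketch: an append-only file with one JSON record per operation. The file format, field names, and both functions are assumptions for illustration and not an existing restic feature.

```python
import json
import time


def append_log_entry(log_path, operation, packs_added, packs_removed):
    """Append one record describing which packs an operation added or removed."""
    entry = {
        "time": time.time(),
        "operation": operation,
        "packs_added": sorted(packs_added),
        "packs_removed": sorted(packs_removed),
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def last_operation_touching(log_path, pack_id):
    """Return the most recent entry that added or removed the pack, or None."""
    found = None
    with open(log_path, encoding="utf-8") as handle:
        for line in handle:
            entry = json.loads(line)
            if pack_id in entry["packs_added"] or pack_id in entry["packs_removed"]:
                found = entry
    return found
```

With a record like this, a report of a missing pack could be checked against the log: if the last entry touching the pack is a prune that removed it, the cause is inside restic; if the log shows it was only ever added, the loss happened elsewhere.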


#3

I’m not totally convinced of that yet, as many of the reports of damage have been accompanied by acknowledgements that the reporter either accidentally deleted pack files manually, noticed filesystem corruption, noticed disk failure, or noticed other kinds of hardware failure that could have contributed to a corrupt pack.

(Note that I’m not opposed to adding such a log; it would be a helpful diagnostic feature. I’m just not sure that the missing packs point to a restic bug yet. Other explanations seem more likely to me.)
