We’ve seen a lot of questions recently regarding damaged repositories and I think this calls out two areas where we could improve the project:
Documentation for fixing broken repositories
There doesn’t appear to be any official documentation (that I can find) that deals with how to fix a repository that has become broken. I might do a write-up on these methods and submit a PR to include it in the documentation. There are examples all over the forum, but no central “help me fix my problem” document.
Features for fixing broken repositories
Right now, there are only a few ways that damage can be corrected. In all cases, running `rebuild-index` ASAP is of paramount importance: as long as missing objects are still listed in the index, backup clients will deduplicate against them, and future backups will create broken snapshots.
I would go so far as to say that any “object is missing” error message in restic should be accompanied by this advice: run `rebuild-index` NOW or future backups may be immediately broken.
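To make the failure mode concrete, here is a toy Python model of deduplication against a stale index. It is only a sketch: `backup`, `rebuild_index`, `missing_blobs`, and the dict/set used for packs and index are made up for illustration and are not restic's real data structures or API.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """Content-addressed ID for a chunk (toy stand-in for a blob ID)."""
    return hashlib.sha256(data).hexdigest()


def backup(chunks, index, packs):
    """Upload only chunks the index does not list; return the snapshot's blob IDs."""
    snapshot = []
    for data in chunks:
        bid = blob_id(data)
        if bid not in index:  # deduplication trusts the index, not the packs
            packs[bid] = data
            index.add(bid)
        snapshot.append(bid)
    return snapshot


def rebuild_index(packs):
    """Hypothetical equivalent of rebuild-index: list only what is really stored."""
    return set(packs)


def missing_blobs(snapshot, packs):
    """Blob IDs a snapshot references that are not actually stored."""
    return [bid for bid in snapshot if bid not in packs]


chunks = [b"alpha", b"beta"]
packs, index = {}, set()
backup(chunks, index, packs)

# Simulate damage: the data is gone, but the index still lists it.
del packs[blob_id(b"beta")]

# Backup with the stale index: "beta" is skipped, so the new snapshot is broken too.
stale_snapshot = backup(chunks, index, packs)
broken = missing_blobs(stale_snapshot, packs)

# After rebuilding the index, the next backup re-uploads "beta" and heals the damage.
index = rebuild_index(packs)
healed_snapshot = backup(chunks, index, packs)
healed = missing_blobs(healed_snapshot, packs)
```

In this model the backup taken before the index rebuild references a blob that no longer exists, while the one taken afterwards is complete.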
The simplest fix is to drop snapshots that refer to missing objects, but this is taking a sledgehammer to the problem when tweezers might be more appropriate. This is particularly unhelpful when the missing objects appear in every snapshot – we can’t feasibly recommend that people delete everything.
Another option is to leave the problem alone if the missing data is still present on at least one backup client, because a future backup operation could restore the object. This is the ideal resolution, but becomes problematic if the damage is not quickly corrected, because `prune` will refuse to operate on the repository.
If the repository has been mirrored off-site, then it’s possible to find the pack containing the missing object on the mirror, copy it to the broken repository, and run `rebuild-index` to heal it.
If the object exists in another repository that does not share the master key, that repository can still be used to fix the damage, but not directly: one has to restore the files/directories that reference the same objects from the other repository and then back them up into the broken repository. If a tree is missing, this relies on the tree object being recreated exactly, so that it has the same hash.
Finally, if none of this is possible, we need a way to do one of two things:
1. Permanently acknowledge to restic that we know that a specific set of objects is missing, such that restic will not consider their absence to be an error.
   - `check` and `prune` should not abort due to the missing object, though `check` might display a non-fatal diagnostic confirming the absence of the objects.
   - If any acknowledged-missing object is re-added to the repository by a future backup operation, it should be removed from the acknowledged-missing list.
   - If a `dump` operation needs a missing object, it should still fail because it cannot complete.
   - If a `restore` operation needs a missing object, it should emit a warning and skip over the affected tree/file. (Perhaps an optional flag could instruct `restore` to fail anyway, if this behavior is desired.)
2. Cut the missing objects out of all snapshots by rebuilding them into new snapshots, omitting the files/directories that need the missing objects. The old snapshots would be automatically forgotten. (This would be similar to a `git rebase` or `git commit --amend` operation; the ID of the snapshots would be permanently changed, but there is already precedent in restic for doing this. See `restic tag`.)
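The behavior proposed for the first option can be sketched as a small Python model. Everything here is hypothetical and only illustrates the intended semantics: `ToyRepo`, `acknowledge_missing`, the `strict` flag, and `MissingObjectError` do not exist in restic.

```python
class MissingObjectError(Exception):
    """Raised when an operation cannot complete without a missing object."""


class ToyRepo:
    def __init__(self, objects):
        self.objects = dict(objects)       # object ID -> data
        self.acknowledged_missing = set()  # the proposed new repository concept

    def acknowledge_missing(self, ids):
        # Only objects that really are absent can be acknowledged.
        self.acknowledged_missing |= {i for i in ids if i not in self.objects}

    def add_object(self, oid, data):
        # A future backup re-adding the object removes it from the list.
        self.objects[oid] = data
        self.acknowledged_missing.discard(oid)

    def check(self, referenced):
        """Return (warnings, errors); acknowledged objects are non-fatal."""
        warnings, errors = [], []
        for oid in referenced:
            if oid in self.objects:
                continue
            (warnings if oid in self.acknowledged_missing else errors).append(oid)
        return warnings, errors

    def dump(self, blob_ids):
        # dump still fails: it cannot produce the file without every blob.
        for oid in blob_ids:
            if oid not in self.objects:
                raise MissingObjectError(oid)
        return b"".join(self.objects[oid] for oid in blob_ids)

    def restore(self, files, strict=False):
        """Restore what is possible; skip damaged files unless strict is set."""
        restored, skipped = {}, []
        for name, blob_ids in files.items():
            if any(oid not in self.objects for oid in blob_ids):
                if strict:
                    raise MissingObjectError(name)
                skipped.append(name)  # would be reported as a warning
                continue
            restored[name] = b"".join(self.objects[oid] for oid in blob_ids)
        return restored, skipped


repo = ToyRepo({"b1": b"hello "})
repo.acknowledge_missing(["b2"])
files = {"ok.txt": ["b1"], "damaged.txt": ["b1", "b2"]}

warnings, errors = repo.check(["b1", "b2"])
restored, skipped = repo.restore(files)
```

The point of the sketch is that the acknowledged-missing list only changes how absence is reported; it never pretends the data exists, which is why `dump` and a strict `restore` still fail.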
Each approach has pros and cons.
#1 pros: there is the possibility for future backups to fix the damage; snapshot IDs don’t change.
#1 cons: this change would probably be pretty invasive and touch a lot of the code that reads from a repository; we need a new repository concept to record acknowledged-missing objects.
#2 pros: this approach is very straightforward and requires no change to existing restic code because it would only need to add a new command.
#2 cons: there is no possibility of a future backup healing the damage because the rewritten snapshots are forgotten and permanently altered; snapshot IDs would change, which could confuse systems external to restic that track restic snapshots.
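For comparison, the second option could look roughly like the following Python sketch. The snapshot layout (a dict with `time` and `files`), `snapshot_id`, and `rewrite_all` are assumptions made up for illustration; real snapshots reference trees rather than a flat file list.

```python
import hashlib
import json


def snapshot_id(snapshot):
    """Toy content-derived ID: any change to the snapshot changes its ID."""
    encoded = json.dumps(snapshot, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def rewrite_snapshot(snapshot, available):
    """Copy a snapshot, omitting files that need an unavailable object."""
    kept = {
        path: blob_ids
        for path, blob_ids in snapshot["files"].items()
        if all(oid in available for oid in blob_ids)
    }
    return {**snapshot, "files": kept}


def rewrite_all(snapshots, available):
    """Replace every snapshot with its rewritten form; old IDs are forgotten."""
    rewritten = {}
    for snap in snapshots.values():
        new_snap = rewrite_snapshot(snap, available)
        rewritten[snapshot_id(new_snap)] = new_snap
    return rewritten


snap_a = {"time": "t1", "files": {"a.txt": ["b1"], "b.txt": ["b2"]}}
snap_b = {"time": "t2", "files": {"a.txt": ["b1"]}}
snapshots = {snapshot_id(s): s for s in (snap_a, snap_b)}

# Object "b2" is missing, so only "b1" is available.
rewritten = rewrite_all(snapshots, available={"b1"})
```

Here the undamaged snapshot keeps its ID, while the damaged one gets a new ID and permanently loses `b.txt`, which is exactly the trade-off listed under the cons.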
I welcome feedback on all of the above, particularly from @fd0 as he’ll have to approve the PRs.