Error while trying to prune a huge repo

dantefff · February 24, 2021, 5:12pm

Hello everyone,
First of all, thanks for this nice piece of software and for the help the community gives to everyone.

The environment:

I was on restic 0.9.6 and upgraded to restic 0.12.0 to be able to do a more efficient prune.
I am running prune locally. I back up to a remote machine, but I have shell access and restic on that machine too, so I decided to prune locally there (I prefer to spend resources on the machine where the repo is located).
The repo is quite big: almost 8T of data on disk.
I have already run a forget command.

The command (executed from the directory where the repo is located):
restic -r . --cache-dir /foo/bar prune -vv

The output:
repository XXXX opened successfully, password is correct
created new cache in /foo/bar
loading indexes…
loading all snapshots…
finding data that is still in use for 181 snapshots
[3:33:13] 100.00% 181 / 181 snapshots
searching used packs…
collecting packs for deletion and repacking
will remove pack 8c41d0dd as it is unused and not indexed
will remove pack 8ce999df as it is unused and not indexed
[…]
will remove pack f3c27e3b as it is unused and not indexed
pack ec908d7a: calculated size 1744769 does not match real size 4224015
[3:32] 51.58% 894714 / 1734614 packs processed
Fatal: pack size does not match calculated size from index

I’m not sure how to continue from here. Is this a corrupted repository?
Can I do something to repair it?

Any help is appreciated

akrabu · February 24, 2021, 9:19pm

I’d try

restic rebuild-index

then

restic check --read-data

If you find errors, you can, for example, search for the affected trees by doing:

restic find --tree ABCD1234

It will then print out all the snapshots that reference that tree. You can then restic forget the snapshot IDs. Afterwards, your database should pass another restic check --read-data and you’re safe to prune.

alexweiss · February 24, 2021, 9:55pm

@dantefff You have a pack file whose size, as calculated from the index entries, differs from its actual size (as reported by listing the files in your repository). The file should be located in /data/ec/ec908d7a....

Unfortunately, this file is needed (else prune would simply delete it), so yes, your repo is corrupt.

Before trying to repair, you should run a restic check (and if access to your repo is cheap, even with --read-data) to see what error that reports. It should also report at least the same file size mismatch.

If access to your repo is expensive, manually download those corrupt files and run a sha256sum to check if the file is really corrupt. This is automatically done if you run check with --read-data.
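The manual check described above can be sketched as a small shell function. It assumes the usual repository layout of data/<first two hex characters>/<full pack ID> and that a pack file's name is the SHA-256 of its contents; `verify_pack`, `repo_dir` and `pack_id` are hypothetical names used only for this sketch.

```shell
#!/bin/sh
# Compare a pack file's SHA-256 with its file name.
# Assumption: packs live at <repo>/data/<first two hex chars>/<full pack ID>.
verify_pack() {
    repo_dir=$1
    pack_id=$2    # full 64-character ID, not the 8-character short form
    prefix=$(printf '%s' "$pack_id" | cut -c1-2)
    pack_file="$repo_dir/data/$prefix/$pack_id"

    if [ ! -f "$pack_file" ]; then
        echo "missing: $pack_file"
        return 2
    fi

    # sha256sum prints "<hash>  <file>"; keep only the hash.
    actual=$(sha256sum "$pack_file" | cut -d' ' -f1)

    if [ "$actual" = "$pack_id" ]; then
        echo "ok: $pack_id"
    else
        echo "MISMATCH: $pack_id hashes to $actual"
        return 1
    fi
}
```

It would be called as `verify_pack /path/to/repo <full pack ID>`. A matching hash only shows that the file is intact; it says nothing about whether the index describes the file correctly.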

Running rebuild-index helps if those files are valid but the index isn’t correct (for whatever reason). If the files are not valid, it “just” helps to remove them and the referenced blobs from the index. Hence, if the pack files are corrupt, you should still see errors during a check (now errors that blobs are missing).

Running check --read-data is always a good idea, but note that --read-data downloads all files. If this is expensive, I wouldn’t do that for a large repo. As written above, you can download suspicious pack files and manually check the sha256. There is also

github.com/restic/restic

check: Add option `--read-data-from`

restic:master ← aawsome:check-read-data-from

opened 01:23PM - 30 Dec 20 UTC

aawsome

+94 -0

What does this PR change? What problem does it solve?
Adds an option to `check` to read the pack files to be checked from a file. This is handy to control which files are actually read by `check`. It also allows checking only a few pack files in troubleshooting cases without needing to download all pack files (which might be very expensive and time-consuming).

Was the change discussed in an issue or in the forum before?
See #3202. In the forum there was a discussion (don't find it right now) where a user wanted to troubleshoot some corrupt pack file but declined to run a full `check --read-data` as this would cost too much...

which allows you to only give specific files to check.

An even better first step is to check whether you can redo your backups for those snapshots. If you have run a rebuild-index and blobs are now missing, those blobs will be added again when you run a backup, provided they are still present in some files on your hard disk.

If that is not the case, then your repo is corrupt and cannot be completely repaired without losing some data. You can forget the affected snapshots or use this not-yet-reviewed PR, which will find the snapshots for you and try to salvage as much data as possible from the affected snapshots:

github.com/restic/restic

Add repair command

restic:master ← aawsome:new-repair-command

opened 07:42PM - 05 Aug 20 UTC

aawsome

+900 -107

What does this PR change? What problem does it solve?
Allow users to recover from broken repositories/snapshots while still salvaging the sane parts of the repository/snapshot. For given snapshots (selection identical to, e.g., `forget`) the command tries to read all trees and checks if the needed blobs are contained in the index. If blobs are missing or trees cannot be read, it will create new trees and snapshots which only miss these "defect" parts. Those newly generated snapshots can be used to recover needed data. Also, after removing the "defect" snapshots, `prune` is able to clean up the repo again. While this command is able to cause data loss, special care is taken such that the default flags won't do any harm - in fact, users have to explicitly specify `--dry-run=false --delete` to lose data. Output looks like:

```
./restic -r /home/thinkpad/repo.index-missingblob2/repo repair
note: --dry-run is set -> repair will only show what it would do.
enter password for repository:
repository b270637f opened successfully, password is correct
check and repair 1 snapshots
<Snapshot f22c6d3a of [/home/thinkpad/data] at 2020-07-09 11:18:26.501071439 +0200 CEST by alex@thinkpad>:
removed defect file '/home/thinkpad/data/test'
would have modified tree 705adc0d
would have modified tree 1ee0d0e1
would have modified tree 98c948be
would have modified tree 9d89d7fe
would have repaired snpshot f22c6d3a.
[0:00] 100.00% 1 / 1 snapshots
```

Depends on #2878 for the troubleshooting documentation update.

Was the change discussed in an issue or in the forum before?
Closes #1759. Closes #1798. Closes #2334.

dantefff · February 25, 2021, 9:10am

Thanks a lot @akrabu and @alexweiss for your instructions.
If I have understood right:

As I have direct local access to the repository, I can easily run sha256sum on the ec908d7a… file (indeed, the sha256sum looks OK to me: it is the same as the file name).
As access to my repo is cheap (I can do it from an in-house local machine), I’m now running restic -r . --cache-dir /foo/bar check --read-data
The restic check --read-data will find index errors and also data integrity errors (sha256sum mismatches).
If only index errors are encountered, would a restic rebuild-index be enough to sanitize the repo?
If there are also data integrity errors, after a restic rebuild-index I can do a new backup, and if the blobs are still in the original data they will be used to rebuild the damaged backups. Is this right?
After making a new backup, if there are still some missing blobs (how can I see that? with a new restic check?) I can try to recover the damaged data with PR aawsome:new-repair-command or, if I can live with it, forget the damaged snapshots located with restic find --tree ec908d7a

Is this right?

alexweiss · February 26, 2021, 6:32am

Doing a check --read-data is even better in your situation

check --read-data reads all pack files and does even more checks than only a SHA256. It also decrypts all files, checks the blob hashes and compares them with the index.

In general, if check reports errors you do not expect, you should first try to find the root cause for them. They might indicate hardware problems or other severe things you definitely want to check out before continuing to rely on your backups!

rebuild-index does what the name stands for: It rebuilds the index. The point is that check can do many checks only if the index correctly represents the pack files. So after rebuild-index, always run another check.

Not exactly. If you have blobs missing from the index those will be re-saved during a backup run. So if the only errors that remain after rebuild-index can be healed by re-saving blobs (and those blobs are still generated by the data to backup), your repository will be healed. In this case, backup will print out some warnings. Also make sure you again run another check after that backup you assume should heal the repo.

Exactly. check will report missing blobs, and once you remove the snapshots that need those blobs, your repo is in a sane state. The PR creates new snapshots which only rely on blobs that are present in the index and can remove (if you explicitly specify it) “defect” snapshots.

After you reached a sane state (and made sure that nothing is missing), you can run a prune to remove remaining unused data.
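The order of operations described above (rebuild the index, check again, prune only once the check is clean) could be wrapped in a small helper. This is only a sketch: `repair_then_prune` and the `RESTIC` variable are hypothetical names, the repository password is assumed to be supplied through restic's usual mechanisms, and the optional backup run that can re-save missing blobs is left out.

```shell
#!/bin/sh
# RESTIC can be overridden to point at a different binary or a stub.
RESTIC=${RESTIC:-restic}

# Hypothetical helper: rebuild the index, verify, and prune only if check passes.
repair_then_prune() {
    repo=$1

    "$RESTIC" -r "$repo" rebuild-index || return 1

    # check exits non-zero when it finds errors, so prune is not reached then.
    "$RESTIC" -r "$repo" check || {
        echo "check still reports errors; not pruning" >&2
        return 1
    }

    "$RESTIC" -r "$repo" prune
}
```

It would be called as `repair_then_prune /path/to/repo`. Where reading the whole repository is cheap, `check` could be given `--read-data` as in the posts above.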

dantefff · March 1, 2021, 9:05am

Thanks a lot for your explanation. Just for completion, my repository only encountered that pack size mismatch error.

As you suggested in your comment, I ran a disk check and everything looks good to me. Could the pack size mismatch error have been caused by network issues?

Anyway, I rebuilt the index, ran another check, and now I’m pruning as expected. Thanks a lot.

MichaelEischer · March 4, 2021, 6:53pm

The interesting part about that error message is that the index did not contain all blobs which exist in the pack file (the calculated size is less than the real size). So it’s not the pack file which is incomplete but rather the repository index. In restic 0.9.6 it was possible that a part of the blobs of a pack file are listed in one index and the second part in another index. So maybe here only the first index was uploaded and the backup got interrupted afterwards.

I’m not sure whether that’s what has happened here, but it would be a possible scenario.
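The size comparison discussed above can be made explicit with a short shell sketch. It parses a message in the form printed by prune in the first post and reports which side is smaller. `describe_mismatch` is a hypothetical helper, and the wording of its output is this sketch's own, not restic's.

```shell
#!/bin/sh
# Parse a line of the form:
#   pack <id>: calculated size <n> does not match real size <m>
describe_mismatch() {
    line=$1
    pack=$(printf '%s\n' "$line" | sed -n 's/^pack \([0-9a-f]*\): calculated size .*/\1/p')
    calc=$(printf '%s\n' "$line" | sed -n 's/.*calculated size \([0-9]*\) .*/\1/p')
    real=$(printf '%s\n' "$line" | sed -n 's/.*real size \([0-9]*\)$/\1/p')

    if [ -z "$pack" ] || [ -z "$calc" ] || [ -z "$real" ]; then
        echo "not a size mismatch line"
        return 1
    fi

    if [ "$calc" -lt "$real" ]; then
        # The index entries add up to less than the file on disk.
        echo "pack $pack: index accounts for $((real - calc)) bytes fewer than the file holds"
    elif [ "$calc" -gt "$real" ]; then
        # The file on disk is shorter than the index entries imply.
        echo "pack $pack: file is $((calc - real)) bytes smaller than the index expects"
    else
        echo "pack $pack: sizes agree"
    fi
}
```

For the line from the first post, `describe_mismatch 'pack ec908d7a: calculated size 1744769 does not match real size 4224015'` prints `pack ec908d7a: index accounts for 2479246 bytes fewer than the file holds`, which matches the reading that the index, not the pack file, is incomplete.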