Restic failing with id <hex string> not found in repository

amuckart · April 8, 2021, 6:26am

Note: This is not the same as Invalid repo: id 00...00 not found in repository the ID I’m getting a failure on is not all zeroes.

Restic Version
restic 0.12.0 compiled with go1.15.8 on linux/amd64

Background
I am backing up to an S3-compatible object store run by an NZ cloud provider.

I’m running restic out of a script that backs up dirvish backups. The script gets run by cron. The script does a ‘restic check’ before it writes anything.

Problem
At some point the filesystem that restic’s cache and temp dirs are in filled up because restic’s cache got unexpectedly huge, and this caused a backup to abort. I got this in the log files:

Save(<data/345ab7bbb9>) returned error, retrying after 552.330144ms: wrote 0 bytes instead of the expected 4216554 bytes
Save(<data/0908acbbf6>) returned error, retrying after 720.254544ms: wrote 0 bytes instead of the expected 4219694 bytes
Save(<data/12b2c28eb2>) returned error, retrying after 582.280027ms: wrote 0 bytes instead of the expected 4234270 bytes
Save(<data/32b051dd65>) returned error, retrying after 468.857094ms: wrote 0 bytes instead of the expected 4277702 bytes
Save(<data/23ea50b473>) returned error, retrying after 462.318748ms: wrote 0 bytes instead of the expected 4258146 bytes
Save(<data/02ecd222db>) returned error, retrying after 593.411537ms: wrote 0 bytes instead of the expected 4263202 bytes
Save(<data/4733e82179>) returned error, retrying after 282.818509ms: wrote 0 bytes instead of the expected 4322388 bytes
Save(<data/28c76aab43>) returned error, retrying after 328.259627ms: wrote 0 bytes instead of the expected 4237903 bytes
Save(<data/3c4e25d1f9>) returned error, retrying after 298.484759ms: wrote 0 bytes instead of the expected 4340312 bytes
Fatal: unable to save snapshot: Copy: write /var/backups/dirvish/spool/restic/cache/48150b9e7e12c6944e65b22a6fc7fd45ad297c2c1da1521be4343cc2676749d1/data/3c/3c4e25d1f9e967cad52875bbea8c914295aeed0a150427c78e366474939e0148: no space left on device

Unfortunately this happened while I was away so the script kept running for a few days until I got back. Restic now has it’s own volume to use and hasn’t filled it up.

After it happened the restic check the script runs was reporting “pack does not exist” but a rebuild-index fixed those problems.

Now I have an “id not found” error whenever I run restic check:

# restic check --cache-dir=/var/backups/restic/cache/
using temporary cache in /var/backups/restic/cache/restic-check-cache-643850715
repository 920b7fbf opened successfully, password is correct
created new cache in /var/backups/restic/cache/restic-check-cache-643850715
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
error for tree c700f213:snapshots
  id c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4 not found in repository
[37:54] 100.00%  205 / 205 snapshots
Fatal: repository contains errors

rebuild-index isn’t helping, and restic prune fails with the following:

# restic prune
repository 920b7fbf opened successfully, password is correct
loading indexes...
loading all snapshots...
finding data that is still in use for 205 snapshots
[1:34] 12.68%  26 / 205 snapshots
id c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4 not found in repository
github.com/restic/restic/internal/repository.(*Repository).LoadBlob
	/restic/internal/repository/repository.go:162
github.com/restic/restic/internal/repository.(*Repository).LoadTree
	/restic/internal/repository/repository.go:728
github.com/restic/restic/internal/restic.loadTreeWorker
	/restic/internal/restic/tree_stream.go:37
github.com/restic/restic/internal/restic.StreamTrees.func1
	/restic/internal/restic/tree_stream.go:164
golang.org/x/sync/errgroup.(*Group).Go.func1
	/home/build/go/pkg/mod/golang.org/x/sync@v0.0.0-20200625203802-6e8e738ad208/errgroup/errgroup.go:57
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1374

I don’t know how to find tree c700f213 it talks about, but I can’t find the id fingerprint in the output of any 'restic list ’ command.

I don’t want to go writing more data to the repository until the error is fixed, but I’m out of ideas about how to fix it so any help would be greatly appreciated.

Thanks.

gurkan · April 8, 2021, 8:28am

One fast way would be checking which snapshots that object touches:

restic find --pack c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4

And forgetting the snapshots which depends on it, if it’s acceptable.

alexweiss · April 8, 2021, 5:56pm

@amuckart First you should run a check to see if other pack files are also affected.

Doing more backups won’t make too much harm to the repo, but be prepared that the created backups could be also defect in some way.
With 0.12.0, in contrast, running a backup can even heal your repo from errors if you ran an rebuild-index before. This is IMO the first and preferred way to “heal” a broken repo.

As an alternative, you can use this experimental PR which can modify snapshots such that only missing data is removed from the snapshot(s). In your case whole trees seem to be missing, so repair might as well through away large parts of you snapshot(s).

github.com/restic/restic

Add repair command

restic:master ← aawsome:new-repair-command

opened 07:42PM - 05 Aug 20 UTC

aawsome

+900 -107

What does this PR change? What problem does it solve? -------------------------…---------------------------- Allow users to recover from broken repositories/snapshots while still salvaging the sane parts of the repository/snapshot. For given snapshots (selection identical to, e.g., `forget`) the command tries to read all trees and checks if the needed blobs are contained in the index. If blobs are missing or trees cannot be read, it will create new trees and snapshots which only miss these "defect" parts. Those newly generated snapshots can be used to recover needed data. Also, after removing the "defect" snapshots, `prune` is able to clean up the repo again. While this command is able to cause data loss, special care is taken such that the default flags won't do any harm - in fact, users have to explicitly specify `--dry-run=false --delete` to loose data. Output looks like: ``` ./restic -r /home/thinkpad/repo.index-missingblob2/repo repair note: --dry-run is set -> repair will only show what it would do. enter password for repository: repository b270637f opened successfully, password is correct check and repair 1 snapshots <Snapshot f22c6d3a of [/home/thinkpad/data] at 2020-07-09 11:18:26.501071439 +0200 CEST by alex@thinkpad>: removed defect file '/home/thinkpad/data/test' would have modified tree 705adc0d would have modified tree 1ee0d0e1 would have modified tree 98c948be would have modified tree 9d89d7fe would have repaired snpshot f22c6d3a. [0:00] 100.00% 1 / 1 snapshots ``` Depends on #2878 for the troubleshooting docu update. Was the change discussed in an issue or in the forum before? ------------------------------------------------------------ Closes #1759 Closes #1798 Closes #2334 Checklist --------- - [x] I have read the [Contribution Guidelines](https://github.com/restic/restic/blob/master/CONTRIBUTING.md#providing-patches) - [x] I have enabled [maintainer edits for this PR](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork) - [x] I have added tests for all changes in this PR - [x] I have added documentation for the changes (in the manual) - [x] There's a new file in `changelog/unreleased/` that describes the changes for our users (template [here](https://github.com/restic/restic/blob/master/changelog/TEMPLATE)) - [x] I have run `gofmt` on the code in all commits - [x] All commit messages are formatted in the same style as [the other commits in the repo](https://github.com/restic/restic/blob/master/CONTRIBUTING.md#git-commits) - [x] I'm done, this Pull Request is ready for review

gurkan · April 8, 2021, 6:14pm

Ah I forgot about that one. Now reading it thoroughly, I don’t think I’ll suggest forgetting the whole snapshot anymore. Great work

amuckart · April 8, 2021, 8:16pm

I tried that but check was failing on the one error and exiting. Is there something else I should have done to see if there were other errors?

Ahh, that is good to know, thank you! I will look at replacing the ‘restic check’ at the beginning of my script with a rebuild-index.

Thank you, that’s a useful looking feature. Unfortunately because this is a production system I can’t easily get away with pulling experimental PRs into it, but because what I’m backing up are dirvish backups, and the script I wrote will recreate any that it things are missing, I can safely just remove the whole snapshot as @gurkan suggested.

Thank you.

amuckart · April 8, 2021, 8:50pm

gurkan:

One fast way would be checking which snapshots that object touches:
restic find --pack c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4
And forgetting the snapshots which depends on it, if it’s acceptable.

Thank you, that got me on the right path. Doing a ‘find --pack’ didn’t find anything, but find --tree did:

restic find --tree c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4
repository 920b7fbf opened successfully, password is correct
Unable to load tree c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4
 ... which belongs to snapshot dd0eefb6d31a4f08d30dd925b73279d1bffe5f4ba034f5b661b9684ec7d77471.

Because I’m backing up dirvish backups and the wrapper script I wrote will re-create any backups it thinks are missing from restic, I can safely delete the snapshot, and that’s fixed the issue.

restic check --cache-dir=/var/backups/restic/cache/
using temporary cache in /var/backups/restic/cache/restic-check-cache-427893504
repository 920b7fbf opened successfully, password is correct
created new cache in /var/backups/restic/cache/restic-check-cache-427893504
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
[40:29] 100.00%  204 / 204 snapshots
no errors were found

Thank you for getting me on the right path to the solution!

alexweiss · April 9, 2021, 4:55am

Ah, you are right. In your case a tree blob was missing. I think we could improve the error handling of check in this case…

Note that rebuild-index does not check the repo as check would. On the other hand there is not much harm in running rebuild-index frequently. It however will

lock the repo forbidding any other action (this may be a problem for repos that are used for multiple backups)
need to scan (i.e. get the list of) all pack files. (This can be quite time consuming depending on the backend. But for a sane repo and using 0.12.0, rebuild-index should be always faster than a check.)

I was thinking to add the check for packfile existence to the backup command, but discarded it because the scan can be much more time-consuming than the actual backup (and would therefore slow people down who want to do backups very frequently).