Restic failing with id <hex string> not found in repository

Note: This is not the same as Invalid repo: id 00...00 not found in repository the ID I’m getting a failure on is not all zeroes.

Restic Version
restic 0.12.0 compiled with go1.15.8 on linux/amd64

Background
I am backing up to an S3-compatible object store run by an NZ cloud provider.

I’m running restic out of a script that backs up dirvish backups. The script gets run by cron. The script does a ‘restic check’ before it writes anything.

Problem
At some point the filesystem that restic’s cache and temp dirs are in filled up because restic’s cache got unexpectedly huge, and this caused a backup to abort. I got this in the log files:

Save(<data/345ab7bbb9>) returned error, retrying after 552.330144ms: wrote 0 bytes instead of the expected 4216554 bytes
Save(<data/0908acbbf6>) returned error, retrying after 720.254544ms: wrote 0 bytes instead of the expected 4219694 bytes
Save(<data/12b2c28eb2>) returned error, retrying after 582.280027ms: wrote 0 bytes instead of the expected 4234270 bytes
Save(<data/32b051dd65>) returned error, retrying after 468.857094ms: wrote 0 bytes instead of the expected 4277702 bytes
Save(<data/23ea50b473>) returned error, retrying after 462.318748ms: wrote 0 bytes instead of the expected 4258146 bytes
Save(<data/02ecd222db>) returned error, retrying after 593.411537ms: wrote 0 bytes instead of the expected 4263202 bytes
Save(<data/4733e82179>) returned error, retrying after 282.818509ms: wrote 0 bytes instead of the expected 4322388 bytes
Save(<data/28c76aab43>) returned error, retrying after 328.259627ms: wrote 0 bytes instead of the expected 4237903 bytes
Save(<data/3c4e25d1f9>) returned error, retrying after 298.484759ms: wrote 0 bytes instead of the expected 4340312 bytes
Fatal: unable to save snapshot: Copy: write /var/backups/dirvish/spool/restic/cache/48150b9e7e12c6944e65b22a6fc7fd45ad297c2c1da1521be4343cc2676749d1/data/3c/3c4e25d1f9e967cad52875bbea8c914295aeed0a150427c78e366474939e0148: no space left on device

Unfortunately this happened while I was away so the script kept running for a few days until I got back. Restic now has it’s own volume to use and hasn’t filled it up.

After it happened the restic check the script runs was reporting “pack does not exist” but a rebuild-index fixed those problems.

Now I have an “id not found” error whenever I run restic check:

# restic check --cache-dir=/var/backups/restic/cache/
using temporary cache in /var/backups/restic/cache/restic-check-cache-643850715
repository 920b7fbf opened successfully, password is correct
created new cache in /var/backups/restic/cache/restic-check-cache-643850715
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
error for tree c700f213:snapshots
  id c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4 not found in repository
[37:54] 100.00%  205 / 205 snapshots
Fatal: repository contains errors

rebuild-index isn’t helping, and restic prune fails with the following:

# restic prune
repository 920b7fbf opened successfully, password is correct
loading indexes...
loading all snapshots...
finding data that is still in use for 205 snapshots
[1:34] 12.68%  26 / 205 snapshots
id c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4 not found in repository
github.com/restic/restic/internal/repository.(*Repository).LoadBlob
	/restic/internal/repository/repository.go:162
github.com/restic/restic/internal/repository.(*Repository).LoadTree
	/restic/internal/repository/repository.go:728
github.com/restic/restic/internal/restic.loadTreeWorker
	/restic/internal/restic/tree_stream.go:37
github.com/restic/restic/internal/restic.StreamTrees.func1
	/restic/internal/restic/tree_stream.go:164
golang.org/x/sync/errgroup.(*Group).Go.func1
	/home/build/go/pkg/mod/golang.org/x/sync@v0.0.0-20200625203802-6e8e738ad208/errgroup/errgroup.go:57
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1374

I don’t know how to find tree c700f213 it talks about, but I can’t find the id fingerprint in the output of any 'restic list ’ command.

I don’t want to go writing more data to the repository until the error is fixed, but I’m out of ideas about how to fix it so any help would be greatly appreciated.

Thanks.

One fast way would be checking which snapshots that object touches:

restic find --pack c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4

And forgetting the snapshots which depends on it, if it’s acceptable.

@amuckart First you should run a check to see if other pack files are also affected.

Doing more backups won’t make too much harm to the repo, but be prepared that the created backups could be also defect in some way.
With 0.12.0, in contrast, running a backup can even heal your repo from errors if you ran an rebuild-index before. This is IMO the first and preferred way to “heal” a broken repo.

As an alternative, you can use this experimental PR which can modify snapshots such that only missing data is removed from the snapshot(s). In your case whole trees seem to be missing, so repair might as well through away large parts of you snapshot(s).

Ah I forgot about that one. Now reading it thoroughly, I don’t think I’ll suggest forgetting the whole snapshot anymore. Great work :+1:

I tried that but check was failing on the one error and exiting. Is there something else I should have done to see if there were other errors?

Ahh, that is good to know, thank you! I will look at replacing the ‘restic check’ at the beginning of my script with a rebuild-index.

Thank you, that’s a useful looking feature. Unfortunately because this is a production system I can’t easily get away with pulling experimental PRs into it, but because what I’m backing up are dirvish backups, and the script I wrote will recreate any that it things are missing, I can safely just remove the whole snapshot as @gurkan suggested.

Thank you.

Thank you, that got me on the right path. Doing a ‘find --pack’ didn’t find anything, but find --tree did:

restic find --tree c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4
repository 920b7fbf opened successfully, password is correct
Unable to load tree c700f2138c6800e828c52ea351933372d9a19756a58cdca40770c23139167df4
 ... which belongs to snapshot dd0eefb6d31a4f08d30dd925b73279d1bffe5f4ba034f5b661b9684ec7d77471.

Because I’m backing up dirvish backups and the wrapper script I wrote will re-create any backups it thinks are missing from restic, I can safely delete the snapshot, and that’s fixed the issue.

restic check --cache-dir=/var/backups/restic/cache/
using temporary cache in /var/backups/restic/cache/restic-check-cache-427893504
repository 920b7fbf opened successfully, password is correct
created new cache in /var/backups/restic/cache/restic-check-cache-427893504
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
[40:29] 100.00%  204 / 204 snapshots
no errors were found

Thank you for getting me on the right path to the solution!

2 Likes

Ah, you are right. In your case a tree blob was missing. I think we could improve the error handling of check in this case…

Note that rebuild-index does not check the repo as check would. On the other hand there is not much harm in running rebuild-index frequently. It however will

  • lock the repo forbidding any other action (this may be a problem for repos that are used for multiple backups)
  • need to scan (i.e. get the list of) all pack files. (This can be quite time consuming depending on the backend. But for a sane repo and using 0.12.0, rebuild-index should be always faster than a check.)

I was thinking to add the check for packfile existence to the backup command, but discarded it because the scan can be much more time-consuming than the actual backup (and would therefore slow people down who want to do backups very frequently).