Repository broken - id not found in repository

Hi there,

I know there are other topics covering similar cases, but I didn't find one matching my exact situation, so here I go with this thread.

restic prune is not working; it returns an "id not found in repository" error.

repository 2461be36 opened (version 1)
loading indexes...
loading all snapshots...
finding data that is still in use for 30 snapshots
[0:00] 0.00%  0 / 30 snapshots
id 01687380b40c74ec0dcebc50acaffec730ac0c4a335265272e9a1b6eb542baa1 not found in repository
github.com/restic/restic/internal/repository.(*Repository).LoadBlob
	/restic/internal/repository/repository.go:274
github.com/restic/restic/internal/restic.LoadTree
	/restic/internal/restic/tree.go:113
github.com/restic/restic/internal/restic.loadTreeWorker
	/restic/internal/restic/tree_stream.go:36
github.com/restic/restic/internal/restic.StreamTrees.func1
	/restic/internal/restic/tree_stream.go:176
golang.org/x/sync/errgroup.(*Group).Go.func1
	/home/build/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1598

We tried the following commands (in chronological order):

restic check returned lots of "file or directory not found" errors.
restic rebuild-index didn't help.
restic prune again to check, but it did nothing.
restic backup --force did nothing new.
restic prune to see if the forced backup helped, but nope…
restic rebuild-index --read-all-packs didn't help either.
restic prune again, nothing worked.

The server is a 24x7 online server in a data centre (CPD) that makes daily backups; one of them is copied with restic from a remote server (using rest-server for repo access via -r).
We've checked the server, dmesg and smartctl, and we see no errors at all. It doesn't look like a problem in the fs or hardware.

Does anyone have any clue what's going on? How can I get rid of these issues?

Also, in case there's nothing to be done, how can I remove all the snapshots in one command to start from scratch with the same repository?

Thanks in advance.

When reporting a problem, please include the relevant info such as the exact command you run including any env vars it uses, and in this case I think the output from the check command would be good to look at. Also include obvious things like what restic and rest-server versions you run :slight_smile:

Hi,

My apologies, replying inline:

Those commands are run on the restic server, so it's basically:

restic -r /path/to/repo --password-file /path/to/password/file prune
restic -r /path/to/repo --password-file /path/to/password/file rebuild-index
restic -r /path/to/repo --password-file /path/to/password/file rebuild-index --read-all-packs
…etc, the same list of commands I’ve added before. I’m not adding anything else.

lots of:

   id xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx not found in repository

and lots of

error for tree xxxx
  tree  xxxxxxxxxxx file yyyy not found in index

and then

Fatal: repository contains errors
restic version
restic 0.15.2 compiled with go1.20.3 on linux/amd64

rest-server running on docker restic/rest-server:0.12.0

Did you notice any errors reported in prior restic runs (before the failure) or did something unexpected happen? Maybe an interrupted network connection?

Do you have logs of the backup and prune before the first failed prune? Did restic report any errors there?

How often do you run backup, check and/or prune?

Is the repository used by multiple hosts or just a single host?

Do you use the unlock command or is it called automatically from your backup scripts?

What did the first rebuild-index run print?

The problem is that some tree blobs are missing, so prune cannot determine which data is still needed and which is not.

Please try to locate the affected snapshot(s) by running restic find --tree 01687380b40c74ec0dcebc50acaffec730ac0c4a335265272e9a1b6eb542baa1.

Judging from the amount of missing tree blobs, at least one file in the data folder of the repository was lost. It could affect either a single or multiple snapshots.

To repair the repository, you either have to delete the broken snapshots (which means finding out which ones are affected first; see Recover from broken pack file · Issue #828 · restic/restic · GitHub, Route 2), or you can use a beta version of restic (restic beta releases) which can automatically repair damaged snapshots by removing the broken parts (see Troubleshooting — restic 0.16.3 documentation).
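For illustration, a rough sketch of that route using the command names from restic 0.16.x (the repository path and password file are placeholders; on 0.15.x the snapshot repair step is only available in the beta builds):

restic -r /path/to/repo --password-file /path/to/password/file find --tree 01687380b40c74ec0dcebc50acaffec730ac0c4a335265272e9a1b6eb542baa1
restic -r /path/to/repo --password-file /path/to/password/file repair index
restic -r /path/to/repo --password-file /path/to/password/file repair snapshots --forget
restic -r /path/to/repo --password-file /path/to/password/file prune

The find command lists the snapshots that reference the missing tree blob, repair snapshots rewrites those snapshots without the broken parts, and the final prune removes data that is no longer referenced.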

Hi,

Sorry, I didn't read your reply until today. I came back because the repository got broken again.

Replying inline:

Did you notice any errors reported in prior restic runs (before the failure) or did something unexpected happen? Maybe an interrupted network connection?

We didn't see any errors or get any alarms related to these servers, but network issues are very common with the provider hosting them (both the rest-server and the restic client).

Do you have logs of the backup and prune before the first failed prune? Did restic report any errors there?
We're not monitoring restic's output messages, so maybe there was a warning in its output, but the Bacula jobs that run the restic script complete with no errors. If the bash script that runs restic got an error, it should fail the job on the Bacula side, if I have all the wiring correct, of course.

How often do you run backup, check and/or prune?
backup: daily
forget: daily (keep-last 20 snapshots)
prune: daily (daily cleanup)

check: when I see there are errors.

Is the repository used by multiple hosts or just a single host?

Just a single host.

Do you use the unlock command or is it called automatically from your backup scripts?

That's what I'm afraid could be the cause… We unlock the repository every time with the force parameter; otherwise there's no way to run prune.

Basically, Bacula runs the Bacula job and there's a post-action (which only happens when the main job finishes successfully) that runs the forget keep-last-20 and the prune. In order to allow restic to do the forget and the prune, I have to add the unlock force; otherwise there's no way I can perform any other action on the repository. I'm afraid that could be the problem: something could be wrong with restic / rest-server, or maybe I misunderstood how to use restic.
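Roughly, the post-action first unlocks the repository (with the "force" option) and then runs something like this (paths are placeholders):

restic -r /path/to/repo --password-file /path/to/password/file forget --keep-last 20
restic -r /path/to/repo --password-file /path/to/password/file prune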

What did the first rebuild-index run print?

The repository is huge (around 35+ TB), so checks and rebuild-index take a very long time. Actually, this last time the repository got broken, the check took 2 days to complete 18%. That makes no sense; something is really broken in the repository. So I deleted all the snapshots to start backing up again from scratch.

What am I doing wrong? Are the unlocks making a mess of the repository? How should we manage this locking problem when the main restic backup command finishes?

Thanks in advance.

Does “unlock force” mean that you run unlock (safe to use with restic >= 0.10.0) or unlock --remove-all (NEVER EVER run the latter command if any other restic process is still active)? Running unlock without parameters is safe in general as it only removes old locks.

restic can run multiple concurrent backup operations on a repository. However, the forget and prune commands require exclusive access, that is, no other restic process can use the repository while these commands run. Each command also cleans up its own lock once it has finished successfully. If the network connection is interrupted, a command cannot clean up its stale lock. That should be the only reason why stale locks are left behind. These can be cleaned up using unlock (without options!).

If the lock of a running restic process is removed (that's what unlock --remove-all would do), this can allow backup and prune executions to overlap, which will likely result in damaging the repository. However, based on your description it sounds like the Bacula job first runs a backup and afterwards runs the forget/prune steps, which would prevent the steps from overlapping. In this setup it should never be necessary to unlock the repository, unless the network connection of the host failed right at the moment when these commands finished. That case would result in several warnings from restic related to removing a lock file. Most likely a network interruption would also prevent the actual command from working and therefore result in a non-zero exit code.

The main question is why the repository remains locked. If you remove the unlock call, what is the full error message printed by forget or prune (the relevant part is the information about how old the conflicting lock file is)?

As the repository is rather large: could there be an overlap between the previous and the current Bacula job?

Thanks for your detailed response. Everything you said makes total sense given what we see, especially considering the network issues.

We'll remove the --force parameter from the unlock. I guess this will fix the repo corruption issues. However, we'll still have the first problem: how do we manage those sticky locks caused by network issues?

By the way: backups are not overlapping from one day to another; a backup currently takes around 12h.

A plain call to unlock should be enough. Another somewhat ugly workaround would be to add a sleep for half an hour and call unlock afterwards. (Locks that have not been updated for half an hour are considered stale and will be removed by the unlock command).
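For example, something along these lines in the post-backup step (the repository options are placeholders):

sleep 1800   # after 30 minutes any leftover lock counts as stale
restic -r /path/to/repo --password-file /path/to/password/file unlock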

If that doesn't help, then we'll have to dig deeper into where the problematic lock file is created. It sometimes helps to list all lock files using restic list locks --no-lock and then run restic cat lock lock-file-hash with one of the hashes printed by the first command. That will print the information contained in the lock files.
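In full form that would be roughly (repository options are placeholders, and the hash comes from the first command):

restic -r /path/to/repo --password-file /path/to/password/file list locks --no-lock
restic -r /path/to/repo --password-file /path/to/password/file cat lock <one-of-the-printed-hashes>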

After removing the --remove-all option from the unlock command and triggering the prune again (days after the last backup, so no previous backup was still running), I found another lock.

Is there any relevant information in this lock? This is the output of the cat lock command:

repository 2461be36 opened (version 1)
{
  "time": "2023-08-02T15:21:13.611004683Z",
  "exclusive": false,
  "hostname": "ebf21d53445c",
  "username": "root",
  "pid": 2957
}

As the lock is non-exclusive ("exclusive": false,), it is created by the backup command or others that do not remove data from a repository.

When exactly (I need the exact time) did you retrieve that lock? What was the output from prune?

There are essentially two ways to end up with such a lock: either a restic process is still running, or restic was killed while running (restic ... | less or similar is a problem if the output is not read completely, such that restic is killed by a SIGPIPE).

The only way forward here is to find out which restic commands are run against that repository and check whether any of them are still running or haven't exited cleanly.
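For example, a quick way to list restic processes that are still running on the client and on the rest-server host (a generic shell sketch, not a restic command):

pgrep -af restic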

I'm afraid that's the problem. Restic is currently run by a Bacula job every day, and I just realized that on some days there are scheduled job tasks still waiting for previous backups to finish.

We removed the --remove-all parameter to fix one of the problems. But then we have the other problem: restic takes too long to finish backups. As it's a huge amount of data, we'll be testing the --ignore-ctime parameter.

Thanks for your help, Michael; it's been very useful.

Confirmed: restic takes more than 24h to run, so the unlock --remove-all was causing all the mess.

The --ignore-ctime parameter did not help to reduce the backup time.

We'll look for another solution. We can consider the broken-repository issue solved.

Thank you so much for the help.

I'm curious: how large (file count & bytes) is the dataset you're backing up?

If possible please consider splitting the backup into multiple repositories, ideally below 10TB. That will also make it much faster to check / prune an individual repository.

Stats of last snapshot:

        Total File Count:  268416
              Total Size:  33.992 TiB

We'll evaluate the split, but I have doubts about it. By the way, just to let you know, --ignore-ctime didn't help. We're now running restic every 2 days instead of every day. Looking good so far.

Do you know if there's any parameter for restic to avoid doing a full checksum when filenames and file sizes are the same? As I understand it, the restic backup command first reads all the files to back up and then compares them to the repository on the remote rest-server, so I'm afraid the answer is no. What about this?

For backups of a fixed set of paths, restic already automatically detects the last snapshot for these paths and only backs up changed files. Look for using parent snapshot [...] right at the start of a backup run.

If the backup paths or the host name changes, then you have to manually specify the parent snapshot using --parent snapshotID. See also Backing up — restic 0.16.3 documentation for a more detailed explanation. Since restic 0.16.0 it is also possible to add a unique tag to each backup set and combine that with --group-by tags.
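For example, manually specifying the parent would look roughly like this (the snapshot ID and paths are placeholders):

restic -r /path/to/repo backup --parent 1a2b3c4d /path/to/data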

Hi Michael,

Thanks for your reply, this is very helpful. And it makes a lot of sense.

We include the date in the path, so every day we back up different path names.
This is an example of the directory tree that we back up with restic:

Today is the 21st of August, so yesterday's backup paths were:

/path/to/dir1/20082023/*
/path/to/dir2/20082023/*

Today’s backup will be

/path/to/dir1/21082023/*
/path/to/dir2/21082023/*

There are actually up to 5 dirs per snapshot, so we run restic including those paths in the backup command:

restic backup -r rest:https://xx.xx.xx.xx@xxxxxxxxxx:8000/backup-Repo/ \
                 /path/to/dir1/${DATE}/ \
                 /path/to/dir2/${DATE}/ \
                 /path/to/dir3/${DATE}/ \
                 /path/to/dir4/${DATE}/ \
                 /path/to/dir5/${DATE}/ \
                 --cacert /opt/restic/public_key --ignore-ctime

and so we get output like the following from the restic snapshots command:

                                                         /path/to/dir1/19082023
                                                         /path/to/dir2/19082023
                                                         /path/to/dir3/19082023
                                                         /path/to/dir4/19082023
                                                         /path/to/dir5/19082023

888hhh333  2023-08-20 19:25:43  e08d160586a3
                                                         /path/to/dir1/20082023
                                                         /path/to/dir2/20082023
                                                         /path/to/dir3/20082023
                                                         /path/to/dir4/20082023
                                                         /path/to/dir5/20082023

An important detail here is that those paths are hard links.
Will this parent snapshot parameter help with this? Reading the documentation, it will not work with symlinks, so I'm not sure we're on the right path with the parent snapshot specified on the backup command.

In case that's what we need, I have a question: is there any command that will return just the last snapshot ID? restic list snapshots shows a list of long IDs, while the snapshots command shows the short ID (I think this is the ID we need). Should I extract the short ID from the snapshots output using cut/sed/awk, or is there anything in restic that can return the last snapshot ID I need?

Thanks in advance for all the help.

If the path to the source data is different in each backup, then there is no way to avoid reading all files again. The --parent parameter cannot help in that case.

Since restic 0.16.0 the easiest way would be to use --group-by host,tag (see Backing up — restic 0.16.3 documentation) and distinguish the backup sets using a unique tag.

However, as this is only a way to influence how the parent snapshot is selected, it can’t help with avoiding a full scan of all files.
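As a sketch, applying the tag-based grouping to the backup command from earlier in the thread would look roughly like this (requires restic >= 0.16.0; the tag value is a placeholder and must stay the same across runs so the parent snapshot can be found):

restic backup -r rest:https://xx.xx.xx.xx@xxxxxxxxxx:8000/backup-Repo/ \
                 /path/to/dir1/${DATE}/ \
                 /path/to/dir2/${DATE}/ \
                 /path/to/dir3/${DATE}/ \
                 /path/to/dir4/${DATE}/ \
                 /path/to/dir5/${DATE}/ \
                 --cacert /opt/restic/public_key --ignore-ctime \
                 --tag daily-set --group-by host,tags

Again, this only changes how the parent snapshot is selected; restic will still have to read all files because the paths differ every day.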