Check, Rebuild-Index, Prune

Ah, interesting! I’m glad you figured out a workaround! If it happens again, please save the backtrace and open an issue on GitHub, so we can track down the issue and make restic behave in that case :wink:

Unfortunately I do not. I assumed it was my own blunder and didn’t think it was a bug (I immediately saw a backup client writing to the repo just after I forcibly unlocked and then purposefully “forgot” two snapshots).

Ahh, thanks for the clarification!

It was ONE specific client throwing this error, so if it turns out to be a bad cache, I’ll get you the backtrace and work with you to squish this particular bug. :+1:

I’m basically using paexec to have about 50 clients simultaneously back up to a 6TB drive. It’s a stress test not only of my network and poor 6TB drive, but also of restic - so chances are I’ll be able to make it happen again hahaha. (We have another backup running, and this is more of an experiment on my behalf as an extra precaution, which is why I’m not that concerned lol.)


PS: While we wait (I’m at 75% of “building a new index for repo” after 15 hours on a 3.5TB repo), how DOES one acquire and save the backtrace? Is that just what restic prints out on the screen when it errors out, or…?


rebuild-index is also capable of healing a specific type of damage.

If the repository is missing some objects (blobs or trees) that are referenced by a snapshot, but the index thinks that they exist in the repository, then backup clients will deduplicate against the missing objects, which means newly-generated snapshots will also be damaged if they contain any of the missing data.

However, if you encounter this situation and run rebuild-index, then the missing objects will be removed from the index. In that case, future backups could re-add those objects and effectively fix snapshots that are in a broken state.
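For example, a recovery could look roughly like this (a rough sketch; the repository path and backup source are placeholders):

restic -r /srv/repo check          # reports the missing blobs/trees
restic -r /srv/repo rebuild-index  # drops the missing objects from the index
restic -r /srv/repo backup /data   # re-uploads data the index no longer claims to have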


@cdhowie Ahhh good to know. That’s exactly the type of thing I was wondering about. Thanks!


Yup, will do. I managed to delete the backtrace by mistake on this occasion!


Would restic check be capable of detecting this kind of error? Or do I just need to periodically run rebuild-index? Or just when I notice a problem?

restic check will detect missing objects.


@cdhowie So that means if it passes restic check there should be no need for rebuild-index, correct??

I believe that’s the case, yes.


@fd0 Here you go, happened again. It was working fine but I had to CTRL+C a client backup to make some changes. When I tried again, I got this:

I deleted the caches, but it didn’t help. Now I’m having to run rebuild-index again.

Funny thing is I have an exact clone of the repo. I did a sync from the working one to the new one that just broke. Nothing changed, and it still won’t back up. Just throws that error. Doubt rebuild-index is going to help but I’m giving it a shot overnight.

I should mention it’s the same Server 2008 R2 client as before, too.

The rebuild-index fixed the issue. The thing is, the old repo was working. The new repo was a byte-for-byte duplication.

I’m thinking maybe I missed some cache files, and the new indexes caused the client to rebuild the cache files I’d missed?

Uh, congratulations, you found a bug! :slight_smile:

Did restic print anything before the panic? Like, was there an error loading an index file?

I assume this is restic 0.9.5, is that right?

What happens here is that restic tries to load the index files and somehow fails, and then it does not correctly check for errors. It all starts in LoadIndex() here:

idx, buf, err = LoadIndexWithDecoder(ctx, r, buf[:0], fi.ID, DecodeIndex)
if err != nil && errors.Cause(err) == ErrOldIndexFormat {
	idx, buf, err = LoadIndexWithDecoder(ctx, r, buf[:0], fi.ID, DecodeOldIndex)
}

select {
case indexCh <- idx:
case <-ctx.Done():
}

Some worker function loads an index, checks for a specific error (ErrOldIndexFormat) and handles that, but does not check for other errors! So it may happen that err is set, and idx is nil. This nil index is then sent to indexCh.
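For illustration, the missing guard would go right after the fallback, roughly like this (a sketch, not the actual patch; how the error propagates out of the worker depends on the surrounding errgroup):

idx, buf, err = LoadIndexWithDecoder(ctx, r, buf[:0], fi.ID, DecodeIndex)
if err != nil && errors.Cause(err) == ErrOldIndexFormat {
	idx, buf, err = LoadIndexWithDecoder(ctx, r, buf[:0], fi.ID, DecodeOldIndex)
}
// the missing check: don't send a nil index downstream when loading failed
if err != nil {
	return err
}

select {
case indexCh <- idx:
case <-ctx.Done():
}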

In another Goroutine, a different function collects the loaded indexes here:

for idx := range indexCh {
	id, err := idx.ID()
	if err == nil {
		validIndex.Insert(id)
	}
	r.idx.Insert(idx)
}

It receives an index in the variable idx and calls the ID() function on it. If the received idx is nil, a panic happens in that function here:

func (idx *Index) ID() (restic.ID, error) {
	idx.m.Lock()

When idx is nil, then there’s no idx.m, so that’s when it panics.
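This is a Go quirk worth spelling out: calling a method on a nil pointer receiver is perfectly legal, and the panic only happens once the method dereferences the receiver. A self-contained sketch of the same failure mode (hypothetical, simplified types):

package main

import "sync"

type Index struct {
	m sync.Mutex
}

func (idx *Index) ID() {
	// calling ID() on a nil *Index is fine; dereferencing
	// the nil receiver to reach idx.m is what panics
	idx.m.Lock()
	defer idx.m.Unlock()
}

func main() {
	var idx *Index // nil, like the index sent down indexCh
	idx.ID()       // panic: invalid memory address or nil pointer dereference
}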

Your run of restic rebuild-index rewrote the index files and removed the old ones, and somehow the error went away. It could have been an error accessing the index file either in the repository or in the local cache; unfortunately I cannot tell which.

I’ve managed to reproduce the issue locally:

  1. Create a new repo in /tmp/foo, remember repo ID
  2. Make a backup
  3. Run chmod 000 /tmp/foo/index/*
  4. Remove the cache via rm -rf ~/.cache/restic/ID
  5. Run restic backup again:
repository 692de0f7 opened successfully, password is correct
created new cache in /home/fd0/.cache/restic
Load(<index/6b8f24ac36>, 0, 0) returned error, retrying after 552.330144ms: open /tmp/foo/index/6b8f24ac368d06aec6bddec3e0f915adc876267caf2fd52596f8af5f7b934225: permission denied
[...]
Load(<index/6b8f24ac36>, 0, 0) returned error, retrying after 13.811796615s: open /tmp/foo/index/6b8f24ac368d06aec6bddec3e0f915adc876267caf2fd52596f8af5f7b934225: permission denied

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7bef48]

goroutine 54 [running]:
github.com/restic/restic/internal/repository.(*Index).ID(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/fd0/shared/work/go/src/github.com/restic/restic/internal/repository/index.go:377 +0x38
github.com/restic/restic/internal/repository.(*Repository).LoadIndex.func5(0x8, 0xe75950)
	/home/fd0/shared/work/go/src/github.com/restic/restic/internal/repository/repository.go:467 +0xce
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000372750, 0xc000392240)
	/home/fd0/shared/work/go/pkg/mod/golang.org/x/sync@v0.0.0-20181221193216-37e7f081c4d4/errgroup/errgroup.go:57 +0x57
created by golang.org/x/sync/errgroup.(*Group).Go
	/home/fd0/shared/work/go/pkg/mod/golang.org/x/sync@v0.0.0-20181221193216-37e7f081c4d4/errgroup/errgroup.go:54 +0x66

I’ll create a PR which fixes this shortly.


Will be fixed in #2321, that’s much better:

repository 4aff599a opened successfully, password is correct
Load(<index/733a2fba14>, 0, 0) returned error, retrying after 552.330144ms: open /tmp/foo/index/733a2fba1427b05b622b0243cda4b574b100638465f6ee3053114991fead392f: permission denied
[...]
Fatal: unable to load index 733a2fba: open /tmp/foo/index/733a2fba1427b05b622b0243cda4b574b100638465f6ee3053114991fead392f: permission denied

Yay! Glad I could help!

That said, do you think it could cause a Windows “bugcheck”? That same box seems to be rebooting whenever I run Restic on it, after the index rebuild. And I suspect it’s what caused a few other crashes in the past.

The computer has rebooted from a bugcheck. The bugcheck was: 0x0000003b (0x00000000c0000005, 0xfffff960000d592f, 0xffffd00027ed9150, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 063019-8031-01.

I’m not sure, but it’s possible. If you like, you can try that with a very simple Go program:

package main

import "fmt"

func main() {
	var foo *string
	fmt.Println(*foo)
}

Install Go, save the file as main.go somewhere, and run go run main.go. On Linux, this causes a segmentation fault signal; I’m not sure what it does on Windows.

Didn’t crash the machine. Just displayed this:

C:\Users\Akrabu\Desktop>go run main.go
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x8 pc=0x49113a]

goroutine 1 [running]:
main.main()
C:/Users/Akrabu/Desktop/main.go:7 +0x2a
exit status 2

I’m basically getting this every time I try to run restic on this server:

On Mon 7/1/2019 12:18:11 AM GMT your computer crashed
crash dump file: C:\Windows\Minidump\063019-8031-01.dmp
uptime: 01:11:23
This was probably caused by the following module: win32k.sys (win32k!GetDCEx+0x4F)
Bugcheck code: 0x3B (0xC0000005, 0xFFFFF960000D592F, 0xFFFFD00027ED9150, 0x0)
Error: SYSTEM_SERVICE_EXCEPTION
file path: C:\Windows\system32\win32k.sys
product: Microsoft® Windows® Operating System
company: Microsoft Corporation
description: Multi-User Win32 Driver
Bug check description: This indicates that an exception happened while executing a routine that transitions from non-privileged code to privileged code.
This appears to be a typical software driver bug and is not likely to be caused by a hardware problem.
The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system that cannot be identified at this time.

Restic’s memory usage hovers around 2-3GB. A total of 5.1GB of 8GB used right before it crashed, if that’s helpful.

EDIT: Scratch that. It just did it without running restic. Not your problem. Restic probably just stresses whatever the underlying problem is. Disregard!


@fd0 Is this fix available in a beta yet? I have restic backing up ~80 users every time their screen locks, and I’ve started to get a lot of failed backups due to this bug. I have to manually go into each client that’s failing and wipe out their cache. Thanks!

You could use the --no-cache flag in the meantime (or autowipe the cache if the call to restic fails).
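Something like this in the backup wrapper would do the autowipe (a rough sketch; /data is a placeholder, and the cache path shown is the Linux default; on Windows I believe the cache lives under %LOCALAPPDATA%\restic):

# retry once with a fresh cache if the first attempt fails
if ! restic backup /data; then
	rm -rf "$HOME/.cache/restic"
	restic backup /data
fi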

I considered not using a cache, but want to keep the load on client PCs as light as possible.

For now I’m just periodically checking the logs, and if I see one is stuck, I wipe it manually. It hasn’t happened THAT much, thankfully. :slight_smile:

FYI, the fix is included in restic 0.9.6.
