Check, Rebuild-Index, Prune

I’m apparently on a restic-roll today. #badum-tss

I did read this but just want to make sure my assumptions are correct…

check = a read-only verification of the repository
prune = rewrites the index, removing unreferenced data from the index
rebuild-index = rewrites the index, ADDING unreferenced data to the index (but NOT to a snapshot)


Long story short, I got the dreaded “panic: runtime error: invalid memory address or nil pointer dereference” message. Pretty sure I ran “restic unlock” when something else was still actually writing to the repo. I screwed it up. Whoops. While researching what to do about it, I had some questions about check, rebuild-index, and prune.

So, basically, with a damaged repo, if I run prune, it’ll toss out unreferenced data at the expense of future time and bandwidth, but (hopefully) still fix my repo.

If I run rebuild-index, it’ll include that unreferenced data in the index, and (hopefully) also fix my repo. The advantage of this method is that when I run a backup against my newly fixed repo, it will reuse blocks that are still in the repo, saving time and bandwidth. After running said backup successfully, I could then run a prune, remove the last remaining bit (if any) of unreferenced data, and live happily ever after… riiiiight?

TL;DR: Prune and rebuild-index both rewrite the index exactly the same way, but one removes unreferenced data, and the other adds it to the newly rewritten index. Check only performs a read-only check and doesn’t fix anything. Correct?
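
In practice, I imagine the recovery going something like this (a sketch; the repository path and backup path are placeholders):

# make the index reference everything actually present in the repo
restic -r /srv/restic-repo rebuild-index

# back up; blobs already in the repo are deduplicated instead of re-uploaded
restic -r /srv/restic-repo backup /home/data

# verify the repo is consistent again
restic -r /srv/restic-repo check

# finally, drop whatever is still unreferenced
restic -r /srv/restic-repo prune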

Oh, that’s a bug for sure. Do you maybe still have the backtrace that was printed? restic should not just throw a backtrace and panic in the user’s face (or terminal).

Almost:

  • prune: Rewrites pack files to remove unneeded data, then writes a new index containing only referenced data
  • rebuild-index: Writes a new index from scratch containing all data in the repo (including unreferenced data)

Yep.

Correct.

Hopefully, yes :wink:

I had one of these recently and it turned out to be a bad local cache. After wiping the cache the repo checked 100% OK.

Ah, interesting! I’m glad you figured out a workaround! If it happens again, please save the backtrace and open an issue on GitHub, so we can track down the issue and make restic behave in that case :wink:

Unfortunately I do not. I assumed it was my own blunder and didn’t think it was a bug (I immediately saw a backup client writing to the repo just after I forcibly unlocked and then purposefully “forgot” two snapshots).

Ahh, thanks for the clarification!

It was ONE specific client throwing this error, so if it turns out to be a bad cache, I’ll get you the backtrace and work with you to squish this particular bug. :+1:

I’m basically using paexec to have about 50 clients simultaneously back up to a 6TB drive. It’s a stress test not only of my network and poor 6TB drive, but also of restic, so chances are I’ll be able to make it happen again hahaha. (We have another backup running, and this is more of an experiment on my part as an extra precaution, which is why I’m not that concerned lol.)


PS: While we wait (I’m at 75% of “building a new index for repo” after 15 hours on a 3.5TB repo), how DOES one acquire and save the backtrace? Is that just what restic prints out on the screen when it errors out, or…?

rebuild-index is also capable of healing a specific type of damage.

If the repository is missing some objects (blobs or trees) that are referenced by a snapshot, but the index thinks that they exist in the repository, then backup clients will deduplicate against the missing objects, which means newly-generated snapshots will also be damaged if they contain any of the missing data.

However, if you encounter this situation and run rebuild-index, the missing objects will be removed from the index. In that case, future backups could re-add those objects and effectively fix snapshots that are in a broken state.
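
Roughly, the repair sequence for that case would be (a sketch, assuming the missing files still exist on a client):

restic check                  # reports objects that snapshots reference but that are missing
restic rebuild-index          # rebuild the index so it no longer claims the missing objects
restic backup /path/to/data   # re-uploads the missing objects, since dedup no longer skips them
restic check                  # confirm the snapshots are complete again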

@cdhowie Ahhh good to know. That’s exactly the type of thing I was wondering about. Thanks!

Yup, will do. I managed to delete the backtrace by mistake on this occasion!

Would restic check be capable of detecting this kind of error? Or do I just need to periodically run rebuild-index? Or just when I notice a problem?

restic check will detect missing objects.
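
So a scheduled run should catch this kind of damage early, for example (a sketch, assuming RESTIC_REPOSITORY and RESTIC_PASSWORD are set in the cron environment):

# crontab entry: run an integrity check every Sunday at 03:00
0 3 * * 0  restic check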

@cdhowie So that means if it passes restic check there should be no need for rebuild-index, correct??

I believe that’s the case, yes.

@fd0 Here you go, happened again. It was working fine but I had to CTRL+C a client backup to make some changes. When I tried again, I got this:

I deleted the caches, it didn’t help. Having to run rebuild-index again.

Funny thing is I have an exact clone of the repo. I did a sync from the working one to the new one that just broke. Nothing changed, and it still won’t back up. Just throws that error. Doubt rebuild-index is going to help but I’m giving it a shot overnight.

I should mention it’s the same Server 2008 R2 client as before, too.

The rebuild-index fixed the issue. The thing is, the old repo was working, and the new repo was a byte-for-byte duplicate.

I’m thinking maybe I missed some cache files and the new indexes caused the client to rebuild the cache I missed??

Uh, congratulations, you found a bug! :slight_smile:

Did restic print anything before the panic? Like, was there an error loading an index file?

I assume this is restic 0.9.5, is that right?

What happens here is that restic tries to load the index files and somehow fails, and then it does not correctly check for errors. It all starts in LoadIndex() here:

idx, buf, err = LoadIndexWithDecoder(ctx, r, buf[:0], fi.ID, DecodeIndex)
if err != nil && errors.Cause(err) == ErrOldIndexFormat {
	idx, buf, err = LoadIndexWithDecoder(ctx, r, buf[:0], fi.ID, DecodeOldIndex)
}

select {
case indexCh <- idx:
case <-ctx.Done():
}

Some worker function loads an index, checks for a specific error (ErrOldIndexFormat) and handles that, but does not check for other errors! So it may happen that err is set, and idx is nil. This nil index is then sent to indexCh.
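
The missing piece is an error check after the fallback, roughly like this (a sketch of the idea, not the exact patch):

idx, buf, err = LoadIndexWithDecoder(ctx, r, buf[:0], fi.ID, DecodeIndex)
if err != nil && errors.Cause(err) == ErrOldIndexFormat {
	idx, buf, err = LoadIndexWithDecoder(ctx, r, buf[:0], fi.ID, DecodeOldIndex)
}
// the missing check: abort the worker instead of sending a nil index downstream
if err != nil {
	return err
}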

In another goroutine, a different function collects the loaded indexes here:

for idx := range indexCh {
	id, err := idx.ID()
	if err == nil {
		validIndex.Insert(id)
	}
	r.idx.Insert(idx)
}

It receives an index in the variable idx and calls the ID() function on it. If the received idx is nil, a panic happens in that function here:

func (idx *Index) ID() (restic.ID, error) {
	idx.m.Lock()

When idx is nil, then there’s no idx.m, so that’s when it panics.
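
Just to illustrate: a defensive nil check in the collector would also avoid the panic (a sketch, not the actual fix; the worker should never send a nil index in the first place):

for idx := range indexCh {
	// skip indexes that failed to load instead of calling ID() on nil
	if idx == nil {
		continue
	}
	id, err := idx.ID()
	if err == nil {
		validIndex.Insert(id)
	}
	r.idx.Insert(idx)
}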

Your run of restic rebuild-index rewrote the index files and removed the old ones, and somehow the error went away. The original error could have been a failure to access an index file either in the repository or in the local cache; unfortunately, I can’t tell which.

I’ve managed to reproduce the issue locally:

  1. Create a new repo in /tmp/foo, remember repo ID
  2. Make a backup
  3. Run chmod 000 /tmp/foo/index/*
  4. Remove the cache via rm -rf ~/.cache/restic/ID
  5. Run restic backup again:
repository 692de0f7 opened successfully, password is correct
created new cache in /home/fd0/.cache/restic
Load(<index/6b8f24ac36>, 0, 0) returned error, retrying after 552.330144ms: open /tmp/foo/index/6b8f24ac368d06aec6bddec3e0f915adc876267caf2fd52596f8af5f7b934225: permission denied
[...]
Load(<index/6b8f24ac36>, 0, 0) returned error, retrying after 13.811796615s: open /tmp/foo/index/6b8f24ac368d06aec6bddec3e0f915adc876267caf2fd52596f8af5f7b934225: permission denied

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7bef48]

goroutine 54 [running]:
github.com/restic/restic/internal/repository.(*Index).ID(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/fd0/shared/work/go/src/github.com/restic/restic/internal/repository/index.go:377 +0x38
github.com/restic/restic/internal/repository.(*Repository).LoadIndex.func5(0x8, 0xe75950)
	/home/fd0/shared/work/go/src/github.com/restic/restic/internal/repository/repository.go:467 +0xce
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000372750, 0xc000392240)
	/home/fd0/shared/work/go/pkg/mod/golang.org/x/sync@v0.0.0-20181221193216-37e7f081c4d4/errgroup/errgroup.go:57 +0x57
created by golang.org/x/sync/errgroup.(*Group).Go
	/home/fd0/shared/work/go/pkg/mod/golang.org/x/sync@v0.0.0-20181221193216-37e7f081c4d4/errgroup/errgroup.go:54 +0x66

I’ll create a PR which fixes this shortly.

Will be fixed in #2321, that’s much better:

repository 4aff599a opened successfully, password is correct
Load(<index/733a2fba14>, 0, 0) returned error, retrying after 552.330144ms: open /tmp/foo/index/733a2fba1427b05b622b0243cda4b574b100638465f6ee3053114991fead392f: permission denied
[...]
Fatal: unable to load index 733a2fba: open /tmp/foo/index/733a2fba1427b05b622b0243cda4b574b100638465f6ee3053114991fead392f: permission denied

Yay! Glad I could help!

That said, do you think it could cause a Windows “bugcheck”? That same box seems to be rebooting whenever I run restic on it, after the index rebuild. And I suspect it’s what caused a few other bugchecks in the past.

The computer has rebooted from a bugcheck. The bugcheck was: 0x0000003b (0x00000000c0000005, 0xfffff960000d592f, 0xffffd00027ed9150, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 063019-8031-01.

I’m not sure, but it’s possible. If you like, you can try that with a very simple Go program:

package main

import "fmt"

func main() {
	var foo *string
	fmt.Println(*foo)
}

Install Go, save the file as main.go somewhere, and run go run main.go. On Linux, this causes a segmentation fault signal; I’m not sure what it does on Windows.

Didn’t crash the machine. Just displayed this:

C:\Users\Akrabu\Desktop>go run main.go
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x8 pc=0x49113a]

goroutine 1 [running]:
main.main()
C:/Users/Akrabu/Desktop/main.go:7 +0x2a
exit status 2

I’m basically getting this every time I try to run restic on this server:

On Mon 7/1/2019 12:18:11 AM GMT your computer crashed
crash dump file: C:\Windows\Minidump\063019-8031-01.dmp
uptime: 01:11:23
This was probably caused by the following module: win32k.sys (win32k!GetDCEx+0x4F)
Bugcheck code: 0x3B (0xC0000005, 0xFFFFF960000D592F, 0xFFFFD00027ED9150, 0x0)
Error: SYSTEM_SERVICE_EXCEPTION
file path: C:\Windows\system32\win32k.sys
product: Microsoft® Windows® Operating System
company: Microsoft Corporation
description: Multi-User Win32 Driver
Bug check description: This indicates that an exception happened while executing a routine that transitions from non-privileged code to privileged code.
This appears to be a typical software driver bug and is not likely to be caused by a hardware problem.
The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system that cannot be identified at this time.

Restic’s memory usage hovers around 2-3GB. A total of 5.1GB of 8GB used right before it crashed, if that’s helpful.

EDIT: Scratch that. It just did it without running restic. Not your problem. Restic probably just stresses whatever the underlying problem is. Disregard!

@fd0 Is this fix available in a beta yet? I have restic backing up ~80 users every time their screen locks, and I’ve started to get a lot of failed backups due to this bug. I have to manually go into each client that’s failing and wipe out their cache. Thanks!