Unable to complete Check - Fatal: repository contains errors

jonreeves · December 20, 2018, 1:31pm

Hi there,
I have a repository (stored in Wasabi) with only two snapshots. I decided to run check on it yesterday to see if there were any unused files that could be purged (from interrupted uploads).

I eventually get this result:

...
unused blob 423d14cd
Fatal: repository contains errors

I haven’t gone on to try prune or rebuild-index, for fear of damaging the repository further. Any advice?

The repo is ~10TB. I have two others about the same size each with more snapshots and they both were able to check fine. (Although one needed a prune and the other a rebuild-index).

Cheers,
Jon

cdhowie · December 20, 2018, 7:10pm

An unused blob by itself is not an error. That could be the result of an incomplete backup or a deleted snapshot.

The error was likely earlier in the output. What is the full output of restic check?

jonreeves · December 20, 2018, 7:29pm

The output is relatively large, so I have truncated the repetitive stuff…
restic --cache-dir /tmp/restic/cache check --check-unused --no-lock

using temporary cache in /tmp/restic/cache/restic-check-cache-239410159
created new cache in /tmp/restic/cache/restic-check-cache-239410159
load indexes
check all packs
pack 14cf7b9b: not referenced in any index
...
803 additional files were found in the repo, which likely contain duplicate data.
You can run `restic prune` to correct this.
check snapshots, trees and blobs
unused blob 12d44fb8
... (30,514 entries)
unused blob 423d14cd
Fatal: repository contains errors

I’ve not removed any snapshots or performed any action other than backup and check on this repository.

I just tested to see if mount works and it did, and I was able to copy a file at random from within the repository.

cdhowie · December 20, 2018, 7:34pm

I just did some tests, and “unused blob” is in fact considered an error, so I was wrong about that. I’m not sure exactly why this is; I would consider it more on the level of “notice” or “warning.” It’s normal to see unused blobs, especially if a backup was ever interrupted.

A pack not referenced in an index is not considered an error.

You should be able to safely run restic prune to discard the unused blob(s). This will also rebuild the indexes, which will fix the “pack not referenced” issue.

Side note: I would bet money that the packs not included in an index also contain the unused blobs. This is very strong evidence that the packs+blobs were uploaded as part of a backup operation that got interrupted.

jonreeves · December 20, 2018, 8:08pm

Thanks cdhowie, I’ll move forward with a prune and see how it goes.

jonreeves · December 21, 2018, 3:45pm

Took nearly 21 hours, but prune finished without error. I’m gonna run a check to see if it completes this time.

Any ideas why things like check, prune and rebuild-index take so long? My CPU and Memory were running between 1-5% the whole time, and my bandwidth is 1Gbps but only seemed to use 3Mbit every 10 seconds?

jonreeves · December 21, 2018, 4:20pm

I take that back, the check I just ran only took about 30 minutes. I guess when there are problems found things are slower… or maybe it was quicker this time because of the local cache?

FYI… the check was successful. Yay, thanks for your help @cdhowie

cdhowie · December 21, 2018, 6:05pm

With a 10TB repo I would expect prune and rebuild-index to take a significant amount of time, especially on remote object storage. This is because both commands disregard existing indexes and have to scan every pack’s header. A large chunk of this time is simply the round-trip to obtain the list of packs; since most object storage systems impose some limit on the number of objects that a single “list” operation will return, this operation must be invoked many times.

Based on your repository size of 10TB and an average pack size of 8MB, you have about 1,310,720 packs. With a 1000-object limit per ListObjects call, it will take 1,311 round-trips to the Wasabi S3 endpoint to retrieve a list of all of the packs.
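
The arithmetic above can be reproduced with a short sketch. It assumes binary units (10 TiB and 8 MiB), which is what yields the 1,310,720 figure; the constant names are illustrative only.

```python
import math

# Assumed inputs, taken from the estimate above and interpreted in binary units.
REPO_BYTES = 10 * 1024**4       # 10 TiB repository
AVG_PACK_BYTES = 8 * 1024**2    # 8 MiB average pack size
LIST_PAGE_SIZE = 1000           # objects returned per list call

packs = REPO_BYTES // AVG_PACK_BYTES
list_calls = math.ceil(packs / LIST_PAGE_SIZE)

print(f"packs: {packs:,}")            # packs: 1,310,720
print(f"list calls: {list_calls:,}")  # list calls: 1,311
```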

Then the header from each pack must be fetched. This adds a ton of round-trip latency, but the amount of data returned each call is low and the amount of CPU required to process the results is also low.

Basically, restic is spending all of its time waiting on the network.
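
To get a feel for how round-trip latency can dominate, here is a rough back-of-the-envelope sketch. The 50 ms round-trip time is a made-up placeholder, not a measurement of Wasabi, and the model assumes every request is issued one after another, which may not match what restic actually does.

```python
def sequential_wait_hours(requests: int, rtt_seconds: float) -> float:
    """Hours spent waiting if `requests` round-trips run strictly one at a time."""
    return requests * rtt_seconds / 3600


PACKS = 1_310_720            # pack count from the estimate above
LIST_CALLS = 1_311           # list calls from the estimate above
ASSUMED_RTT_SECONDS = 0.05   # hypothetical round-trip time, NOT measured

# One header fetch per pack, plus the list calls needed to discover the packs.
total_hours = sequential_wait_hours(PACKS + LIST_CALLS, ASSUMED_RTT_SECONDS)
print(f"{total_hours:.1f} hours of pure network wait")
```

With these assumed inputs the wait alone comes to many hours while CPU and bandwidth stay nearly idle. Change `ASSUMED_RTT_SECONDS` to a latency you have actually measured before drawing any conclusions.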

jonreeves · December 22, 2018, 12:49am

I see, I’m guessing these requests aren’t done in parallel too, which probably doesn’t help.

Thanks for the info. Good to know everything is functioning as intended.

cdhowie · December 22, 2018, 3:46am

There’d have to be intentional work done to make the “list object” request parallel, because it’s not as simple as “run a bunch of requests on different threads.”

The S3 ListObjects call can only return 1,000 objects at a time. If there are more objects to list, it returns a marker value that you can pass to the next ListObjects call to continue where the previous one left off. This means you can’t request a page until you’ve listed the previous one, which rules out simple concurrency.
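
The dependency between pages can be seen in a minimal sketch. `list_page` below is a hypothetical in-memory stand-in for a paged list call, not the real S3 API; the point is only that each call needs the marker returned by the one before it.

```python
import bisect

PAGE_SIZE = 1000


def list_page(keys, marker=None, page_size=PAGE_SIZE):
    """Hypothetical paged list call over a sorted list of keys.

    Returns (page, next_marker); next_marker is None when nothing is left.
    """
    start = 0 if marker is None else bisect.bisect_right(keys, marker)
    page = keys[start:start + page_size]
    more = start + page_size < len(keys)
    return page, (page[-1] if more else None)


def list_all(keys):
    """List everything; every call must wait for the previous call's marker."""
    result, marker, calls = [], None, 0
    while True:
        page, marker = list_page(keys, marker)
        calls += 1
        result.extend(page)
        if marker is None:
            return result, calls


keys = sorted(f"data/{i:06d}" for i in range(2500))
listed, calls = list_all(keys)
print(len(listed), calls)  # 2500 3
```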

However, since restic stores packs under a name that’s the hex representation of the SHA256 sum of the contents, we can enumerate, say, a byte’s worth of prefixes concurrently (00 through ff). On small repositories, this can result in many more list operations being performed than is necessary, so there would need to be some kind of initial test to see if this is beneficial. Off the top of my head, a reasonable implementation could be:

List all objects under data/.
- If the result set is incomplete (more list API calls are needed to fetch the full set) and the results contain any pack that doesn’t start with data/00/ (that is, there are fewer than 1,000 packs starting with 00), then proceed as usual, performing only one operation at a time. There will likely be fewer than 256 API calls necessary to finish listing the objects.
- Otherwise, there are likely 1,000 or more packs with each prefix.¹ In this case, start 256 parallel list operations – one for each prefix from data/00/ through data/ff/.
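
The heuristic above could be sketched roughly as follows. This is not restic code: `FakeBucket` is a hypothetical in-memory stand-in for an object store with paged listing, and the function names and the thread-pool approach are assumptions made for illustration.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor


class FakeBucket:
    """Hypothetical in-memory object store with paged, prefix-filtered listing."""

    def __init__(self, keys, page_size=1000):
        self.keys = sorted(keys)
        self.page_size = page_size

    def list_page(self, prefix, marker=None):
        """Return (page, next_marker); next_marker is None on the last page."""
        if marker is None:
            i = bisect.bisect_left(self.keys, prefix)
        else:
            i = bisect.bisect_right(self.keys, marker)
        page = []
        while (i < len(self.keys) and self.keys[i].startswith(prefix)
               and len(page) < self.page_size):
            page.append(self.keys[i])
            i += 1
        truncated = i < len(self.keys) and self.keys[i].startswith(prefix)
        return page, (page[-1] if truncated else None)


def list_prefix(bucket, prefix):
    """Sequentially page through everything under one prefix."""
    keys, marker = bucket.list_page(prefix)
    while marker is not None:
        page, marker = bucket.list_page(prefix, marker)
        keys.extend(page)
    return keys


def list_packs(bucket, workers=256):
    """List data/ sequentially for small repos, per-prefix in parallel for large ones."""
    first_page, marker = bucket.list_page("data/")
    if marker is None:
        return first_page  # everything fit in a single call
    if any(not key.startswith("data/00/") for key in first_page):
        # Fewer than one page of packs under 00: finish sequentially.
        keys = first_page
        while marker is not None:
            page, marker = bucket.list_page("data/", marker)
            keys.extend(page)
        return keys
    # The first page was entirely data/00/: list each prefix concurrently.
    prefixes = [f"data/{i:02x}/" for i in range(256)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda p: list_prefix(bucket, p), prefixes))
    return [key for part in parts for key in part]


# Small demo: 12 packs per prefix with a page size of 10 triggers the parallel path.
demo = FakeBucket(
    [f"data/{p:02x}/{p:02x}{j:062x}" for p in range(256) for j in range(12)],
    page_size=10,
)
print(len(list_packs(demo)))  # 3072
```

Note that the parallel path lists data/00/ again from the start and discards the probe page, which costs one extra call but keeps the logic simple.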

@fd0 Has any work been done in this area (making list object calls parallel)? If not, does this seem like a sane optimization?

¹ This takes advantage of the fact that, as the number of files being hashed increases, where the contents are largely unpredictable, the distribution of the hash function output over its range will approach uniformity. Therefore, we can reasonably conclude that if there are N packs with a particular prefix, there are likely to be approximately N packs with every other (non-intersecting) prefix.
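
The uniformity claim in the footnote is easy to check empirically. The sketch below hashes sequential integers as a stand-in for pack contents (an assumption; real packs hold encrypted data) and counts how many SHA-256 digests fall under each one-byte hex prefix.

```python
import hashlib
from collections import Counter


def prefix_counts(n: int) -> Counter:
    """Count SHA-256 digests of n distinct inputs by their first hex byte."""
    counts = Counter()
    for i in range(n):
        digest = hashlib.sha256(i.to_bytes(8, "big")).hexdigest()
        counts[digest[:2]] += 1
    return counts


N = 100_000
counts = prefix_counts(N)
print(f"prefixes seen: {len(counts)}")
print(f"expected per prefix: {N / 256:.1f}")
print(f"min: {min(counts.values())}, max: {max(counts.values())}")
```

The minimum and maximum should sit fairly close to the expected value, and the relative spread shrinks as N grows.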

ifedorenko · December 22, 2018, 4:28am

FWIW, I have a parallel list implementation in my custom OneDrive backend https://github.com/ifedorenko/restic/blob/onedrive/internal/backend/onedrive/onedrive.go#L630-L732. It runs a configured number of goroutines, and each goroutine lists one of the pack directories. According to the commit comment, “time to list 92K of data files went from 192 to 34 seconds”. It should be fairly straightforward to make the logic available to all backends, assuming fd0 is interested and has time to review a PR, of course.