Corrupted repo after rebuild-index


#1

restic 0.9.1 compiled with go1.10.3 on darwin/amd64

Repo is on Wasabi (s3:https://us-east-1@s3.wasabisys.com)
Timeline:

  • restic check
    → 1 additional files were found in the repo, which likely contain duplicate data.
    You can run restic prune to correct this.
  • restic prune
  • restic rebuild-index
  • restic check

As I’m on a really slow connection on one of my machines, I must confess I hit Ctrl-C a few times during backup or check (and perhaps prune).

The messages say “tree … not found in repository”. Does that mean some parts of the repo are missing?

If so, as an end user (even a somewhat technical one), how do I know which file or snapshot these errors belong to?
I’m in the process of trying a --read-data-subset 1/10, but it outputs the same errors, so that doesn’t help.
Does the --read-data flag actually fix anything?
Shouldn’t restic tell me to forget the culprit snapshot? Or perhaps flag the culprit file as bad if the snapshot is to be kept?

I think these kinds of messages belong at a debug level; as a normal user I’d hope for a message like
“file /bla/bla in snapshot xxxx is corrupted, run --read-data to fix it” or “file /bla/bla in snapshot xxxx is corrupted, you should forget snapshot xxx”.
Does --read-data fix things? Or is it just a confirmation that the real data in the repo differs from the source?
Could it be fixed by re-fetching the corrupted file? (As far as I understand restic, this would generate a new snapshot, no?)
Does that mean the whole repo is corrupted? Or only one or some snapshots?

$ restic -r s3:https://us-east-1@s3.wasabisys.com/my-backup rebuild-index
repository 2c128c6e opened successfully, password is correct
counting files in repo
pack file cannot be listed 824981796608e9e4d16bd6534e461c304f7e4d876106fa4e56206381be698d0f: EOF
[35:54] 100.00%  132671 / 132671 packs
finding old index files
saved new indexes as [fa145cb3 0bc1df87 4e185daa 3fee5dd5 83d391d4 14614059 2558d313 ed0d127f de82203e f7111dc4 f15c6a1c 6750607b b2cae93a 8635cba1 b86af3cb cda9055e da279e96 6405aa99 08aca003 edce9582 9c9055e3 7447af9c c788189d 01edeb63 10a2f9ea 981cab99 492c24be 7a65bedc 9a395cab 97851de1 0757a523 b50fad91 a29bed50 7495e5d8 58f5af8b e7557c47 5d066599 ab9b2ebe a3dde579 bd455dc3 aa18e796 c4043df0 53c12dbc 0a6b256f a20178ad]
remove 1359 old index files

$ restic -r s3:https://us-east-1@s3.wasabisys.com/my-backup check
using temporary cache in /var/folders/r1/3sr1p9391ql86sbhkyzw8qk80000gn/T/restic-check-cache-920683911
repository 2c128c6e opened successfully, password is correct
create exclusive lock for repository
load indexes
check all packs
pack 82498179: not referenced in any index
1 additional files were found in the repo, which likely contain duplicate data.
You can run `restic prune` to correct this.
check snapshots, trees and blobs
error for tree 1b3c7305:
  tree 1b3c73055917aaae934bf5ec8ec2537c3bb7d60358f0661f8fc9830820180096 not found in repository
error for tree 0e02d4e1:
  tree 0e02d4e15d50157f88c1491d62c04cb8be5f75274daf15ce91f41f746e0f5fdd not found in repository
error for tree 3657ccc4:
  tree 3657ccc484e6b125a8a2ea3f1c03a7ddfb301aceeeae10316b683479706470ba not found in repository
error for tree e90739d7:
  tree e90739d7a4eeffd7e4968e0aa6d99802f6217867afe7083279f1acfd71281619 not found in repository
error for tree 183965da:
  tree 183965da7985686f59d30a60fd7b0e40ab0d8d0b2a1f93842e0090171d062988 not found in repository
error for tree 7bb16c02:
  tree 7bb16c022e926cade02ee53dee1f6c3a4da3eeb6bbc406ca3610adc99e273f46 not found in repository
error for tree 3440509d:
  tree 3440509dc76f90b922b6e0a42864a7b3a7d7ad00576d6c10fa62195c083e75f4 not found in repository
error for tree 03a48092:
  tree 03a48092c6e9d1225b2a3c19ead844abcb0738e8a893e8ae0f504e353e839b97 not found in repository
error for tree 8753b186:
  tree 8753b1863cdcfb4a5d91e2679c4010039c58d024a13da38871558c1da0bcf015 not found in repository
error for tree 1c8fdbb2:
  tree 1c8fdbb247b22a4c205c6863b2aeb59a046faa5e13b75bb908c71e46ca2fa63e not found in repository
error for tree 8cb3472f:
  tree 8cb3472fd490820980a5f6813bbcee91e078875a9fd316a02a3b29bdd8be3ab6 not found in repository
error for tree ca7447c6:
  tree ca7447c62e848a9de6c53737d24145d5598015ae43c9e9d8db97aa71eb4a1469 not found in repository
error for tree ec3a2b55:
  tree ec3a2b55fc2fcb291869ce17b35067566a11184af3f82d33432f42850f40667a not found in repository
error for tree ef7e8e0d:
  tree ef7e8e0ddb37be06e94330985018089d98f77b2682eeb07270f4457ff7e85174 not found in repository
error for tree a452188b:
  tree a452188b3f74111ec793e7812454c7ba6ffe35e9e84cf9ba572b5a2774f07645 not found in repository
error for tree 5e118e8e:
  tree 5e118e8e0d3cff06ce185fcde7af8e27c22488edb55f12d76a8464c22093d4b6 not found in repository
Fatal: repository contains errors

$ restic -r s3:https://us-east-1@s3.wasabisys.com/my-backup rebuild-index
repository 2c128c6e opened successfully, password is correct
counting files in repo
pack file cannot be listed 824981796608e9e4d16bd6534e461c304f7e4d876106fa4e56206381be698d0f: EOF
[29:38] 100.00%  132671 / 132671 packs
finding old index files
saved new indexes as [470f61f3 e9d66ca9 c73b8917 cfcf80ac 2acf1e1d ea50b481 b2ffb59b d4508542 242c83a8 9376b69a 945d6b70 d5603c37 6de3add4 18cc8ac4 2683aa09 fa3b9413 042d9f9c 0b52a567 b77d9800 c525e722 378a3fe9 86ec589b 8f6df71f 218385e0 747f7cbc c62d6fb6 32d7312e c862a140 e8b03ba6 c816f9c6 e524c1a7 e295e321 34a68b8e a528288f 6d4bcbec 28cdc531 42dc22ab a608494d ab154703 641b2832 77397e72 f174e0c3 bcd23b39 0a351469 82f1bf0e]
remove 45 old index files

I’m now in the process of trying a --read-data-subset 1/10, and it gives me the exact same list of errors.

$ restic  -r s3:https://us-east-1@s3.wasabisys.com/my-backup check --read-data-subset 1/10
using temporary cache in /var/folders/8z/rnjlwcm15pv1rbhq2q34mmr00000gn/T/restic-check-cache-058842757
repository 2c128c6e opened successfully, password is correct
create exclusive lock for repository
load indexes
check all packs
pack 82498179: not referenced in any index
1 additional files were found in the repo, which likely contain duplicate data.
You can run `restic prune` to correct this.
check snapshots, trees and blobs
error for tree 1b3c7305:
  tree 1b3c73055917aaae934bf5ec8ec2537c3bb7d60358f0661f8fc9830820180096 not found in repository
error for tree 0e02d4e1:
  tree 0e02d4e15d50157f88c1491d62c04cb8be5f75274daf15ce91f41f746e0f5fdd not found in repository
error for tree e90739d7:
  tree e90739d7a4eeffd7e4968e0aa6d99802f6217867afe7083279f1acfd71281619 not found in repository
error for tree 3657ccc4:
  tree 3657ccc484e6b125a8a2ea3f1c03a7ddfb301aceeeae10316b683479706470ba not found in repository
error for tree 183965da:
  tree 183965da7985686f59d30a60fd7b0e40ab0d8d0b2a1f93842e0090171d062988 not found in repository
error for tree 03a48092:
  tree 03a48092c6e9d1225b2a3c19ead844abcb0738e8a893e8ae0f504e353e839b97 not found in repository
error for tree 8753b186:
  tree 8753b1863cdcfb4a5d91e2679c4010039c58d024a13da38871558c1da0bcf015 not found in repository
error for tree 7bb16c02:
  tree 7bb16c022e926cade02ee53dee1f6c3a4da3eeb6bbc406ca3610adc99e273f46 not found in repository
error for tree 3440509d:
  tree 3440509dc76f90b922b6e0a42864a7b3a7d7ad00576d6c10fa62195c083e75f4 not found in repository
error for tree ca7447c6:
  tree ca7447c62e848a9de6c53737d24145d5598015ae43c9e9d8db97aa71eb4a1469 not found in repository
error for tree ef7e8e0d:
  tree ef7e8e0ddb37be06e94330985018089d98f77b2682eeb07270f4457ff7e85174 not found in repository
error for tree 8cb3472f:
  tree 8cb3472fd490820980a5f6813bbcee91e078875a9fd316a02a3b29bdd8be3ab6 not found in repository
error for tree ec3a2b55:
  tree ec3a2b55fc2fcb291869ce17b35067566a11184af3f82d33432f42850f40667a not found in repository
error for tree 1c8fdbb2:
  tree 1c8fdbb247b22a4c205c6863b2aeb59a046faa5e13b75bb908c71e46ca2fa63e not found in repository
error for tree 5e118e8e:
  tree 5e118e8e0d3cff06ce185fcde7af8e27c22488edb55f12d76a8464c22093d4b6 not found in repository
error for tree a452188b:
  tree a452188b3f74111ec793e7812454c7ba6ffe35e9e84cf9ba572b5a2774f07645 not found in repository
read group #1 of 13437 data packs (out of total 132670 packs in 10 groups)
[3:27:25] 23.73%  3188 / 13437 items

As you can see, 24% of the job took about 3.5 hours.
Since I already have the list of errors, can I safely stop this process?


#2

Unfortunately, yes, this means that restic tried to access data and was unable to find it. First, restic looks in the index files; if it can’t find the data there, it aborts. A new index can be built with restic rebuild-index, which instructs restic to inspect all files in the repo and build a new index of what data is available.
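
Conceptually, the index maps blob ids to the pack files that contain them, and rebuild-index regenerates that mapping by scanning every pack header. Here is a simplified sketch of that idea (illustrative only, not restic’s actual data structures):

```python
def rebuild_index(packs):
    """Build a blob-id -> pack-id mapping by scanning every pack's header.
    `packs` maps each pack id to the list of blob ids its header declares."""
    index = {}
    for pack_id, blob_ids in packs.items():
        for blob_id in blob_ids:
            index[blob_id] = pack_id
    return index

def lookup(index, blob_id):
    # restic aborts when a blob it needs is not present in the index
    if blob_id not in index:
        raise KeyError(f"blob {blob_id} not found in index")
    return index[blob_id]
```

After a rebuild, lookups succeed for every blob whose pack could be read, and fail for anything that lived only in an unreadable pack.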

If you have successfully run restic rebuild-index and restic check still complains about missing data, then the data is not there. I don’t have any idea how that happened. Aborting a backup (or even restic prune) should not lead to lost data, as restic’s code is written in a very conservative way. For example, restic prune only really removes data at the very end, once all rewritten data has been uploaded successfully. So when the process is aborted, the worst case is that you end up with duplicate data in the repo, which I think is far better than lost data.

There’s a Pull Request (which I still need to find time to review) that adds this capability here:

That’d be a good course of action, but it’s not implemented yet. Sorry about that.

And I agree with you: The user interface for restic check (and restic prune) is not great, we’ll improve that over time, in small steps.

Next steps

You can try re-running restic rebuild-index: it complained that at least one file could not be listed, and that file may contain the needed data:

pack file cannot be listed 824981796608e9e4d16bd6534e461c304f7e4d876106fa4e56206381be698d0f: EOF

This error may be related to a bug reported yesterday:

  • Can you check if the file 824981796608e9e4d16bd6534e461c304f7e4d876106fa4e56206381be698d0f is in the cache (somewhere below ~/.cache/restic) and compare its size to the size in the s3 backend?
  • Can you try removing restic’s cache (just delete ~/.cache/restic) and retry with restic rebuild-index? The cache will just be rebuilt if it isn’t there.
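
For the first bullet, cached pack files live under `data/<first two hex chars>/<pack id>` inside the per-repository cache directory, so the expected path can be computed like this (a sketch; the repo id is abbreviated here for illustration, the real directory name is the full repository id):

```python
from pathlib import Path

def cached_pack_path(cache_root: str, repo_id: str, pack_id: str) -> Path:
    # restic's cache layout: <cache-root>/<repo-id>/data/<first 2 hex chars>/<pack-id>
    return Path(cache_root).expanduser() / repo_id / "data" / pack_id[:2] / pack_id

pack = "824981796608e9e4d16bd6534e461c304f7e4d876106fa4e56206381be698d0f"
print(cached_pack_path("~/.cache/restic", "2c128c6e", pack))
```

Comparing `ls -l` on that path with the object size reported by the backend tells you whether the cached copy is truncated.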

I suspect you ran into the same bug as Matt, and that removing the cache will make it work again. I hope to have some time over the weekend to work on the bug itself; it’s very annoying.


#3

In fact, on macOS the cache is not in ~/.cache (there’s one on this Mac, but it has nothing to do with restic); it’s in ~/Library/Caches/restic.

Note also that I’m backing up three machines into the same repo to take advantage of deduplication (which works: I only use 600 GB where on CrashPlan I had 2 TB).

To answer your question: no, there’s no 824981796608e9e4d16bd6534e461c304f7e4d876106fa4e56206381be698d0f file in that cache.

I moved the restic cache directory aside (keeping it for reference if needed) and I’m in the process of running rebuild-index. I’ll then run a check and let you know the result.

About the cache, I also noted that for a --read-data, the cache folder is somewhere else entirely, as restic says it’s using /var/folders/…

I was wondering: would running
$ restic cat SnapshotID | grep TreeID
on each snapshot tell me which snapshot contains the culprit tree?


#4

Ah, you’re right, the cache location on OS X is not the same as on Linux.

It should be in a subdirectory; can you try running:

$ ls -al ~/Library/Caches/restic/2c128c6e*/data/82/824981796608e9e4d16bd6534e461c304f7e4d876106fa4e56206381be698d0f

True, that’s also a safety feature. Hm.

[quote=“Vartkat, post:3, topic:869”]
I was wondering: would running
$ restic cat SnapshotID | grep TreeID
on each snapshot tell me which snapshot contains the culprit tree?
[/quote]

In principle yes, that’s what the find command in the PR does. But you need to follow all the subtrees, which may be a bit tedious :wink:
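
Following the subtrees can be scripted. The sketch below is a hypothetical illustration: it assumes `restic cat blob <id>` prints a tree blob as JSON with a `nodes` list in which directory nodes carry a `subtree` id. The walker is kept separate from the loader so the traversal logic can also be fed fake data:

```python
import json
import subprocess

def restic_tree_loader(repo: str, tree_id: str) -> dict:
    # Assumption: `restic cat blob <id>` prints the decrypted tree blob as JSON.
    out = subprocess.run(["restic", "-r", repo, "cat", "blob", tree_id],
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out)

def find_trees(tree_id, wanted_prefixes, loader, path="/"):
    """Walk a snapshot's tree recursively; return the directory paths whose
    tree id matches one of the (possibly abbreviated) ids from `restic check`."""
    hits = []
    if any(tree_id.startswith(p) for p in wanted_prefixes):
        hits.append((path, tree_id))
    try:
        tree = loader(tree_id)
    except Exception:
        return hits  # this tree blob itself is unreadable/missing
    for node in tree.get("nodes", []):
        if node.get("type") == "dir" and node.get("subtree"):
            hits += find_trees(node["subtree"], wanted_prefixes, loader,
                               path.rstrip("/") + "/" + node["name"])
    return hits
```

Start it with the `tree` id printed by `restic cat <snapshot-id>` for each snapshot; any hit tells you which directory path references one of the missing trees.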


#5

For 824981796608e9e4d16bd6534e461c304f7e4d876106fa4e56206381be698d0f, the answer is no; the only file in 82/ is 82d91fc3d1eaa7e5670ce66314e43eb3f43801a7b90928d631ae36c74cf408a4

Here is the result of rebuild-index, then check:

$ restic  -r s3:https://us-east-1@s3.wasabisys.com/my-backup rebuild-index
repository 2c128c6e opened successfully, password is correct
counting files in repo
[1:34:27] 100.00%  132885 / 132885 packs
finding old index files
Save(<index/5ec641491b>) returned error, retrying after 706.47937ms: client.PutObject: unexpected EOF
saved new indexes as [d75eed32 b81267d2 dfbd054a ec8e127c 18459e46 a3d2df2c bb80c1ba 87ff26fa a3963e1e 0ebcd781 c716ce75 f488524f 2b2008ab a3ef61b8 05066f9d 3a4ab0da 78900c8f 438e0d6c 33e8ffaf 59f9a888 6ea2909e aa975a3a 8c624eb6 8a3f980d 46968c3f 4e494986 27d428aa a936c250 c7aa0320 7c7b7364 2a5a11d0 d6f11e09 3cd8d7c5 e69c26ff ab4c2b1a f80e9bb6 70adeef6 0362f71c 71613a00 5ec64149 69b8de79 28e75d2d 585c461a 8dcbb79d 6f2ff5be]
remove 57 old index files

$ restic  -r s3:https://us-east-1@s3.wasabisys.com/my-backup check
using temporary cache in /var/folders/8z/rnjlwcm15pv1rbhq2q34mmr00000gn/T/restic-check-cache-179128309
repository 2c128c6e opened successfully, password is correct
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
no errors were found

So it seems OK, and it really does look like a cache problem.
Thanks for the time spent on this!


#6

A bit of followup…

Restic has been running for more than 8 hours now. It used to take 9 to 25 minutes before…

Question: wasn’t the --cleanup-cache option supposed to clean up the cache I moved (i.e. ~/Library/Caches/restic)?


#7

I think this is normal after wiping out the local cache. Restic has to get everything from the remote, which, especially with slow networks or backends, can take a while. (My own Internet speed is very slow and it’s not unusual for backup operations to take a day or more. I think it’s OK for backups to be slow, as long as they’re consistent.)

I believe this only deletes cache files/folders that are old and haven’t been accessed in a while (but someone could verify that).


#8

restic will rebuild the cache, so it’ll pull down the files it needs. That may take some time, but more than eight hours seems excessive. Do you maybe start restic differently, or with different paths?

If congestion on the line is the problem (downloads become very slow while restic uploads data in parallel), you could try running restic ls latest before a backup. That will also populate the cache, but won’t upload any data while doing so.

Each repository has its own cache directory below ~/Library/Caches/restic named after the repository ID (just a random hex string). When restic accesses a repository and uses the cache, the timestamp of the directory is updated. The --cleanup-cache switch makes restic delete directories in the cache directory which have a timestamp that is older than (I think) 90 days, so cache directories for repositories not accessed in a long time are removed.
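
As a rough illustration of that rule (the exact threshold and mechanics in restic may differ), a per-repository cache directory would count as stale when its modification time is older than the cutoff:

```python
import time
from pathlib import Path

CUTOFF = 90 * 24 * 3600  # ~90 days in seconds, the threshold mentioned above

def stale_cache_dirs(cache_root, now=None):
    """Return per-repository cache subdirectories whose mtime is older than CUTOFF."""
    now = time.time() if now is None else now
    root = Path(cache_root).expanduser()
    return [d for d in root.iterdir()
            if d.is_dir() and now - d.stat().st_mtime > CUTOFF]
```

With --cleanup-cache, directories like these would be deleted; a directory you moved out of the cache root is never touched, since it is no longer under it.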

Within the cache directory of a single, actively used repository, the only cleanup happens during prune: when data is removed from the repository, it is also removed from the cache.

I hope this makes it a bit clearer how restic’s cache works :slight_smile:


#9

I’ve just re-read everything in this thread and I’m fairly certain I know what has happened, based on the restic output you pasted:

  • The first run of rebuild-index printed an error that a file could not be listed, the error message was just EOF: pack file cannot be listed 824981796[...]: EOF
  • This file was not included in the indexes written by rebuild-index
  • The run of check complained that trees could not be found in the index: tree 1b3c7305[...] not found in repository
  • I’m certain that the missing trees were contained in the file that could not be listed, because the next run of rebuild-index and subsequent check completed without errors

I got distracted by the EOF error message, which happened to be the same as in the issue Matt reported, where it was caused by a truncated file in the cache. But now I think the reason here was a transient network error.

I’ve looked at the code and it turned out that listing files in the repo (reading and decrypting the header of each file) is not retried on error, in contrast to almost all other backend operations! For rebuild-index, I think restic should retry this, and abort if it cannot read a file after some retries.
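
The retry pattern used by the other backend operations can be sketched generically like this (an illustration of the idea, not restic’s actual code):

```python
import time

def with_retries(fn, attempts=5, base_delay=0.5):
    """Call fn, retrying with exponential backoff on I/O errors;
    re-raise the error after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except OSError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)
```

Wrapping the per-pack header read in something like this would let rebuild-index survive a transient EOF instead of silently skipping the pack.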

So, long story short: There’s a bug in restic: listing the contents of files is not retried.

I hope to find some time over the weekend to work on this; it’d be really helpful if somebody could file a GitHub issue based on this forum post (so we can track it more easily).

Thanks for your patience!


#10

Final… Tada!

Files:        1647 new,   133 changed, 284727 unmodified
Dirs:            0 new,     1 changed,      0 unmodified
Added:        1.656 GiB

processed 286507 files, 227.973 GiB in 6:46:38
snapshot e4154d6b saved

Next is

> Files:           0 new,    39 changed, 286468 unmodified
> Dirs:            0 new,     1 changed,     0 unmodified
> Added:      129.146 MiB
> 
> processed 286507 files, 227.973 GiB in 23:52
> snapshot b16bcd93 saved

Next is

Files:           1 new,    18 changed, 286489 unmodified
Dirs:            0 new,     1 changed,      0 unmodified
Added:       61.950 MiB

processed 286508 files, 227.970 GiB in 10:57
snapshot 1438e1d9 saved

So what happened? Silly me! A diff shows me that the 1.6 GiB are files from…
the old cache dir I moved onto the desktop.
So the duration was logical and had nothing to do with the cache needing to be rebuilt.


#11

This issue is now tracked as


#12

Turns out, this is indeed the same issue: the local cache on your machine somehow had a truncated file in it. Thankfully it’s easy to detect, and I’ve pushed a fix; see https://github.com/restic/restic/pull/1946