Fatal errors and error messages

#1

I have a repository that clients back up to through rest-server. It’s on a CentOS 7 machine with XFS for the data storage, a Dell T130 with an H330 RAID controller and the disks configured as RAID1.

It was fine when I ran check on the 14th of February, but yesterday the user alerted me that the backup runs were showing errors. So I ran a check:

$ restic -r local.folder --no-cache check
enter password for repository:
repository b16bf980 opened successfully, password is correct
create exclusive lock for repository
load indexes
check all packs
pack e163c9de: not referenced in any index
pack 3884bf31: not referenced in any index
pack 265abf15: not referenced in any index
pack 593fa10d: not referenced in any index
pack 56b696ff: not referenced in any index
pack 2c9d5e17: not referenced in any index
pack 97764abd: not referenced in any index
pack a7c5bf06: not referenced in any index
pack 98322d23: not referenced in any index
pack dfbdecbb: not referenced in any index
pack 08baa8fe: not referenced in any index
pack d0fc6658: not referenced in any index
pack 9f98b1d6: does not exist
pack 11975d72: does not exist
12 additional files were found in the repo, which likely contain duplicate data.
You can run `restic prune` to correct this.
check snapshots, trees and blobs
error for tree a67291cc:
  tree a67291cc: file "login.keychain-db" blob 0 size could not be found
  tree a67291cc, blob 0dd95604: not found in index
error for tree 851d61ab:
  tree 851d61ab: file "dynamic-text.dat" blob 0 size could not be found
  tree 851d61ab, blob 9eb5501a: not found in index
error for tree fff5dc21:
  tree fff5dc21: file "com.apple.finder.plist" blob 0 size could not be found
  tree fff5dc21: file "com.apple.systempreferences.plist" blob 0 size could not be found
  tree fff5dc21: file "knowledge-agent.plist" blob 0 size could not be found
  tree fff5dc21, blob 896f225a: not found in index
  tree fff5dc21, blob d4f75aae: not found in index
  tree fff5dc21, blob 91b5c27a: not found in index
error for tree 6d805ae8:
  tree 6d805ae8: file "knowledgeC.db" blob 1 size could not be found
  tree 6d805ae8: file "knowledgeC.db" blob 4 size could not be found
  tree 6d805ae8, blob d44d81e5: not found in index
  tree 6d805ae8, blob da9ac1a3: not found in index
error for tree 9d2fcdd5:
  tree 9d2fcdd5: file "appList.dat" blob 0 size could not be found
  tree 9d2fcdd5, blob 80511e29: not found in index
error for tree 43d9c396:
  tree 43d9c396: file "data.data" blob 0 size could not be found
  tree 43d9c396, blob 5d6b408c: not found in index
error for tree da43960d:
  tree da43960d: file "MicrosoftRegistrationDB.reg" blob 2 size could not be found
  tree da43960d: file "MicrosoftRegistrationDB.reg" blob 4 size could not be found
  tree da43960d: file "com.microsoft.Office365V2.plist" blob 0 size could not be found
  tree da43960d, blob df37ae02: not found in index
  tree da43960d, blob eceb1afa: not found in index
  tree da43960d, blob 18bf7106: not found in index
error for tree 1b1a9e35:
  tree 1b1a9e35: file "com.apple.preview.sfl2" blob 0 size could not be found
  tree 1b1a9e35, blob 92c1a75e: not found in index
error for tree 857282a5:
  tree 857282a5: file "ckks_analytics.db-wal" blob 6 size could not be found
  tree 857282a5, blob 308938a5: not found in index
error for tree 84064cea:
  tree 84064cea: file "com.microsoft.Office365V2.plist" blob 0 size could not be found
  tree 84064cea, blob 18bf7106: not found in index
error for tree a484d0fb:
  tree a484d0fb: file "appList.dat" blob 0 size could not be found
  tree a484d0fb, blob 80511e29: not found in index
error for tree 4bdf3657:
  tree 4bdf3657: file "knowledgeC.db" blob 1 size could not be found
  tree 4bdf3657: file "knowledgeC.db" blob 4 size could not be found
  tree 4bdf3657, blob d44d81e5: not found in index
  tree 4bdf3657, blob da9ac1a3: not found in index
error for tree 1beb2895:
  tree 1beb2895: file "ckks_analytics.db-wal" blob 6 size could not be found
  tree 1beb2895, blob 308938a5: not found in index
error for tree 95a0f74c:
  tree 95a0f74c: file "Recents-wal" blob 4 size could not be found
  tree 95a0f74c, blob 5fa9f41f: not found in index
error for tree f313ef1e:
  tree f313ef1e: file "com.apple.routined.plist" blob 0 size could not be found
  tree f313ef1e, blob 8bc4a3d5: not found in index
error for tree 4fe277f5:
  tree 4fe277f5: file "data.data" blob 0 size could not be found
  tree 4fe277f5, blob 76c03260: not found in index
error for tree 68c4a7f9:
  tree 68c4a7f9: file "data.data" blob 0 size could not be found
  tree 68c4a7f9, blob 5101f029: not found in index
error for tree e22a2c4b:
  tree e22a2c4b: file "data.data" blob 0 size could not be found
  tree e22a2c4b, blob 00c62455: not found in index
error for tree bb55d9cc:
  tree bb55d9cc: file "apple-device-log-20181129-1605.log" blob 0 size could not be found
  tree bb55d9cc, blob ba49b23b: not found in index
error for tree 4b028f6b:
  tree 4b028f6b: file "apple-device-log-20181129-1223.log" blob 0 size could not be found
  tree 4b028f6b, blob 357256ab: not found in index
Fatal: repository contains errors

In an attempt to check the disk system I first ran:

# xfs_logprint /dev/mapper/vg_data2-lv_data2 |grep -i -C10 error
xfs_logprint: /dev/mapper/vg_data2-lv_data2 contains a mounted and writable filesystem
xfs_logprint: unknown log operation type (2f00)
Bad data in log
Oper (91): tid: d23cf92c  len: 176  clientid: TRANS  flags: none
INODE CORE
magic 0x494e mode 0100600 version 3 format 2
nlink 1 uid 1000 gid 1000
atime 0x5c7d22ff mtime 0x5c7d22ff ctime 0x5c7d22ff
size 0x39d3f3 nblocks 0x3f0 extsize 0x0 nextents 0x1
naextents 0x0 forkoff 36 dmevmask 0x0 dmstate 0x0
flags 0x0 gen 0x5220332a
Oper (92): tid: d23cf92c  len: 16  clientid: TRANS  flags: none
EXTENTS inode data
Oper (93): tid: 0  len: 0  clientid: ERROR  flags: START COMMIT END
LOCAL attr data

============================================================================
cycle: 8        version: 2              lsn: 8,1719745  tail_lsn: 8,1719403
length of Log Record: 512       prev offset: 1719681            num ops: 2
uuid: 8d909a13-9a97-489c-9a7f-50731380be4e   format: little endian linux
h_size: 32768
----------------------------------------------------------------------------
Oper (0): tid: d23cf92c  len: 48  clientid: TRANS  flags: none
**********************************************************************
* ERROR: data block=1719745                                           *
**********************************************************************

Looks to me like there’s some kind of error in at least one data block. So I unmounted the partition and ran xfs_repair, yielding this:

# xfs_repair /dev/mapper/vg_data2-lv_data2
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

It’s the first time I’ve run an XFS repair utility (and the first time I’ve used XFS), and I’m surprised it doesn’t tell me that something is wrong, considering the xfs_logprint output above did. I simply can’t tell from the output above whether it found and fixed any errors. I also looked for a lost+found folder but couldn’t find one, so I guess no “disconnected inodes” were found (phase 6). Again, hard to know.

After that, I remounted the disk and ran restic’s check again, but the output is the same.

At this point I don’t know what might have caused this corruption. The server is rather new so I don’t think it’s bad disks, and even if one disk is bad the RAID should have dealt with that.

According to the iDRAC on the machine, there’s nothing wrong with the storage or anything else, everything in terms of health looks good. The server is of course using ECC memory.

The client sending the backups should at most be able to cause a complete interruption in the process of running the backup command (it never does anything else), but AFAIK this can’t corrupt the repo and will at most result in duplicate data having been uploaded.

The rest-server might of course have screwed something up, although I doubt it.

So, any ideas what might cause something like this? After all, different causes of corruption tend to produce different symptoms.

Do you think there’s a way to repair it, or should I just erase it and start over with a fresh repo?

Lastly, one thing that strikes me is that restic doesn’t give much information in terms of “this is how bad it is” and “this is what you should do next”. It just says “fatal”; it should probably say “it’s beyond repair” when it is. Perhaps the error messaging could be improved. What do you think?


#2

I had a similar problem a couple of months ago. Check the thread to see if it can help you.


#3

Thanks @Dj0k3. I’m going to try the restic find --tree command to see whether the errors all belong to the same snapshot or to a lot of different ones.


#4

Run restic rebuild-index ASAP.

If the index says that a specific tree/data blob is present in the repository, the client believes this and will not reupload it. Rebuilding the index removes the bad entries and then it is possible that future backup runs can heal some of the damage by restoring the missing tree/data blobs.

If you don’t do this, backup clients will continue to deduplicate the missing objects and this means that new backups can also be bad.
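To make the reasoning above concrete, here is a toy model of the deduplication decision. It is only a sketch: the function names and the dict/set layout are made up for illustration and are not restic's real internals. It shows why a stale index blocks healing and why a rebuilt index allows it.

```python
# Toy model only: names and data layout are hypothetical, not restic's internals.

def backup(blobs, index, packs):
    """Upload only blobs the index does not already list; return uploaded IDs."""
    uploaded = []
    for blob_id, data in blobs.items():
        if blob_id in index:
            continue  # the client trusts the index and skips the upload
        packs[blob_id] = data
        index.add(blob_id)
        uploaded.append(blob_id)
    return uploaded


def rebuild_index(packs):
    """Recreate the index from what is actually stored in the repository."""
    return set(packs)


source = {"0dd95604": b"keychain", "9eb5501a": b"text"}
packs = {"9eb5501a": b"text"}            # the pack holding 0dd95604 is gone
stale_index = {"0dd95604", "9eb5501a"}   # but the index still lists that blob

skipped = backup(source, stale_index, packs)   # nothing uploaded, damage stays
fresh_index = rebuild_index(packs)             # the bad entry disappears
healed = backup(source, fresh_index, packs)    # the missing blob is uploaded again

print(skipped, healed)
```

Note that the second run can only heal blobs whose content still exists unchanged on the client; blobs from files that have since changed or been deleted stay missing.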

The lost+found directory is pretty much always at the root of the filesystem (where it’s mounted); is that where you looked? If you don’t have one, you should make one with mklost+found ASAP. Filesystem repair tools are not going to create a directory and allocate space for directory entries when the state of the filesystem is in question. mklost+found creates the directory and preallocates some space for file entries so that the recovery tools have a known-good place to put stuff.

This definitely looks like a filesystem/disk/RAID-level error and nothing to do with restic.

Note that the restic find --tree approach may fail with useless output because restic find bails when a tree can’t be loaded. I have a PR to fix this; if a tree can’t be loaded, the patch will tell you which snapshot it was searching on behalf of, and then proceed. This implies that you can forget that snapshot to fix the repository, as that will cause the missing object to cease to be referenced.
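Before deciding which snapshots to forget, it may help to condense the check output into a list of distinct missing blobs and the trees that reference them. Below is a rough sketch that parses the two line formats seen in the check output in the first post; the function name and regular expressions are my own, and other restic versions may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns match the line formats shown in the check output of the first post.
MISSING_BLOB = re.compile(r"^\s*tree (\w+), blob (\w+): not found in index$")
MISSING_PACK = re.compile(r"^pack (\w+): does not exist$")


def summarize_check(output):
    """Return ({blob_id: {tree_ids}}, [missing_pack_ids]) from check output."""
    trees_by_blob = defaultdict(set)
    missing_packs = []
    for line in output.splitlines():
        m = MISSING_BLOB.match(line)
        if m:
            trees_by_blob[m.group(2)].add(m.group(1))
            continue
        m = MISSING_PACK.match(line)
        if m:
            missing_packs.append(m.group(1))
    return dict(trees_by_blob), missing_packs


sample = """\
pack 9f98b1d6: does not exist
error for tree 9d2fcdd5:
  tree 9d2fcdd5: file "appList.dat" blob 0 size could not be found
  tree 9d2fcdd5, blob 80511e29: not found in index
error for tree a484d0fb:
  tree a484d0fb: file "appList.dat" blob 0 size could not be found
  tree a484d0fb, blob 80511e29: not found in index
"""

missing_blobs, missing_packs = summarize_check(sample)
for blob, trees in sorted(missing_blobs.items()):
    print(blob, "is referenced by", len(trees), "tree(s)")
print("missing packs:", missing_packs)
```

In the output above, some blobs (80511e29 and 18bf7106, for example) are reported by more than one tree, so the number of distinct missing blobs is smaller than the number of error lines suggests.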


#5

Tiny contribution to the thread: as far as I know, the lost+found folder does not exist by default on XFS (in contrast to ext2/ext3/ext4); it is created on demand by xfs_repair: https://www.systutorials.com/docs/linux/man/8-xfs_repair/#lbAH
