Backup interrupted: understanding "Fatal: repository contains errors"

Hi everyone, I’m fairly new to restic and I’m having a weird problem I couldn’t find an answer to.

I was connected to an SSH server and launched a restic backup onto its external HDD when my machine had a power outage and I lost the SSH connection, which of course meant that the restic backup was interrupted too (stupid me for not using screen).
Once the power came back I connected again and restarted the backup, which finished. I then ran a check, which found some extra data and suggested running a prune command to fix it.
I also ran a check --read-data, which terminated like this:

Pack ID does not match, want 364ff35c, got 424d0a45
pack eca4ebb2 contains 1 errors: [blob 0: ciphertext verification failed]
[3:05:39] 100.00%  34342 / 34342 items
duration: 3:05:39
Fatal: repository contains errors

I’m not sure if there’s an actual issue with the repo or I’m just misinterpreting this message.
I’ve seen others with a similar issue, but mostly their problem was that the check could not complete, while in my case the check seems to complete but still says the repository contains errors.

Any suggestion?

Most likely, pack eca4ebb2 was in the process of being written when the power was interrupted. Try temporarily moving this pack outside of the repository and re-running check. If there are no errors, you can safely delete it.

Thanks for the suggestion, but I’m not sure how I should go about moving a pack outside the repository.

ssh to the server holding the files and mv that pack somewhere else, like your home directory or /tmp (provided /tmp doesn’t get cleaned automatically on boot).
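
If it helps, here is a rough sketch of those steps, assuming the repository lives at /srv/restic-repo (a made-up path, adjust to yours). Pack files are stored under data/<first two characters of the ID>/<full 64-character ID>:

REPO=/srv/restic-repo
find "$REPO/data" -name 'eca4ebb2*'            # locate the full filename of the suspect pack
mkdir -p ~/quarantine
mv "$REPO"/data/ec/eca4ebb2* ~/quarantine/     # move it out of the repository, don't delete it yet
restic -r "$REPO" check                        # re-run the check without that pack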

I’m not sure exactly where the packs are located, I believe they are inside the repository’s data/ directory, correct? In that case, I moved eca4ebb2[…] away from the repo, launched a rebuild-index and then a check --read-data, which doesn’t seem to be too happy.

error for tree a373a20a:
tree a373a20a: file “IMG_20190330_143200.jpg” blob 1 size could not be found
tree a373a20a: file “IMG_20190330_143206.jpg” blob 1 size could not be found
tree a373a20a, blob c474ae0f: not found in index
tree a373a20a, blob d1dc9234: not found in index

Am I doing this wrong?

Hmm, then based on the error message, a single blob in that pack got damaged somehow. I would strongly suggest running a memory test on your system. This doesn’t sound like a symptom of an interrupted upload.

If you run sha256sum -b on that pack file, the checksum should match the filename. Is that correct?
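
For reference, something like this should show it (the repository path here is just an example; the pack’s full name is the file’s basename):

sha256sum -b /srv/restic-repo/data/ec/eca4ebb2*    # the printed hash should equal the filename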

Yes, if I run sha256sum -b on the pack file the checksum matches the filename. You’re saying this isn’t a symptom of an interrupted upload, but maybe there has been a misunderstanding in my post. I wasn’t uploading from my machine to the SSH server; I was backing up the SSH server itself onto an external HDD that is connected to it. The power loss was on my machine, which terminated the SSH connection that was running restic on the server.

Either way, the pack’s checksum matches but the blob contained in the pack doesn’t match its checksum. If the pack’s data had been corrupted during or after being written to disk, the pack’s checksum wouldn’t match either.

The most likely scenario is that the blob’s data or the blob’s checksum was corrupted in-memory before being written. This is why I’m suggesting running a memory test on the system.

I’ll run a check.
In the meantime, can you detail how exactly you got to this conclusion? I believe I’m missing some knowledge about how restic works.

Each pack file contains multiple pieces of data. These can be file contents (blobs) or directory listings (trees). The pack contains a SHA-256 checksum for each such bit of data it contains.

The pack itself is then hashed with SHA-256 and the pack is saved under this filename.

In your situation, you have a pack whose checksum matches its filename, but an item within the pack whose checksum does not match the checksum recorded in the pack.

This means that either the data object or its checksum got corrupted somehow. However, since the pack checksum is valid, the pack’s checksum must have been computed after the damage was already done. If you flip a bit somewhere in the pack data after the pack’s checksum has been computed, neither the damaged object’s checksum nor the pack’s checksum would match.

The most likely way for this to happen is for the data chunk or its checksum to be damaged either in memory or during writing the temporary pack file. Then when restic computes the checksum of the pack to determine its filename, the damage is accounted for in the checksum and so the pack’s checksum matches.
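
As a toy illustration of that ordering (plain shell, nothing restic-specific): whatever bytes are present at the moment the outer checksum is computed are exactly what the filename will reflect, so damage that happened earlier is invisible to it.

printf 'blob-data' > item
sha256sum item                                  # "inner" checksum, recorded alongside the data
printf 'X' | dd of=item bs=1 count=1 conv=notrunc 2>/dev/null   # damage happens before packing
cat item > pack
mv pack "$(sha256sum pack | cut -d ' ' -f1)"    # "outer" checksum becomes the filename
sha256sum *                                     # the filename still matches its contents,
                                                # but the inner checksum recorded above is now stale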

This is why I suspect a memory issue. It’s also possible that the disk holding the temporary directory experienced some failure, but I don’t know if restic computes the pack’s checksum as it’s being written out, or if the pack file is read back in after being finalized to compute the final checksum.

Some final notes:

  • Restic pushes hardware pretty hard. It has uncovered many cases of failed memory/disks.
  • This could not be the result of restic terminating. Restic assembles pack files in a temporary location and then moves them where they go using the SHA-256 checksum as the pack’s destination filename. Terminating restic would either leave a dangling temporary file (outside of the repository) or a completely-written or partially-written pack file within the repository. There are no other options. If restic did leave a partially-written pack file then its SHA-256 checksum would not match its filename, and we have already verified that it does.
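
A rough shell sketch of that write-then-rename pattern (this is not restic’s actual code; the repository path and the blob files are made up for illustration):

REPO=/srv/restic-repo                           # assumed repository path
tmp=$(mktemp "$REPO/tmp-pack.XXXXXX")           # the pack is assembled in a temporary file first
cat blob1 blob2 header > "$tmp"                 # stand-in for restic writing blobs plus the pack header
id=$(sha256sum "$tmp" | cut -d ' ' -f1)         # checksum of the finished pack
mkdir -p "$REPO/data/${id:0:2}"
mv "$tmp" "$REPO/data/${id:0:2}/$id"            # the rename is the last step, so a killed process
                                                # leaves only the temporary file behind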

Thank you very much for the explanation, this is all very interesting.
Yesterday and today I could finally run a RAM test and disk checks on this machine. The RAM test came back fine.

The disk test, however, showed something strange. But first, it might be worth giving some context. The machine I’m referring to has 3 HDDs:

  • Internal SATA WD 750 GB
  • Internal SATA Seagate 1 TB
  • USB 2.0 WD 1 TB

The external WD 1 TB is where I’m backing up data located on the Seagate 1 TB, while the operating system (Debian 8.11) runs on the WD 750 GB. The external WD drive is an NTFS drive. Why NTFS? Well, because I had forgotten it was :slight_smile: which is good, so now I’ll remember to convert it to ext4. All the other disks are ext4.

I ran fsck on all the ext4 disks and ntfsfix on the NTFS disk; no errors there. I ran smartctl -H on all drives, no problem apart from the Seagate 1 TB, which returned the following:

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-10-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
184 End-to-End_Error        0x0032   092   092   099    Old_age   Always   FAILING_NOW 8
190 Airflow_Temperature_Cel 0x0022   056   042   045    Old_age   Always   In_the_past 44 (Min/Max 19/44 #1934)

I checked the meaning of End-to-End_Error and it seems to be a pretty serious issue. However, I also found posts from people complaining about this error with brand new Seagate drives, and someone posted a reply from Seagate which says:

Our SMART values are proprietary and do not conform to the industry standard. That is why 3rd party tools cannot correctly read our drives.

OK, so I went to the Seagate website, found SeaTools Enterprise Edition for Linux, and ran a test with that; here is the output:

Drive /dev/sg0 does not support DST - generic short test will be run
Starting 100 % Generic Short Test on drive /dev/sg0 (^C will abort test)
	-Starting 30 second sequential reads from block 0 on drive /dev/sg0
	-Starting 30 second sequential reads to end of disk on drive /dev/sg0
	-Starting 30 second random reads on drive /dev/sg0
	-Starting 30 second random seeks on drive /dev/sg0
Generic Short Test PASSED on drive /dev/sg0

It seems OK, which makes me quite confused :thinking: Is there anything I’m missing?

Hi there, I just wanted to follow up on this issue, in case there is anything else that might be worth checking that could explain the repository containing errors.

Can you check that the sha256sum of the pack 364ff35c matches its content? This could be an incomplete pack file.

Regarding error for tree a373a20a: In case you still have the files which are listed in the error message, you can try to start a backup run with the --force flag that covers the affected files. restic will then reread the file contents and add the missing blobs back to the repository.
That is, the recovery flow is like this: move the damaged pack files somewhere else, rebuild the index so these pack files are forgotten, then back up the affected files again.
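
Sketched out, under the assumption that the repository lives at /srv/restic-repo and that /path/to/affected/files is a placeholder for whatever directories contain the files listed in the error:

REPO=/srv/restic-repo
mkdir -p ~/quarantine
mv "$REPO"/data/ec/eca4ebb2* ~/quarantine/                   # 1. move the damaged pack out of the repo
restic -r "$REPO" rebuild-index                              # 2. rebuild the index so that pack is forgotten
restic -r "$REPO" backup --force /path/to/affected/files     # 3. reread the files and re-add the missing blobs
restic -r "$REPO" check --read-data                          # verify the repository afterwards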


I did check, and the sha256sum of the pack matches the name.

I honestly don’t care much about the backup in this case because I don’t have many snapshots, so I can just create a new repo and that would be fine for me. Still, I would like to understand why this happened when apparently it shouldn’t have; the only anomaly I experienced was the loss of power on my PC, which was SSHed into the server doing the backup and shouldn’t be the issue.

smartmontools and vendor SMART tools cannot be trusted to confirm a disk is healthy based on just their overall “health” result. I’ve replaced hundreds of broken disks - some very broken! - and only seen an overall “health” failure twice!

Ask Seagate’s software what the SMART values are, and if it also lists VALUE for that attribute as lower than THRESH, then consider the drive faulty.
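
For comparison, smartctl can also print the full attribute table and logs rather than just the overall verdict (the device name below is only an example):

smartctl -A /dev/sdb        # attribute table: VALUE, WORST, THRESH and RAW_VALUE per attribute
smartctl -x /dev/sdb        # extended report, including the error log and self-test results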

Personally, for those figures, I’d trust the RAW_VALUE and assume the disk is failing.