Ciphertext verification failed during restore, `restic check` found no errors

I am a new user to Restic. I have been reading the docs and have begun transitioning my backups to Restic (from CrashPlan).

This past weekend was my first time testing restore on an entire snapshot. I was surprised to see two files fail with the following error message:
ignoring error for /path/to/file: decrypting blob <file hash> failed: ciphertext verification failed

I ran restic check on the repository both before and after the restore; no errors were found either time.

It seems Restic made a best effort to restore the files anyway, but the files are indeed different. The file sizes are smaller than the originals, and obviously the hashes don’t match.

If I had to guess, this happened because the first time I tried to back up this host, the process was interrupted by a power outage. After power was restored, I performed the backup operation again, then did a check and prune. I did not want to throw away the repository and re-init, because I had snapshots from other hosts already saved in it.

So my questions are:

  1. Why does restic check not catch this?
  2. Is there a way, other than restoring, to detect errors like this in a repository?
  3. Now that this damage has been detected, can it be repaired or do I need to re-init the repository?

I am also in the process of backing up 12TB+ to B2 using Restic, so this raises concerns about whether that data is safe and the repository is trustworthy. I can’t guarantee that backups will never be interrupted by power or network outages.


Hey, welcome to the forum!

restic check verifies the basics by default, meaning the repository structure and that all referenced files are present. It can happen that files stored in the repo (on disk, in the cloud, …) are modified after the fact. We’ve seen a few cases of broken hardware that restic detected this way.

Since reading all the files and checking their hashes is expensive, restic doesn’t do it by default. You can run restic check --read-data, which will download and read all data, and will complain about the broken file.

If it’s just one file in the repo, you can also download/copy that single file and run sha256sum on it. If the hash is different from the file name, the file has been modified.

Can you try that and report back?
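The manual check relies on a repository invariant: every file under data/ is named after the SHA-256 of its contents. Here is a small sketch of that check as a shell function; the helper name and the /tmp paths are made up for illustration, and a scratch file stands in for a real pack file:

```shell
# Sketch: a repo file is healthy if the SHA-256 of its contents equals
# its file name. verify_pack and the paths below are hypothetical.
verify_pack() {
    f="$1"
    want="$(basename "$f")"
    got="$(sha256sum "$f" | cut -d' ' -f1)"
    if [ "$got" = "$want" ]; then
        echo "OK: $f"
    else
        echo "MODIFIED: $f (want $want, got $got)"
    fi
}

# Demo against a scratch file named after its own hash:
mkdir -p /tmp/fakerepo/data
printf 'hello\n' > /tmp/fakerepo/scratch
h="$(sha256sum /tmp/fakerepo/scratch | cut -d' ' -f1)"
mv /tmp/fakerepo/scratch "/tmp/fakerepo/data/$h"
verify_pack "/tmp/fakerepo/data/$h"
```

Against a real repository you would point the function at the suspect file under data/ instead of the scratch file.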


That’s such a great reply. Clear, concise and to the point, not to mention very informative - perfect example of a --read-data use case!

The blob ID in the error message begins 2857a16.
The sha256sum of the original file begins 92e2528.

Running find /srv/array0/restic_backup/ -iname '2857a16*' -o -iname '92e2528*' returns no hits on either.

Both affected files are rather large (2.8G and 8.1G), so I suppose the files might have been broken into multiple smaller blobs?

I tried restic --no-lock -r /srv/array0/restic_backup/ cat blob 2857a160da9758723af784bc7fb5a90d48b553872f68246e4e38fff3409813c7, but it complains that the repository is locked (by the restic check --read-data I’m running, which is about 25% complete so far; I’ll post the result when it completes).

Since the initial backup was prematurely interrupted by a power outage, is it possible that the upload of these files was in-progress when the connection was severed? Do files get written directly to the repo, or are they written to a temporary location and then atomically moved into the repo only once the upload is successful?

The output of restic cat on the first blob it complained about.

adam@tardis:~$ restic --no-lock -r /srv/array0/restic_backup/ cat blob 2857a160da9758723af784bc7fb5a90d48b553872f68246e4e38fff3409813c7
enter password for repository: 
password is correct
decrypting blob 2857a160da9758723af784bc7fb5a90d48b553872f68246e4e38fff3409813c7 failed: ciphertext verification failed
github.com/restic/restic/internal/repository.(*Repository).loadBlob
	src/github.com/restic/restic/internal/repository/repository.go:162
github.com/restic/restic/internal/repository.(*Repository).LoadBlob
	src/github.com/restic/restic/internal/repository/repository.go:600
main.runCat
	src/github.com/restic/restic/cmd/restic/cmd_cat.go:175
main.glob..func3
	src/github.com/restic/restic/cmd/restic/cmd_cat.go:24
github.com/restic/restic/vendor/github.com/spf13/cobra.(*Command).execute
	src/github.com/restic/restic/vendor/github.com/spf13/cobra/command.go:698
github.com/restic/restic/vendor/github.com/spf13/cobra.(*Command).ExecuteC
	src/github.com/restic/restic/vendor/github.com/spf13/cobra/command.go:783
github.com/restic/restic/vendor/github.com/spf13/cobra.(*Command).Execute
	src/github.com/restic/restic/vendor/github.com/spf13/cobra/command.go:736
main.main
	src/github.com/restic/restic/cmd/restic/main.go:69
runtime.main
	/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.9.2/share/go/src/runtime/proc.go:195
runtime.goexit
	/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.9.2/share/go/src/runtime/asm_amd64.s:2337

I don’t see anything enlightening there. I guess I don’t know how to find/check the single file that has the problem.

The output of the restic check --read-data:

create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
Pack ID does not match, want 9b5a3cf4, got a84e38c1
[1:42:40] 100.00%  76784 / 76784 items
duration: 1:42:40
Fatal: repository contains errors

Other than the Pack ID mismatch, which I don’t even know is related, I’m not sure there’s any new useful debugging information here; just that my repository contains errors.

I feel like since the original error message told me which files had trouble, and specifically the blob IDs that failed verification, there should be a straightforward way to find and replace the affected blob(s) with the correct content. Maybe a forced re-insert.

But to find this out, one would have to run restic check --read-data, right? Since his plain restic check runs reported no errors.

@moritzdietz

As @fd0 wrote, a regular check does basic checks only; it doesn’t read through all the data in the repository to verify that no bits in it were changed.

However, you can run check --read-data to tell restic to also read through all the data in the repository and verify its contents, which it does by comparing the checksum of each file’s contents against the file’s name. That way you can detect whether a file’s content has changed: if it has, its checksum will no longer match its filename.

But what @fd0 suggested is that, to check a specific file, you can download it, checksum it locally, and compare the checksum with its name. That’s effectively the same thing check --read-data does, but for that specific file only, so you don’t have to run a check on the entire repository just to verify one file in it.

Does that help/make sense (I’m not sure what your actual question was)?

@rawtaz: What I was trying to get at is that you have to first know which file is broken in order to download it and compare hashes with the original file/s.

As described here

So what I concluded was: to know which file is affected, you have to run restic check --read-data, then download the file and check it yourself. In other words, to know whether your repository on a remote backend is in good health, adding the --read-data parameter is your only way of checking for remote issues.
Or am I misinterpreting something?

Edit:
I don’t want to back up for months and then, when restoring, discover that a backup of the file in question taken four weeks ago broke its restoration, leaving me unable to restore it. Or is that a scenario that would not occur? Maybe I’m mixing things up and taking this off-topic now.

Ah, I see what you mean.

Indeed, if you don’t know the file you want to investigate you can’t download it. In this case, @Sitwon knows which file it is because of the error messages he received earlier (in this case it’s 2857a160da9758723af784bc7fb5a90d48b553872f68246e4e38fff3409813c7). So that’s a situation where you might benefit from checking the file manually.

Backing up a (source) file multiple times should definitely not make it less restorable. If a blob is already in the repository, restic will not save that same blob again; it will just reference it from the additional snapshots. In fact, restic only writes files to the repository, it doesn’t modify them; that’s part of it trying to be as cautious as possible. So there’s really nothing restic would do to an existing blob file that could harm it.

Hardware issues such as disk corruption can however make some of your repository files and blobs have incorrect data in them (after they were created). There’s currently nothing restic can do about that, except give you the ability to detect when this is the case.

And since no one, including you and restic, will get a notice when e.g. silent disk corruption happens, as you say the only way to know for sure that your repository is intact is to do a full check using check --read-data. You can do this on the repository server if you have access to it.

Of course you should also try restoring from it every now and then, that’s good practice.

Well, the problem I’m having is that I know the blob ID that is corrupted, but I can’t figure out which file in the repository contains that blob. There is no file in the repository which seems to match the blob ID from the error message.

I think the problem is that the first time I tried to back up the affected files, the process was interrupted, leaving behind a corrupt partial backup. Subsequent backups saw that a matching blob ID existed and skipped over it, so no number of subsequent backups would ever correct the corruption in the repo.

So upon further digging, I now understand what restic check --read-data was reporting. The pack file 9b5a3cf4 is corrupt and has a sha256sum of a84e38c1.

adam@tardis:~$ restic -r /srv/array0/restic_backup/ cat pack 9b5a3cf4ccea5e20557c62e6e9eef27cf529ac13a42d65b7e10592d847976042 | sha256sum
enter password for repository: 
Warning: hash of data does not match ID, want
  9b5a3cf4ccea5e20557c62e6e9eef27cf529ac13a42d65b7e10592d847976042
got:
  a84e38c1766fcb0fa23e0cb4d1c2ae9aab298515096b1ccaedf1b44a45146370
a84e38c1766fcb0fa23e0cb4d1c2ae9aab298515096b1ccaedf1b44a45146370  -

I’m still not clear on how (or whether) it’s possible to repair this without starting over with a new repository. I hope it is possible, because if this repository contained several months or years of snapshots, I would not want to throw away all of that history. Even if this was caused by silent hard drive corruption (which I find unlikely in this scenario), how would you salvage the existing good data and unaffected snapshots?

Excellent questions. I feel someone with more insight into the restic internals and features should answer that.

I’m trying to grok the code to see if my suspicions are correct, and from what I can tell, it looks like the sftp backend tries to write the file directly to the destination path in the repository. (I could be wrong; I don’t have a lot of experience with Go, and there’s a lot of code here to wrap my head around.)

To prevent corruption if the backup is interrupted, the safer approach is to write to a temp file, then move the temp file to the target path. If you start writing to data/xx/xxxx… and the connection gets interrupted, you end up with a partially-written file. But a move is an atomic file-system operation: once the data is on the file system, the file either moves and is complete, or it doesn’t move at all. There’s no risk of a partial file.
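The temp-file-then-rename pattern can be sketched in shell; the function name and paths below are hypothetical, and this is only an illustration of the idea, not how restic’s backends actually write:

```shell
# Toy version of the temp-file-then-rename pattern. mv within a single
# file system is a rename(2), so readers see the whole file or nothing.
atomic_write() {
    dest="$1"
    tmp="$(mktemp "$(dirname "$dest")/.upload.XXXXXX")"
    cat > "$tmp"        # stream the data into a temp file first
    sync                # flush to disk before publishing
    mv "$tmp" "$dest"   # atomic publish: no partial file at $dest
}

mkdir -p /tmp/atomicdemo
printf 'pack contents' | atomic_write /tmp/atomicdemo/pack
cat /tmp/atomicdemo/pack
```

If the writer dies before the mv, only a stray temp file is left behind; the final path never holds partial data.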

@rawtaz the ID 2857a160da9758723af784bc7fb5a90d48b553872f68246e4e38fff3409813c7 identifies the blob within the pack file, of which the file name is 9b5a3cf4ccea5e20557c62e6e9eef27cf529ac13a42d65b7e10592d847976042

Finding the pack for a blob is possible by looking at the index. You can list the index files via restic list index, then decrypt and print one to stdout via restic cat index ID, which can in turn be fed to jq for pretty-printing:

$ restic cat index f94c98288155eca29801a585946e14d16e8f8e9a520346b0d5334e2263f120e5 | jq .
[...]
    {
      "id": "6ad6006670f01718e6b2ae6c9b4e1b3dc78c949e92531dcc7a75434b53d3e41d",
      "blobs": [
        {
          "id": "b3928295a547b90bd8ba92d84e3590d181a19fc4be5f1e7e576487ca56103864",
          "type": "data",
          "offset": 0,
          "length": 846437
        },
        {
          "id": "78d02070292e97dda5f65b175ece6a120a5d375d86917dd5a2aa789677a71e43",
          "type": "data",
          "offset": 2667780,
          "length": 604370
        },
        {
          "id": "69b24d1bda844f41710e45976904f6425ed0bfb57b7e8b736e9941261d879549",
          "type": "data",
          "offset": 3272150,
          "length": 927341
        },
        {
          "id": "13d54ce2efb071b64f3bbcabbec240ac6fb3b54ef52fda3d6e83898424395bcd",
          "type": "data",
          "offset": 846437,
          "length": 1821343
        }
      ]
    },
[...]

There you can see four blobs contained in a pack file. Roughly the same content is also stored in the header at the very end of the pack file.

To be honest, restic is not handling this situation very well. A solution would be:

  • Save all unmodified blobs from the damaged pack file to individual files
  • Remove the pack file
  • Run restic rebuild-index, so that the index is rebuilt from the available pack files and no longer contains any references to the broken blob
  • Then back up the files saved in the first step. That will insert the blobs into the repo again, most likely as individual blobs.

Afterwards, there’s only the one single blob missing from the repo. If this blob is found somewhere during the next backup, it is saved again in the repo and everything is repaired. If it isn’t found, then all files referencing that blob cannot be fully restored.

We have some plans to address this situation by adding recovery information to the pack files, but these are only rough plans right now. You can find the discussion here: https://github.com/restic/restic/issues/256

You’re right, we’re directly writing the data in the local and sftp backends. We used to write to a tempfile and then move it around, but it turned out that this is not ideal for some file systems. Moving is not guaranteed to be an atomic operation, that depends on the file system. For several popular fuse-based file systems it’s not the case, for example.

What we’re doing instead is telling the fs via fsync() to flush the remaining data to disk, and only at the very end adding the contents of the newly written pack file to the index. If anything happens and the process is interrupted, the pack file hasn’t been added to the index yet, so when you start restic again the data is uploaded again. The remainders are cleaned up by the next run of restic prune.

The prune operation (and rebuild-index) will read the header at the end of the pack file. If that is intact, the file has been written completely and can be used. Otherwise it is removed as a leftover from an interrupted operation.
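The header-at-the-end check can be illustrated with a toy format; the PACKEND marker below is just a stand-in for restic’s real (encrypted) pack header, not its actual on-disk layout:

```shell
# Toy pack format: data followed by a trailing marker. A pack is only
# treated as complete if the marker is intact; a truncated upload fails
# the check and can be discarded as a leftover.
write_pack()    { cat > "$1"; printf 'PACKEND' >> "$1"; }
pack_complete() { [ "$(tail -c 7 "$1")" = "PACKEND" ]; }

printf 'blob data' | write_pack /tmp/pack_ok
head -c 5 /tmp/pack_ok > /tmp/pack_truncated   # simulate an interrupted write

pack_complete /tmp/pack_ok        && echo "pack_ok is complete"
pack_complete /tmp/pack_truncated || echo "pack_truncated is a leftover"
```

Because the marker is written last, its presence implies everything before it made it to the file.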

In practice (even with terabytes of data) this turned out to work quite well.

Ok, I am attempting to repair my repo as you suggest. (Learning a lot about restic’s internals in the process, so that’s fun.)

It turns out the only two blobs in the pack file are the ones that were already reported as corrupt when I tried to restore.

List of blobs in affected pack file:

PACK_ID="9b5a3cf4ccea5e20557c62e6e9eef27cf529ac13a42d65b7e10592d847976042"
for idx in $(restic -r . list index) ; do
    # Pass the ID via --arg; inside single quotes the shell won't expand $PACK_ID.
    restic -r . cat index "$idx" | jq --arg id "$PACK_ID" '.packs[] | select( .id == $id )'
done

My results:

{
  "id": "9b5a3cf4ccea5e20557c62e6e9eef27cf529ac13a42d65b7e10592d847976042",
  "blobs": [
    {
      "id": "2857a160da9758723af784bc7fb5a90d48b553872f68246e4e38fff3409813c7",
      "type": "data",
      "offset": 0,
      "length": 2630968
    },
    {
      "id": "33854820b635b590f4840438c0cc3d9f7f4f57c6696303a2124aa475fcd6b93e",
      "type": "data",
      "offset": 2630968,
      "length": 2353981
    }
  ]
}

By walking the tree, starting from the snapshot I knew contained the affected files, I was able to locate those files by name and confirm that each of these blob IDs appeared in their blob lists.

I removed the corrupted pack file from the repository, and then backed up the original host again. Now it appears that the two affected blobs have been re-inserted and cat-ing the blobs to sha256sum produces the expected hash values.

I followed up with a restic prune and now it’s running restic check --read-data. I’ll update the thread if I see anything interesting in the output.

Thanks for the help!

I had thought about suggesting a par2-like parity feature; I guess you already had that on your to-do list. Being able to verify that your backups are correct is fantastic, but being unable to heal a damaged repository is terrifying.

I’m currently bootstrapping a second, independent, repository on B2. I’m hoping that two independent repositories will afford me a bit more safety than simply cloning one repository to the cloud. If one becomes corrupt, the other might still have the data, not just a mirror image of the corrupted blobs. Still, backing up 12.7TB to B2 seems to be quite a slow operation. I would imagine adding parity data to that would only make it slower.

Part of the reason I’m skeptical of bit-rot as an explanation for the corruption is that this repository is stored on a md_raid6 array. I would expect that the parity in the RAID adds another layer of protection on top of the checksumming the drives perform as they read each sector.

That’s a fair point. If a file system like sshfs implemented ‘move’ as ‘copy, then delete’, the performance hit would be atrocious, and it would undermine the protection we were trying to achieve.

Sounds like a good plan. restic selects an independent polynomial for the chunker (which splits files into blobs) in each repository, so you will get different blobs for the larger files.

That’s possible, yes. Depends on the algorithm and how much data it is, but yes, that needs to be uploaded…

But that does not cover e.g. bad memory. We’ve had a few (~6 or so) similar cases in the past, and most of them were later found to be caused by bad hardware, either storage or memory, e.g. https://github.com/restic/restic/issues/1403. We also had a report on IRC two weeks ago where a single bit was flipped in a pack file, so a single blob could not be decrypted. The user found it by re-encrypting the same plaintext with the same IV and comparing the ciphertexts. In that case it remained unclear how it happened.

Before starting restic, I couldn’t have imagined how often hardware issues occur. At least restic allows detecting them before you need the data :wink:


I’ve been thinking more about this problem.

Luckily I was able to fix it, even without parity data, because I still had access to the original data. However, in order to fix the problem I had to do some spelunking through the repo and manually remove the corrupted file.

It seems like if you know that a blob or pack is corrupt, it does you no good to keep it around. What I suggest is that when restic check finds errors like this, it should give the user the option to remove the damaged files (perhaps move them to a corrupted_blobs/ folder) and re-index the repository. That would at least give subsequent backups the opportunity to write new (hopefully correct) blobs into the repository, letting it heal organically instead of forcing the damage to persist.

Also, once I knew the IDs of the affected blobs, there wasn’t an easy way to query where those blobs lived on disk. I was able to find the pack with restic check --read-data, which reported the corrupted pack, but it would have been faster to query the index for the pack containing a given blob ID.

A command that essentially does this (assuming the blob is in a pack):

BLOB_ID="2857a160da9758723af784bc7fb5a90d48b553872f68246e4e38fff3409813c7"
for idx in $(restic -r . list index) ; do
    # Pass the ID via --arg; inside single quotes the shell won't expand $BLOB_ID.
    restic -r . cat index "$idx" | jq --arg id "$BLOB_ID" '.packs[] | select( .blobs[].id == $id ) | .id'
done
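The blob-to-pack query can be sanity-checked against a synthetic index document (passing the ID with jq’s --arg, since a shell variable inside single quotes wouldn’t expand). The IDs below are made up, and the JSON is trimmed to just the fields the query touches:

```shell
# Minimal fake index with the same shape the query relies on.
cat > /tmp/fake_index.json <<'EOF'
{"packs":[
  {"id":"packA","blobs":[{"id":"blob1"},{"id":"blob2"}]},
  {"id":"packB","blobs":[{"id":"blob3"}]}
]}
EOF

# Map a blob ID to the pack that contains it.
jq -r --arg id "blob3" \
   '.packs[] | select( .blobs[].id == $id ) | .id' /tmp/fake_index.json
# → packB
```

An unknown blob ID simply produces no output, which is also what you see when the blob isn’t indexed.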

You’re right, that’s an (apparently) common case that we need to handle a lot better. I’m thinking about having prune print a warning, but continue repacking things, leaving the bad blob out. That’ll cause errors to be printed during check, and the affected files which reference the blob won’t be restored properly, but at least the repo will heal itself when the original blob is still available.

We’ll get to it, eventually. So much to do, so little time :slight_smile:
