How to debug restic integrity errors?

I have a few tens of GB of data in a restic repository. Every time I run

restic check --read-data

I find numerous data integrity errors of various sorts. My understanding is that bit flips should be rare, and I am not sure bit flips could be the cause of these errors.

I ran “memtester” and “smartctl” and I don’t find hardware faults. I run restic 0.12.1, on Ubuntu 20.04. The client’s file system is ext4 with SSD, and destination file system is btrfs with RAID (so no chance of bit flip on remote), backend is SFTP. None uses ECC RAM.

Here are several types of errors I frequently encounter:

  • Pack ID does not match, want 22e17e03, got 1f564ae7

  • Pack 70133809 contains 1 errors: [Blob ID does not match, want 743eae2c, got c23d20d1]

  • Error for tree 5eb7199f:pshots
    invalid character ‘\r’ in string literal

Pack and blob hash mismatches (the first two errors) are very common. Usually I get a handful of them in a few tens of GB. But strangely, in a TB-sized repository I get over 30 errors of this type (though this could be a special case that I suspect is due to permission problems).

I have several questions.

  • What are the usual causes of these errors?

My initial guess is: network interruptions, backup interruptions, background processes (such as the Dropbox daemon or a browser) writing to the file system while restic is backing up, permission problems that prevent restic from reading some files in the source or destination and cause it to output a misleading or unhelpful error message, problems with the restic software, and bit flips due to faulty hardware.

My guess is that these errors also occur with any data transfer tool, such as rsync, but go unnoticed there. But I am not sure.

  • How to debug these errors?

For each error, how can I find the directories and files affected? So far, I am using the following commands.

restic -r repo find --pack 70133809

restic -r repo cat pack 70133809

restic -r repo cat blob 1e514aa8

restic -r repo find --show-pack-id --tree 80b7199

Playing with these commands, sometimes I find the affected file. Usually they are weird files, like dot files in Dropbox, files with very long names such as saved HTML files, and sometimes videos or MP3’s.

What’s the step by step way to debug restic errors?

It would be good to have a utility or guide to debug restic errors. You would run something like “restic find error PACK-ID” and it will provide useful information.
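As a small aid for triaging, the three message shapes quoted in the question can be grouped by pack, blob and tree ID with a short script. This is only a sketch: the regular expressions are modelled on the lines pasted above, `summarize` is a hypothetical helper name, and other restic versions may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns modelled on the messages quoted in the question (an assumption;
# they are not taken from restic's source code).
PACK_MISMATCH = re.compile(r"Pack ID does not match, want (\w+), got (\w+)")
PACK_ERRORS = re.compile(r"Pack (\w+) contains \d+ errors?: \[(.*)\]")
BLOB_MISMATCH = re.compile(r"Blob ID does not match, want (\w+), got (\w+)")
TREE_ERROR = re.compile(r"Error for tree (\w+)")


def summarize(check_output: str) -> dict:
    """Group the IDs mentioned in `restic check` output by error type."""
    report = {"packs": set(), "blobs": defaultdict(set), "trees": set()}
    for line in check_output.splitlines():
        if m := PACK_ERRORS.search(line):
            # A pack whose content contains blobs with a wrong hash.
            pack = m.group(1)
            for want, _got in BLOB_MISMATCH.findall(m.group(2)):
                report["blobs"][pack].add(want)
        elif m := PACK_MISMATCH.search(line):
            # A pack whose file content no longer hashes to its name.
            report["packs"].add(m.group(1))
        elif m := TREE_ERROR.search(line):
            report["trees"].add(m.group(1))
    return report
```

The short IDs collected this way can then be passed to the `restic find` and `restic cat` commands listed above, one error class at a time.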

Which raid level?

Raid 1.

In general, RAID provides some level of parity, making bit flip much less likely, which was my main comment.

I doubt bit flip is the issue.

I was thinking about raid 5/6, which is currently marked as unstable by developers because of known errors.

This was interesting - can you elaborate on in which way you mean that RAID 5/6 is unstable? Which developers say that, and do you have a reference to it? I’m not aware that these are unstable.

Raid5/6 is widely used. It would be unusual if it’s officially declared as unstable!

There are quite a few articles explaining why you should generally avoid RAID5 (e.g. this article), but I was referring to the Btrfs Wiki and this article on Phoronix.

btrfs raid 5/6 is prone to errors if you have a power failure and a disk failure at the same time; use of a UPS can eliminate the first. It is not the only file system with this problem, but in general people take shots at it because it is a new file system (born in 2007, marked stable in the Linux kernel in 2013).

btrfs is capable of detecting parity errors, and the various raid levels can automatically repair errors (I believe a non-raid configuration is capable of repairing a single bit flip).

It is not just a file system, it is also a file manager. The term raid is used, but it is different than the common perception of raid: raid 1 means there are two copies of data on different disks. An array can consist of two or more disks and the disks do not have to be the same size.

One feature btrfs has in common with restic is that most (if not all) errors are caused by issues with the underlying hardware (disks, RAM).

I have used openSUSE Tumbleweed with btrfs on my main drive, and btrfs raid 1 on data drives, for a number of years without issue. I also use btrfs raid 1 on a Thecus NAS using OMV, without a problem. So far I have had a single instance of an error being detected and repaired.

An interrupted network connection or backup generally only causes “incomplete pack file” warnings. Background processes could cause restic to read inconsistent data, but that won’t damage the repository. Permissions problems on the backend would completely prevent reading some files from the backend, which leads to other error messages. Permissions problems during backup would have been reported properly and also cannot lead to check errors. Problems with restic are always a possibility, but the kind of error messages you see are, in my experience, usually caused by bit flips in hardware.

Restic is designed to keep the backup repository intact no matter when during a backup or other operations it or the network connection is interrupted (there are a few caveats with sftp which can require manually deleting some incomplete files in rare cases. But the errors you see are something completely different).

The first error type could be caused by data corruption in the backend storage or by bitflips during the backup.
The “Blob ID does not match” and “invalid character” errors can only be caused by bitflips on the host running the backup. They cannot be caused by some random bitflips in the storage backend (no matter whether it is a BTRFS raid or something completely different).

To trigger the last two error types, the bitflips must be introduced during or prior to encryption of the blobs. And that process runs completely in memory on the client which creates the backup. The “invalid character” error is a bit strange, as it can only be introduced either during the JSON encoding or while calculating the blob hash (which happens before encrypting the blob). In any case this hints at a memory or CPU problem. I’d recommend running prime95 to check whether the system stays stable under load.
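The reasoning above can be illustrated with a toy model. This is a sketch under loud assumptions: the XOR keystream and HMAC-SHA256 below are stand-ins for restic's real cipher and MAC, and `seal`, `verify` and `flip_bit` are hypothetical names. What it shows is only the ordering argument: a blob's ID is the hash of its plaintext, so a flip that happens before encryption is sealed under a valid MAC and surfaces later as a hash mismatch, while a flip in stored ciphertext breaks the MAC instead.

```python
import hashlib
import hmac

KEY = b"toy-key-not-a-real-restic-key"  # illustrative only


def keystream(n: int) -> bytes:
    """Toy keystream (stand-in for a real stream cipher)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(KEY + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]


def xor_cipher(data: bytes) -> bytes:
    """XOR with the keystream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, keystream(len(data))))


def seal(plaintext: bytes) -> bytes:
    """Encrypt, then append a MAC computed over the ciphertext."""
    ct = xor_cipher(plaintext)
    return ct + hmac.new(KEY, ct, hashlib.sha256).digest()


def verify(blob_id: str, sealed: bytes) -> str:
    """Check the MAC first, then compare the plaintext hash with the ID."""
    ct, tag = sealed[:-32], sealed[-32:]
    if not hmac.compare_digest(tag, hmac.new(KEY, ct, hashlib.sha256).digest()):
        return "ciphertext verification failed"
    if hashlib.sha256(xor_cipher(ct)).hexdigest() != blob_id:
        return "Blob ID does not match"
    return "ok"


def flip_bit(data: bytes, index: int = 0) -> bytes:
    """Flip the lowest bit of one byte."""
    buf = bytearray(data)
    buf[index] ^= 0x01
    return bytes(buf)


data = b"some file chunk"
blob_id = hashlib.sha256(data).hexdigest()  # ID is computed on the plaintext

client_flip = seal(flip_bit(data))   # flipped in RAM after hashing, before encryption
storage_flip = flip_bit(seal(data))  # flipped at rest, after encryption

print(verify(blob_id, seal(data)))    # intact blob
print(verify(blob_id, client_flip))   # valid MAC, wrong hash
print(verify(blob_id, storage_flip))  # MAC check fails
```

In this model the client-side flip is the only one that can produce a "Blob ID does not match" result, which is the point made above about the storage backend not being able to cause it.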

Which kernel version runs on the client? We had some problems with data corruption and a kernel bug in the past, see How to fix "failed: ciphertext verification failed" running 'restic 'prune' - #4 by fd0 .

Thanks for the response.

I am running Linux kernel 5.11.

There are 3 packs whose names don’t match the sha256sum of their content. I ran find --pack and I see that these packs contain several blobs that refer to some files in my Dropbox folder. I don’t care about these files and want restic to forget the corrupted files.

Following a GitHub discussion, I removed these 3 packs, copied the index files and ran restic rebuild-index followed by restic backup. I now run restic check and I get a whole bunch of errors like:

tree kuhwrvjh file avddhtrd blob kigstukgdghj not found in index.

Any idea how to get rid of these errors?!

I removed local cache and rebuilt the index several times, which didn’t help.

I don’t care about data in the lost packs. I had therefore removed them from my Dropbox. It seems restic couldn’t recreate the missing data from source, and complains that some data is missing.

There are a lot of blobs. Forgetting all snapshots that are affected is probably not a good idea (there are tens of blobs. Also, you don’t want to forget a whole snapshot just for a bit flip).

Another question.

When a pack ID does not match, the content of pack file has changed (assuming bit flip is not in the hash).

A pack file is a concatenation of items:

[blob type | blob ciphertext | MAC ]

followed by an encrypted header.

When I run find --pack on a damaged pack file, I get a number of blobs and a list of files, with their real file names, in which these blobs are used.

Do these files contain only intact blobs? If the error is in the blob ciphertext, the MAC won’t check out; would restic silently ignore those blobs?

How can I explore a damaged pack file, list the damaged and healthy blobs in it, and remove only the damaged blobs (not the whole pack)? The pack header and pack file name would need to be updated too.

Can we add a --repair option to fix the integrity errors using data in other snapshots or from the source, or by simply removing the affected data?

Please have a look at https://github.com/restic/restic/issues/828#issuecomment-706186047 .

The idea there is to salvage as much data from the damaged pack files as possible, remove the broken pack files and add the salvaged data back to the repository (in new pack files).

find does not check the integrity of the reported files/blobs. Only the check command verifies the data integrity.

Once you’ve run rebuild-index, later backup runs will recover the missing blobs if the original files still exist. As blobs are only stored once in a repository and are shared between snapshots, there are no other snapshots which could provide the missing blobs. Removing the affected data from a snapshot will require the creation of a new snapshot. See the above link for more details on that.
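The behaviour described above can be modelled in a few lines. This is a toy sketch, not restic's data structures: `ToyRepo` and its methods are hypothetical, and a plain dict stands in for the repository index. It shows why other snapshots cannot supply a lost blob (each blob exists once, keyed by its hash) and why a later backup heals the hole only if the source data still exists.

```python
import hashlib


def blob_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ToyRepo:
    """Toy model of a deduplicating repository (illustrative only)."""

    def __init__(self):
        self.index = {}      # blob ID -> data; every blob is stored once
        self.snapshots = []  # each snapshot is a list of blob IDs

    def backup(self, chunks) -> int:
        """Store chunks not yet in the index; return how many were uploaded."""
        uploaded, ids = 0, []
        for chunk in chunks:
            bid = blob_id(chunk)
            if bid not in self.index:
                self.index[bid] = chunk
                uploaded += 1
            ids.append(bid)
        self.snapshots.append(ids)
        return uploaded

    def missing(self) -> set:
        """Blob IDs referenced by some snapshot but absent from the index."""
        return {b for snap in self.snapshots for b in snap if b not in self.index}


repo = ToyRepo()
source = [b"chunk A", b"chunk B"]
repo.backup(source)                  # first run uploads both chunks
repo.backup(source)                  # second run uploads nothing (dedup)

del repo.index[blob_id(b"chunk B")]  # damaged pack removed, index rebuilt
print(len(repo.missing()))           # both snapshots now reference a missing blob

repo.backup(source)                  # source still has chunk B: it is uploaded again
print(len(repo.missing()))           # the hole is closed for all snapshots
```

If `chunk B` had been deleted from the source before the last run, `missing()` would stay non-empty, which corresponds to the "not found in index" errors reported earlier in the thread.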

Based on the errors reported above, these are not the only pack files with damaged blobs. Make sure to run check --read-data in the end to verify the whole repository content.

You are right! Missing data cannot be found from other snapshots due to deduplication ( I don’t know how I mistakenly said that!), so recovering from source or just removing affected files is the way to go.

Let me follow the GitHub page.

It would be good if this was a repair flag, not a manual process. Could such repair feature be built into restic?

It seems easy to add to code: after backup check the hashes for newly created incremental blobs, and if something fails, copy it again from source right there, or at least in the next backup.

Borg has something similar. It was scary and long, but it healed all checksum errors!!

I’m not completely sure what you’re suggesting. If restic fails to upload a new blob during a backup, then the backup run will fail. A later backup will try to upload the blob again. If a blob is missing from the repository index, then the backup will notice and upload the blob again.

The borg documentation mentions that borg check --repair replaces missing blobs with all zero blobs. The data format in restic currently doesn’t allow for such (temporary) replacement blocks without messing with the self-healing described above and without damaging future snapshots.

There’s currently no automatic repair command, as there are lots of corner cases to handle if we want to ensure that a repository is not damaged any further. And so far (at least in my impression) the reported repository damages are usually caused by some hardware or other underlying problem, which has to be fixed first to ensure a reliable backup. That doesn’t mean that there won’t be an automatic way to repair a repository in the future, but it’s currently not a priority.

Thank you Mike for clarification.

I should probably learn more about how restic backups work under the hood, before asking more questions!

I meant: suppose that a bit flip occurs and restic creates a pack file whose sha256sum does not equal the pack file name. restic could verify the checksum right after the pack is written to the repository (a sort of automatic check --read-data, but only for newly created packs or blobs, so that it’s fast). If there is a hash mismatch, it could create the file again while the source data still exists. In other words, simply copy the data twice during the backup for problematic packs, to ensure that once a backup has finished, no integrity errors were introduced during it.

Of course, bit flips could also occur later on in the encrypted data at rest. In this case, restic could re-copy the data from the source as soon as it becomes aware of the problem, without requiring a complicated surgical process performed by the user. So restic, running on the host, would handle problems on the host, while cloud providers would handle the integrity of the repository at rest on the remote.

In other words: OK, a bit flip has occurred due to hardware faults, but the source is available, so what prevents restic from automatically copying the correct data, ideally right then or at least in the next backup?

One trivial solution is to compute a pack twice, or store a pack twice. That’s not space efficient error correction coding, but ensures recovery.

I am probably missing important points and I apologize!

I think laptops will not run ZFS with raid and ECC RAM any time soon due to space limitations. So integrity errors and damaged repositories will be with us.

I don’t know what’s the experience of other users with damaged repositories.

That sounds quite similar to what has been implemented for cloud/rest backends in https://github.com/restic/restic/pull/3246 (not yet included in a release). Reading a pack file again from the local disk or via SFTP could show whether the pack file content matches the hash. The problem is that the read will be served from the in-memory page cache and not from the hard drive on which the pack file is stored.

The problem is not so much in correcting an error which occurs during the current backup run, but rather in detecting it without a ton of overhead. Just checking the sha256 hash of a pack file is far from sufficient. That would only detect “Pack ID does not match” errors but not other types. It would probably be necessary to verify each individual blob and for tree blobs also to verify that these were correctly verified and only reference existing blobs etc.
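For reference, the pack-level check mentioned above (file name versus SHA-256 of the file content) can be sketched as follows. It assumes the repository is reachable as a local directory with pack files stored under `data/`, and `mismatched_packs` is a hypothetical helper name. As noted, it would only catch "Pack ID does not match" errors, and a read shortly after a write may be served from the page cache rather than from disk.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large packs need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def mismatched_packs(repo_dir) -> list:
    """Return (file name, actual hash) for packs whose name is not their hash."""
    bad = []
    for path in sorted(Path(repo_dir, "data").rglob("*")):
        if path.is_file():
            actual = sha256_file(path)
            if actual != path.name:
                bad.append((path.name, actual))
    return bad
```

Blob-level and tree-level damage is invisible to this check, since it never decrypts anything; that still requires `check --read-data`.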

restic won’t notice the bit flips until one runs check, prune or restore. None of these operations is in a position to start looking for missing blobs. But I guess we could extend check or something else to provide the functionality of debug examine (with a bit more automation).

It’s probably much more reliable to just create two different repositories at two different storage locations. That would have the benefit that the risk of correlated disk failures (i.e. both copies stored on the same disk) is much lower.