Restic backup errors after running check --read-data

teddy · August 28, 2020, 1:27am

As a new user of restic I have had a mixed experience. Most repos have backed up okay, but one repo has always produced errors. I started out using Borg and got errors on a similar backup process, which made me then consider “restic”. I have found restic to be faster on the first and subsequent backup operations over terabytes of data.

Most of the files are video mkv files. Source and repo drives are local to the PC.
Version: | debug enabled | restic 0.9.6 (v0.9.6-349-gfe69b830) compiled with go1.15 on linux/amd64
I disable clamav-daemon.service

After running

/usr/bin/restic $RESTIC_OPTIONS \
					 --files-from "$INCLUDEFILE" \
					 --exclude-file "$EXCLUDEFILE" \
					 -r "${REPO}" \
					 backup "${SOURCE[i]}" >>"${LOGBACKUP}" 2>&1

and then forget/prune:

/usr/bin/restic "$GLOBAL_OPTIONS" forget \
				--keep-daily 7 \
				--keep-weekly 5 \
				--keep-monthly 6 \
				--keep-yearly 10 \
				--keep-last 7 \
				--prune \
				-r "${REPO}" >>"${LOGFORGET}" 2>&1

then a basic check:

/usr/bin/restic "$CHECK_OPTIONS" \
					 check -r "${REPO}" >>"${LOGCHECK}" 2>&1

I get no errors.

debug enabled
using temporary cache in /tmp/restic-check-cache-682081467
created new cache in /tmp/restic-check-cache-682081467
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
no errors were found

However, after running another check with --read-data enabled I start to get errors

debug enabled
using temporary cache in /tmp/restic-check-cache-921366440
created new cache in /tmp/restic-check-cache-921366440
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
Pack ID does not match, want eb297dda, got adeaa6db
pack cd462026 contains 1 errors: [blob 3: ciphertext verification failed]
pack 5b04aa31 contains 1 errors: [blob 0: ciphertext verification failed]

I have run memtest86+ without issues, and smartctl output shows no issues for the source or repo drives, which are both local to the PC.

I have run a
$ diff dir1 dir2
between the mounted repo snapshot directory and the source directory and got one file that was apparently different. Copied that file back to my disk and did a $ cmp file1 file2 with no difference showing.

I’m at a loss to get this working error free.

Then I took the sub-directory that contained the problem file mentioned above and created a smaller backup of that directory into a new repository. The backup completed without errors. I then did a check on that repo with --read-data enabled. No errors were found for this process.

I also have this issue occurring for other drives for their backup to separate repositories when running $ check --read-data

It is hard not to think a bug in restic might be the issue. It is as if what it reads before saving is not what it actually saves, so a checksum error occurs. I see it has not reached v1 yet, so I guess it is still under heavy development. I’ve noticed other issues that I’ll raise in another post.

Running $ check --read-data is a long process when the drives have TB of data (~30+ hrs). So it is difficult to have confidence yet while this occurs, as it takes 2+ days to realise there is an error.

alexweiss · August 28, 2020, 10:33am

Honestly, those errors are most likely due to hardware errors like defective memory or a failing disk. Especially as you have reported that two backup tools which operate pretty similarly both independently produce errors, you should follow the hints that this might be a problem on your hardware side.

About your errors: this means that the message authentication code (MAC) which was computed from the encrypted blob content (when saving it) does not match the MAC computed from the encrypted blob content that is actually stored.

cdhowie · August 29, 2020, 5:37am

I have to agree with @alexweiss. There have been other instances on this forum where someone reported repository corruption, sometimes aggressively blaming restic. I’m not saying restic is bug-free, but in those cases it was ultimately determined to be a hardware problem.

Note that diagnostic tools like memtest and SMART are incapable of proving that there aren’t any issues (you can’t prove a negative), they can only prove that there is an issue. Obviously, no errors from some of these tools is generally a decent indication that there aren’t problems, but when you have two different backup tools reporting corruption in their respective repositories, that is a very strong indication that the problem lies outside of the backup software.

My suggestion would be to exercise the drive that the repositories are stored on. Copy a huge tree of files (which you know will not change) to the disk then run diff -rq between the source and destination. In at least one other case on this forum, diff reported differences where the files on the backup drive were corrupt.
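A minimal sketch of the drive exercise described above, assuming a POSIX shell with cp and diff available. The function name and the example paths are hypothetical, not something from this thread.

```shell
# Sketch: copy a static tree to the suspect drive, then compare it with the source.
# The source and destination directories are passed in by the caller.
exercise_drive() {
    src="$1"
    dst="$2"
    mkdir -p "$dst" || return 2
    # -a preserves attributes; the trailing /. copies the contents of src into dst.
    cp -a "$src"/. "$dst"/ || return 2
    # Flush dirty pages so the data is actually written out to the disk.
    sync
    # -r recurses, -q only names the files that differ. Exit status 0 means identical.
    diff -rq "$src" "$dst"
}

# Example (hypothetical paths):
# exercise_drive /data/videos /mnt/repo-drive/copy-test && echo "no differences"
```

One caveat: right after the copy, diff may be served from the page cache rather than from the disk. Using a tree larger than the machine's RAM, or remounting the drive before comparing, makes it more likely that the data is really read back from the drive.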

Outside of restic I’ve seen lots of cases of silent corruption like this, and they almost always point to memory or drive issues. Generally speaking, memtest is very good at finding errors. While any tool can have blind spots, I would trust it here.

On the other hand, SMART can only tell you when a drive is definitely failing. I have seen many dozens of cases where a failing drive did not show any issues whatsoever in its SMART report. I have seen absolutely no strong correlation between what a SMART report says and the condition of a drive, except when the SMART report says the drive is bad – in pretty much every case where I’ve seen that, it is.

tl;dr: I would bet the drive is bad. A good SMART report absolutely does not imply a good drive.

teddy · August 29, 2020, 6:26am

Okay, thank you for taking the time to reply with your valuable feedback. After reconsidering these issues and the weight of opinion, I thought it best to rerun memtest86+, but this time let it run for much longer (at least 3 hours). I managed to get 3 lines of failing memory addresses, which is not great for me, but good in the sense that it is a likely culprit. It is probably time for a new box. The RAM is older DDR2 non-ECC. I guess ECC would provide better assurance on these things.

By way of background, I’ve been around computers a long time now and this is the first time memory has caused an issue like this for me. Perhaps because of the unique characteristics of deduplication software, the memory has never been tested as rigorously as this type of software tests it. I did read earlier in the forum, and in a paragraph in the user guide, about hardware issues rearing their head, but one never thinks it will happen to them. It is a rapid learning curve or a rude awakening that your hardware is second rate or failing. It probably also means that my files are slowly being corrupted by memory read/write errors.

I would suggest adding, in the introduction of restic or at the top of the FAQ, a big red warning that highlights how restic can bring up hardware errors that you never thought existed because of its stress testing/rigorous data processing mechanism.

The big issue is that it takes a whole backup and $ check --read-data cycle to find that out. The regular check did not highlight this issue. Would it not be possible for a checksum error to be reported as the backup is occurring? I guess one problem with that approach is that it would slow the process down, in some network cases, greatly. My argument is that, being a critical issue, it is best to know each and every time, or at least for the first backup, that the data integrity is okay. Speed is secondary to integrity in my opinion. Alternatively, it also perhaps highlights that the --read-data check is actually a critical test to perform regularly.

teddy · August 29, 2020, 6:38am

Yes, I recently lost a 7-year-old drive that reported as okay. It did show some errors, though, in its statistics, which I now monitor regularly, such as (this drive is not related to the issues in this topic):

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       73

The last three lines here aren’t looking good, with raw values not equal to zero, even though the drive status is ‘PASSED’.
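A small sketch of the kind of regular monitoring mentioned above, assuming attribute lines in the `smartctl -A` column layout shown in this post (attribute ID in column 1, raw value in column 10). The function name and the choice of attribute IDs are illustrative assumptions, not a recommendation from smartmontools.

```shell
# Sketch: flag nonzero raw values for a few SMART attributes.
# Reads smartctl -A style attribute lines on stdin.
flag_smart_attrs() {
    awk '
        # Column 1 is the attribute ID, column 2 its name, column 10 the raw value.
        ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 || $1 == 199) && $10 + 0 > 0 {
            printf "%s %s raw=%s\n", $1, $2, $10
        }
    '
}

# Example (hypothetical device name):
# smartctl -A /dev/sda | flag_smart_attrs
```

Fed the five lines above, this prints the rows for attributes 197, 198 and 199 and skips the two rows whose raw value is 0.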

cdhowie · August 29, 2020, 6:45am

Depending on the age of the drive, it may not be cause for alarm. Offline_Uncorrectable refers to unrecoverable read errors (UREs) and every drive is expected to have some of those during its lifetime. Manufacturers publish what they consider to be the worst URE rate the drive should have and not be considered defective. Many consumer drives are rated for a maximum of 1 URE per 10¹⁴ bit reads, which equates to one URE for every 12.5TB of data read from the drive.

If you have read more than 12.5TB of data from this drive then having a single URE does not indicate a problem with the drive.
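The arithmetic behind the 12.5TB figure can be checked with a short sketch. The function name is hypothetical, and the result is an average implied by the rating, not a guarantee for any individual drive.

```shell
# Sketch: convert a rating of "1 URE per 10^N bits read" into terabytes read per URE.
ure_interval_tb() {
    exponent="$1"   # e.g. 14 for a rating of 1 URE per 10^14 bits
    # bits -> bytes (divide by 8) -> decimal terabytes (divide by 10^12)
    awk -v e="$exponent" 'BEGIN { printf "%.1f\n", (10 ^ e) / 8 / (10 ^ 12) }'
}

ure_interval_tb 14   # prints 12.5
```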

alexweiss · August 30, 2020, 6:54am

I wouldn’t say that restic stress-tests hardware more than other software does. But yes, it checks much more rigorously, which is part of its data format. In fact there are several cryptographic checksums computed and saved. For example, when considering a pack file: first, all blobs and the pack header are encrypted and MACed; second, all plaintext blob contents are SHA256ed; and third, the whole pack is again SHA256ed. If one of these checksums doesn’t match, restic will complain.
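The outermost of these checks (the SHA-256 over the whole pack) can be sketched by hand for a local repository, since restic names each pack file after the SHA-256 hash of its content. This is only an illustration under assumptions: it expects the local layout with packs under data/, it relies on the GNU sha256sum tool, the function name is hypothetical, and it does not cover the MAC or the per-blob hashes.

```shell
# Sketch: recompute the SHA-256 of every file under <repo>/data and compare it
# with the file name. Prints one line per mismatch and nothing if all match.
verify_pack_names() {
    repo="$1"
    find "$repo/data" -type f | while IFS= read -r f; do
        name=$(basename "$f")
        # sha256sum prints "<hash>  <path>"; keep only the hash.
        sum=$(sha256sum "$f" | cut -d ' ' -f 1)
        if [ "$sum" != "$name" ]; then
            echo "mismatch: want $name, got $sum"
        fi
    done
}

# Example (hypothetical repository path):
# verify_pack_names /mnt/repo-drive/restic-repo
```

A mismatch reported here corresponds to the "Pack ID does not match" kind of error. restic check --read-data remains the proper tool, because it also verifies the encrypted blobs inside each pack.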

While it is easy to propose a posteriori the one check that would have identified a specific hardware problem, it is pretty hard to add checks a priori that cover a whole range of failure scenarios. In the end, restic is a backup tool and not a hardware stress-testing tool. For that purpose there are memtest and the like, which use systematic test patterns.

Moreover, every additional test would cost performance, such as recalculating checksums or even storing and reading the data multiple times to and from the storage backend. IMO that should not be done by default. And depending on your own considerations, you are free to run restic check as it suits your requirements.

I wouldn’t say so. It does check the integrity of the data stored in the storage backend. If your backend doesn’t guarantee data integrity (like “just” a local hard drive), this test should be performed regularly. If you are using cloud storage like S3, the storage provider does guarantee data integrity (and they do run regular tests on the data themselves), so this test could be safely omitted.
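For large local repositories where a full --read-data run takes more than a day, one option is to spread the reading over several runs with restic's --read-data-subset=n/t option of the check command, assuming the installed restic version supports it. The sketch below only computes the n/t argument from a day counter. The helper name, the 7-day rotation and the REPO variable in the example are assumptions for illustration.

```shell
# Sketch: pick one of <total> subsets per day so that the whole repository
# is read once every <total> days.
subset_for_day() {
    day="$1"      # day counter, e.g. the day of the year from `date +%j`
    total="$2"    # number of subsets to rotate through
    # Strip leading zeros so that values like "008" are not parsed as octal.
    day=$(printf '%s' "$day" | sed 's/^0*//')
    # Map the day counter onto 1..total.
    echo "$(( (${day:-0} % total) + 1 ))/$total"
}

# Example (REPO is a hypothetical variable):
# restic -r "$REPO" check --read-data-subset="$(subset_for_day "$(date +%j)" 7)"
```

The trade-off is that corruption in a given pack may only be noticed up to a full rotation later, so the number of subsets should match how long you can tolerate an undetected error.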

And if you want to perform a test against your memory regularly, it should be memtest or similar…