Dropped packets while backing up cause errors in archive

My initial restic backup of a large disk completed without errors but does not pass check --read-data. I suspect this is due to some dropped network packets, but I am surprised that this can corrupt an archive. I have provided more details below in the format of an issue report. I have not filed it on GitHub yet, as I am hoping we can make progress in the forum and refine the report before filing.

Output of restic version

$ restic version
restic 0.9.5 compiled with go1.12.2 on linux/amd64

This is the version that comes with Fedora 30.

How did you run restic exactly?

$ restic -r b2:my-restic-bucket init
enter password for new repository:
enter password again:
created restic repository 13da561578 at b2:my-restic-bucket

Please note that knowledge of your password is required to access
the repository. Losing your password means that your data is
irrecoverably lost.

$ restic -r b2:my-restic-bucket -p ~/.restic -o b2.connections=10 backup ~
repository 13da5615 opened successfully, password is correct
created new cache in /root/.cache/restic

Files: 1934190 new, 0 changed, 0 unmodified
Dirs: 1 new, 0 changed, 0 unmodified
Added to the repo: 2.144 TiB

processed 1934190 files, 2.477 TiB in 43:29:57
snapshot 52f903b4 saved

2 days later I had backed up 458,364 files and 2,359.4 GB to Backblaze B2.
There were no errors reported in the initial backup.

$ restic -r b2:my-restic-bucket -p ~/.restic -o b2.connections=10 check --read-data
using temporary cache in /tmp/restic-check-cache-698511414
repository 13da5615 opened successfully, password is correct
created new cache in /tmp/restic-check-cache-698511414
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
Pack ID does not match, want 36a626a2, got b08dd8bb
Pack ID does not match, want 8d421b9d, got c9fe56bf
Pack ID does not match, want be85dd1d, got 847351d1
Pack ID does not match, want ee5dc22c, got 7050fd53
Pack ID does not match, want 8ac296ae, got 9e0d3310
Pack ID does not match, want bd60a62d, got 4b259d73
Pack ID does not match, want 1026ab26, got 926b0566
Pack ID does not match, want f1b9b1d2, got 5219c40b
Pack ID does not match, want 58c24569, got 8cc8403c
Pack ID does not match, want 5b56c26c, got 5fdb425d
Pack ID does not match, want 4dc20920, got 17a13414
Pack ID does not match, want 81a70ddd, got b33fbb92
Pack ID does not match, want c457a6ae, got aeaa2612
Pack ID does not match, want 30786f5f, got 8d8cd854
Pack ID does not match, want 6b1607a9, got 54278efd
Pack ID does not match, want e73b7b6a, got 0ec9924c
Pack ID does not match, want a4a78195, got 3ad461ce
Pack ID does not match, want 0ea7cec8, got 68bb406a
Pack ID does not match, want 73eb318b, got fb29310c
Pack ID does not match, want 4d926e4a, got 4a09cdb2
Pack ID does not match, want 88fa76ca, got daa53b46
Pack ID does not match, want 6b46dc25, got 397e8b14
Pack ID does not match, want 5e3a733c, got e1461564
Pack ID does not match, want d902b416, got a8ca6e9e
Pack ID does not match, want 8de6582c, got 4754c3dc
Pack ID does not match, want 8d28ce5a, got b14ec530
Pack ID does not match, want eed1eae3, got a627235f
Pack ID does not match, want 712ff936, got f65e4d3a
Pack ID does not match, want dd6d86fa, got a866e551
Pack ID does not match, want 73fcb184, got 63fe324d
Pack ID does not match, want 44f41017, got d4ceb40d
Pack ID does not match, want 1adc77d0, got 334271a5
Pack ID does not match, want 8f07123a, got 68cef679
Pack ID does not match, want 259e9e9d, got 91d3afa8
Pack ID does not match, want e30f8931, got 875c010b
Pack ID does not match, want 15ddf41b, got 7e39f11d
Pack ID does not match, want b5c51a03, got 54390591
Pack ID does not match, want 615ff026, got b62f0632
Pack ID does not match, want c51e8263, got 0d4c54aa
Pack ID does not match, want e26bd981, got d365c9a3
pack 82b0562f contains 1 errors: [Blob ID does not match, want f07e8c4d, got cb08959c]
Pack ID does not match, want 720b8ffe, got f11c731f
Pack ID does not match, want 10283f77, got 0069d362
Pack ID does not match, want 17587d23, got cd4316d1
Pack ID does not match, want b86c1dd5, got f4ac96d1
Pack ID does not match, want 75654a01, got 7f878bed
Pack ID does not match, want ff79e9a7, got 41d3e1bb
Pack ID does not match, want 056dd30f, got 0dce4145
Pack ID does not match, want eb7b6470, got 4c55aa9f
Pack ID does not match, want 2e0b13c0, got ee41263c
Pack ID does not match, want e4876892, got a80a4098
Pack ID does not match, want 75eff36e, got 14ed7283
Pack ID does not match, want c8a6705e, got ccad2b8c
Pack ID does not match, want 27db701d, got 70ccf954
Pack ID does not match, want 012b3fc0, got fbb58f71
Pack ID does not match, want fd26fedf, got 06ac52e0
Pack ID does not match, want 774b94fa, got 07860142
Pack ID does not match, want 80f506f1, got 37ad1bf5
Pack ID does not match, want 5ec12371, got 10016d33
Pack ID does not match, want ba660cc7, got 231e99b0
Pack ID does not match, want 24dd2244, got 8026ffc5
Pack ID does not match, want 41217e48, got 9ce1a7c0
Pack ID does not match, want 13fa1c24, got 96fc8300
Pack ID does not match, want 2d6dfd31, got abded910
Pack ID does not match, want 1552ee83, got d9860217
Pack ID does not match, want 46417dfe, got 26477688
Pack ID does not match, want 8ede1c33, got eab84452
Pack ID does not match, want 6a10360f, got b06a9d94
Pack ID does not match, want 4a2fb6da, got ea55b897
Pack ID does not match, want 83e6385f, got 3a2963ca
Pack ID does not match, want 8c96d754, got 340624a0
Pack ID does not match, want 41c69fb1, got 1922f362
Pack ID does not match, want 691ad997, got 1d5e2912
Pack ID does not match, want 7244b101, got 34a89296
Pack ID does not match, want 20445a75, got 190b910b
Pack ID does not match, want d82ad860, got 226a5751
Pack ID does not match, want a4f8f213, got e663623f
Pack ID does not match, want e0400c6e, got 03508123
Pack ID does not match, want 97e831dc, got 276c4e9b
Pack ID does not match, want 53a91a6c, got 9c1ee7ea
[31:32:59] 100.00% 457558 / 457558 items
duration: 31:32:59
Fatal: repository contains errors

What backend/server/service did you use to store the repository?

Backblaze B2

Expected behavior

No errors while checking a newly created backup.

Actual behavior

There were two different kinds of errors in the backup.

Pack ID does not match, want 8ac296ae, got 9e0d3310
[…]
pack 82b0562f contains 1 errors: [Blob ID does not match, want f07e8c4d, got cb08959c]
[…]

Steps to reproduce the behavior

Unfortunately nothing concrete. The repo is large so re-creating costs money and takes time.

Do you have any idea what may have caused this?

I have no reason to suspect that my RAID1 array has issues or that my machine has RAM or CPU issues. I have not noticed problems with the data on my disks, and the machine runs fine and is healthy.

I rather suspect that there were issues in the transmission of the data to Backblaze. I see some errors in the syslog:

Jun 06 21:33:23 foo.example.net kernel: IPv4 PACKET DROP: IN=enp0s25 OUT= MAC=00:1c:c0:a4:02:8f:b4:fb:e4:86:31:8d:08:00 SRC=206.190.209.238 DST=192.168.1.1 LEN=52 TOS=0x00 PREC=0x00 TTL=51 ID=10368 DF PROTO=TCP SPT=443 DPT=55424 WINDOW=1821 RES=0x00 ACK URGP=0

This is a packet dropped by the iptables firewall because for some reason the kernel did not see the packet as matching this rule:
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
5 2125M 2513G ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED

It is a Backblaze IP, so relevant to this backup session:

$ dig -x 206.190.209.238 +short
pod-000-1113-05.backblaze.com.
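For anyone who wants to tally such drops per remote host, here is a minimal sketch. It assumes the log prefix is exactly `IPv4 PACKET DROP` as in the line above and that the kernel log is available through `journalctl -k`; the function name `count_drops` is made up for illustration.

```shell
# count_drops: read kernel log lines on stdin and print "<count> <source IP>"
# for every sender, most frequent first.
# Assumption: dropped packets are logged with the prefix "IPv4 PACKET DROP".
count_drops() {
    grep 'IPv4 PACKET DROP' \
        | sed -n 's/.* SRC=\([0-9.]*\) .*/\1/p' \
        | sort \
        | uniq -c \
        | sort -rn
}

# Possible usage (assumes a systemd journal; not run here):
#   journalctl -k | count_drops
```

Each source address can then be resolved with `dig -x` as shown above to see which ones belong to Backblaze.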

I see 41 packets dropped over the duration of the backup to that host. The drops happened in four bursts of 1, 32, 3, and 5 packets. Including other Backblaze hosts, 118 packets were dropped. Compared with the number of packets needed to transfer 2,359.4 GB of data, this is negligible.
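As a back-of-the-envelope check of that claim, the sketch below estimates the drop fraction. The payload size of 1,448 bytes per packet is an assumption (a 1,500-byte Ethernet MTU minus typical IP/TCP headers), GB is taken as 10^9 bytes, and the function name `drop_fraction` is hypothetical. The dropped packets were inbound while the bulk data flowed outbound, so this is only an order-of-magnitude figure.

```shell
# drop_fraction BYTES PAYLOAD_PER_PACKET DROPPED
# Prints the estimated packet count and the fraction of packets dropped.
# Assumption: every packet carries PAYLOAD_PER_PACKET bytes of data.
drop_fraction() {
    awk -v bytes="$1" -v payload="$2" -v dropped="$3" 'BEGIN {
        packets = bytes / payload
        printf "%.0f packets, drop fraction %.1e\n", packets, dropped / packets
    }'
}

# 2,359.4 GB transferred, 118 packets dropped across all Backblaze hosts.
drop_fraction 2359400000000 1448 118
```

Under these assumptions the transfer needs on the order of a billion packets, so 118 drops is well under one in a million.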

Regardless of why the kernel dropped the packets, I would not expect this to compromise the integrity of the restic archive.

Do you have an idea how to solve the issue?

I am wondering if there is a way to recreate just the packs that are known to be incorrect.
I still have the original data so it should be possible to selectively recreate these.

Did restic help you or make you happy in any way?

I love the helpful attitude in the restic community.

I would not expect that either… restic will retry if there is a network error.

I should note that I see similar dropped packets logged on my edge router. I did some research on this a while back, and the best conclusion I could come to was that the errors were spurious: it looked like the remote system ACKing a FIN+ACK from my side. For some reason, the local firewall thought the connection had been completely shut down but the remote host did not, and the remote host was simply trying to complete the last step of the FIN, FIN+ACK, ACK dance. The upshot was that the connection was closing anyway, and the dropped packet ultimately did not cause any kind of damage to an ongoing connection.

Welcome to the forum @jlduprat, and thanks for the comprehensive report!

While I don’t have the time right now to dig into this issue, I’m certain that “dropped packets” as the error source is very unlikely:

  • When a packet is dropped, the TCP stack will make sure it is resent
  • The connections to B2 are always HTTPS, so the TLS layer ensures data integrity. “Data is missing in the middle of a connection” is a fatal error in TLS, so the connection would have been aborted
  • When saving data to B2 does not work, restic just retries the request until it succeeds or the time is up
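To illustrate the last point, here is a generic retry-until-deadline sketch. It is not restic's actual implementation (restic is written in Go); the function name `retry_until` and the doubling delay are assumptions chosen only to show the idea that a failed upload is repeated rather than silently accepted.

```shell
# retry_until SECONDS CMD [ARGS...]
# Re-run CMD until it succeeds or SECONDS have elapsed.
# Returns 0 on success, 1 if the deadline passed without a success.
retry_until() {
    deadline=$(( $(date +%s) + $1 ))
    shift
    delay=1
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$delay"
        delay=$(( delay * 2 ))   # simple exponential backoff (assumption)
    done
}

# Possible usage with a hypothetical upload command:
#   retry_until 900 upload_pack_file /tmp/pack.tmp
```

The key property is that a request either eventually succeeds completely or the operation fails loudly; a partially transmitted file is never treated as saved.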

I can understand that you have no reason to suspect your hardware. Please be aware, though, that over the last five years restic has revealed a number of hardware issues that our users were not aware of before. Do you have the option to run memtest on the machine for at least a few hours?

You’re also right about the two kinds of errors:

The first kind, Pack ID does not match, is an error on the outermost level: restic requested a file from B2 for which it knows the SHA256 hash of the contents, but got something with a different hash back. The data might have been modified at rest, during transit, in the memory of the machine, or even at backup time before it could be saved to B2.

The second kind, Blob ID does not match, is quite different: restic requested the pack 82b0562f and the hash of the contents matched the file name, so the data was not modified in transit or at rest. But part of the (encrypted) data has been modified, which could only have happened between restic encrypting the data (a so-called blob; pack files contain one or more of these) and saving the data to a temp file before uploading to B2. In theory, it could also have happened during check, but you can easily test that for yourself:

  • Find the complete filename for the pack: restic list packs | grep '^82b0562f'
  • Download the pack and check its hash: restic cat pack <ID> | sha256sum
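The two steps above can be wrapped in a small helper. This is only a sketch: the function name `check_pack_stream` is made up, and the repository options in the usage comment are copied from the commands earlier in this thread.

```shell
# check_pack_stream EXPECTED_ID
# Reads pack contents on stdin and compares their SHA-256 hash with
# EXPECTED_ID (the full pack file name). Returns 0 on match, 1 otherwise.
check_pack_stream() {
    actual=$(sha256sum | cut -d' ' -f1)
    if [ "$actual" = "$1" ]; then
        echo "hash matches: $actual"
    else
        echo "hash differs: want $1, got $actual"
        return 1
    fi
}

# Possible usage against the repository from this thread (not run here):
#   id=$(restic -r b2:my-restic-bucket -p ~/.restic list packs | grep '^82b0562f')
#   restic -r b2:my-restic-bucket -p ~/.restic cat pack "$id" | check_pack_stream "$id"
```

If the hash matches the file name, the pack stored in B2 is byte-for-byte what was uploaded, which rules out damage at rest or in transit for that file.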

Otherwise, I can only imagine that this happened within restic, at backup. This leaves two possibilities:

  1. A bug in restic or the Go compiler/runtime
  2. A hardware issue on the machine running restic (RAM, CPU, storage?)

While I cannot rule out either of those (and there may be a bug), I think the second one is more likely, because many other people are running restic without data integrity issues on even more data (up to several tens of terabytes, I was told), and we’ve seen similar issues in the past which were indeed caused by hardware problems.

So, would you mind running memtest and reporting back? :slight_smile:

I agree with your assessment of the TCP/TLS connection. The dropped packets should not have resulted in a broken archive.

So, would you mind running memtest and reporting back? :slight_smile:

Yes, will do, and I will report back.

In the meantime, is there a way to recreate the broken pack files?

JL

Let’s not forget that, before golang and restic even see any errors, the OS TCP stack would already have retried those missing packets many times over… this IMO makes the “packet loss damaged my repo” hypothesis even less probable.

In my experience, memtest does not catch a lot of RAM errors. What I do here is run mprime86 in torture mode, in parallel with dledford-memtest, for at least 24h (ideally 48h) uninterrupted.

@jlduprat, does the machine have ECC RAM?

– Durval.

I was not able to run a memtest for a variety of reasons. Instead, I elected to replace all the RAM in the machine; it was between 5 and 10 years old, so failure was certainly a possibility. Having done so, I ran a clean backup, followed by a check --read-data. That took another 5 days.

The only surprise in the check was the following message:

pack ecf52b1c contained in several indexes: {944ba2f0 ef473b3e}
This is non-critical, you can run `restic rebuild-index' to correct this

I can report that the issue was indeed due to faulty RAM. I am not sure why the pack error came up the second time. Should I suspect further HW issues?

JL

Awesome, I’m very glad you found this before it started corrupting other data (which you might not have noticed until it was too late)!

There were no other errors? In particular, none along the lines of Pack ID does not match or Blob ID does not match? Then indeed the issue was probably caused by the old memory. :slight_smile:

Don’t worry too much about the “contained in several indexes” message: there’s a bug in restic somewhere (I haven’t figured out where yet) that sometimes causes meta-information (what data is stored where) to be saved in two or more indexes. It’s non-critical, so we added the hint for rebuild-index. :slight_smile: