Repository broken after interrupted network?

atdotcom · January 11, 2024, 9:37am

I am doing backups to a hetzner storagebox using restic and rclone over ssh.

Recently, I got the error shown below.
I think this happened because the network connection was interrupted.

However, I am concerned about it saying Fatal: repository contains errors.

Does this mean that the repository is definitely broken, and that I need to delete it and recreate it?
Or is this just a side effect of the connection being cut?

Can I just run a restic check on the repository to ensure everything is OK?

(this is a multi TB repo - it will take several days to recreate)

rclone: packet_write_wait: Connection to x.x.x.x port 23: Broken pipe
Load(<data/423c6f0009>, 17191814, 0) returned error, retrying after 636.224214ms: ReadFull: unexpected EOF
Load(<data/f27287cefe>, 16895827, 0) returned error, retrying after 359.525712ms: ReadFull: unexpected EOF
Load(<data/32b95b43e3>, 18275105, 0) returned error, retrying after 406.479426ms: ReadFull: unexpected EOF
Load(<data/62b82e4d4b>, 17194203, 0) returned error, retrying after 537.757501ms: ReadFull: unexpected EOF
Load(<data/22bc4a81a6>, 17595060, 0) returned error, retrying after 366.134015ms: ReadFull: unexpected EOF
Save(<lock/d76b1f8eac>) returned error, retrying after 641.01034ms: Post "http://localhost/locks/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx": unexpected EOF
pack xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx failed to download: StreamPack: rclone stdio connection already closed
pack xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx failed to download: StreamPack: rclone stdio connection already closed
pack xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx failed to download: StreamPack: rclone stdio connection already closed
pack xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx failed to download: StreamPack: rclone stdio connection already closed
[...]
pack xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx failed to download: StreamPack: rclone stdio connection already closed
error while unlocking: rclone stdio connection already closed
Fatal: repository contains errors

kapitainsky · January 11, 2024, 9:57am

In theory network problems should not lead to total repo corruption. But theory and real life often are not the same.

You actually should run a check. Have a read of the documentation and run:

for a high-level check - restic check
to not leave any bit of data unchecked - restic check --read-data

Good practice is to run checks periodically. I run restic check every time after backup/forget/prune and, in addition, restic check --read-data-subset 5% daily, which checks a random part of my repo.

atdotcom · January 11, 2024, 10:22am

Thanks. I already run restic check every time I do a prune. I do not run it after each backup - should I?

I have not run a full “restic check --read-data” yet as this will take weeks…
But if I run “restic check --read-data”, can I be sure that the repository does not contain any errors?

kapitainsky · January 11, 2024, 10:29am

You do not have to. In my case I run backups daily, late at night, so I do not mind running the full sequence - restic backup → restic forget/prune → restic check - and sending an email if there are any problems. As I mentioned, I also run restic check --read-data-subset 5% - the percentage is based on my repo size and connection speed. Roughly, I set it so that it finishes in about 1h.
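A rough sketch of such a nightly sequence in bash could look like the following. It assumes the repository location and credentials are already provided through the environment (for example RESTIC_REPOSITORY); the function names nightly and notify, the backup path and the --keep-daily 7 retention policy are placeholders for illustration, not something taken from my actual setup.

```shell
#!/usr/bin/env bash
# Sketch: backup -> forget/prune -> check, stopping at the first failure.
# Assumption: repository and password are configured via environment variables.

notify() {
    # Placeholder: replace with your own mail or alerting command.
    echo "restic nightly run failed at step: $1" >&2
}

nightly() {
    # $1 is the path to back up; the retention policy below is only an example.
    restic backup "$1"                   || { notify "backup"; return 1; }
    restic forget --keep-daily 7 --prune || { notify "forget/prune"; return 1; }
    restic check                         || { notify "check"; return 1; }
}

# Example call (e.g. from cron):
# nightly /path/to/data
```

Each step only runs if the previous one succeeded, so a failed backup never gets as far as prune.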

I can see that for large repos it is not straightforward indeed.

Yes. This is actually the only way to make sure that everything is in order. You could maybe stagger it across many days:

restic check --read-data-subset 1/100
restic check --read-data-subset 2/100
…
restic check --read-data-subset 100/100

I would adjust the number of parts so that each part can run all night, for example.

To get the n of a daily n/t sequence you can use $(( $(date +%s) / 86400 % $t +1)) in bash, where t is the total number of parts.
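Wrapped into a small helper, the daily index could be computed like this. This is only a sketch: subset_index is a made-up function name, and the optional second argument (epoch seconds) exists just so the arithmetic can be tried with fixed values.

```shell
#!/usr/bin/env bash
# Sketch: pick today's part n out of t for `restic check --read-data-subset n/t`.
# subset_index is a hypothetical helper, not part of restic.

subset_index() {
    local total="$1"
    local now="${2:-$(date +%s)}"   # epoch seconds; defaults to the current time
    # Days since the epoch, wrapped into the range 1..total.
    echo $(( now / 86400 % total + 1 ))
}

# Example call with 100 parts, one per day:
# restic check --read-data-subset "$(subset_index 100)/100"
```

The index advances by one every 24 hours (counted in UTC) and wraps back to 1 after t days.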

alexweiss · January 11, 2024, 12:04pm

@atdotcom All commands are designed such that an abort at any time won’t lead to a broken repository. Just running restic check (without --read-data) will confirm this. --read-data is only needed to verify the integrity of the data within the backup. An aborted command that was handled wrongly would instead lead to structural damage such as missing references - these are all things a simple check command is able to detect.

atdotcom · January 11, 2024, 12:19pm

OK. So when it says Fatal: repository contains errors, it does not mean that the repository literally contains errors. It is just that whatever operation it was doing failed and that there might be some metadata errors that are detected (and fixed?) by running restic check?

I have run restic check and no errors were found…

atdotcom · January 11, 2024, 12:21pm

Will this work even if the repo grows during those days/checks?

kapitainsky · January 11, 2024, 12:23pm

Yes. There is always an n/t part :) Will it then cover 100% of the repo? Nope. There might be some “missed” parts. It is a trade-off vs running the full --read-data mode.

alexweiss · January 11, 2024, 12:34pm

The fact that restic requested data from the repository which the backend wasn’t able to deliver indicated to restic that there are problems with the repository. However, given that you had connection issues, those issues can also be the reason the backend could not deliver what was requested. But, from restic’s point of view, the (accessible) repository was in a corrupted state.

As you successfully ran restic check, you verified that this was just a temporary state due to the interrupted network.

atdotcom · January 11, 2024, 12:52pm

OK. Thanks for clearing this up!