Restic with rclone on unreliable connection - is my repository intact?

Hi,

I have a notoriously unreliable internet connection. About once a week, a long-running restic connection gets randomly interrupted.

I run restic daily to back up to a remote server. My backup script runs the following restic commands in this order (a minimal sketch of the script follows the list):

  • restic backup
  • restic forget --keep-last X […]
  • restic prune
  • restic check
  • restic check --read-data-subset 10%
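For reference, a minimal sketch of the script (repository URL, password file, backup paths and retention count below are placeholders, not my real setup):

  #!/bin/sh
  # Minimal sketch of the daily backup script; repo URL, password file,
  # backup paths and retention count are placeholders.
  set -e
  export RESTIC_REPOSITORY="rclone:backupserver:restic-repo"
  export RESTIC_PASSWORD_FILE="$HOME/.restic-password"

  restic backup /home /etc
  restic forget --keep-last 30
  restic prune
  restic check
  restic check --read-data-subset 10%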

As mentioned, my connection is unreliable, so each of these operations may be interrupted at an arbitrary point.

Does restic handle this gracefully?

I would assume it does.

What prompted me to post here is that sometimes, when this happens, I get messages like these in the logs:

Example 1:


  rclone: packet_write_wait: Connection to x.x.x.x port 23: Broken pipe
  Load(<data/XXXXXXXX>, 0, 0) returned error, retrying after 487.924606ms: Copy: unexpected EOF
  Load(<data/XXXXXXXX>, 0, 0) returned error, retrying after 675.519179ms: Copy: unexpected EOF
  error for tree XXXXXXXX:
  Load(<data/XXXXXXXX>, 0, 0) returned error, retrying after 521.049113ms: Copy: unexpected EOF
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  Load(<data/XXXXXXXX>, 0, 0) returned error, retrying after 412.388072ms: Copy: unexpected EOF
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  Load(<data/XXXXXXXX>, 0, 0) returned error, retrying after 749.266558ms: Copy: unexpected EOF
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error for tree XXXXXXXX:
    ReadFull(<data/XXXXXXXX>): rclone stdio connection already closed
  error while unlocking: rclone stdio connection already closed
  Fatal: repository contains errors

Example 2:

rclone: packet_write_wait: Connection to x.x.x.x port 23: Broken pipe
Load(<data/XXXXXXXX6e>, XXXXXXXX, 0) returned error, retrying after 706.258216ms: ReadFull: read |0: file already closed
Load(<data/XXXXXXXXaf>, XXXXXXXX, 0) returned error, retrying after 711.953067ms: ReadFull: read |0: file already closed
Load(<data/XXXXXXXXbb>, XXXXXXXX, 0) returned error, retrying after 402.503648ms: ReadFull: read |0: file already closed
Load(<data/XXXXXXXX55>, XXXXXXXX, 0) returned error, retrying after 679.61763ms: ReadFull: read |0: file already closed
Save(<lock/XXXXXXXX0f>) returned error, retrying after 450.589951ms: Post "http://localhost/locks/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX": read |0: file already closed
Load(<data/XXXXXXXX29>, XXXXXXXX, 0) returned error, retrying after 472.250231ms: ReadFull: read |0: file already closed
pack XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX failed to download: StreamPack: rclone stdio connection already closed
pack XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX failed to download: StreamPack: rclone stdio connection already closed
pack XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX failed to download: StreamPack: rclone stdio connection already closed
[...]
pack XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX failed to download: StreamPack: rclone stdio connection already closed
pack XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX failed to download: StreamPack: rclone stdio connection already closed
pack XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX failed to download: StreamPack: rclone stdio connection already closed
pack XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX failed to download: StreamPack: rclone stdio connection already closed
pack XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX failed to download: StreamPack: rclone stdio connection already closed
error while unlocking: rclone stdio connection already closed
Fatal: repository contains errors

I am specifically concerned by the message “repository contains errors”.

I am not sure whether this means the repository actually does contain errors, or whether this message is also produced when a check fails because the connection was interrupted.

I would like to just do a full check of all the data. However, that is not practical here: the repository is many TB in size, and on this connection a full check never finishes before some random dropout interrupts it…

So my questions boil down to:

1: Does this message really mean the repository positively contains errors, or can it also be produced by these kinds of network interruptions?

2: Do you have any advice on how to handle such situations with unreliable connectivity? For example, would it be a good approach to do a full check with --read-data from another location with better connectivity? (That IS possible for me; however, it would expose the repository key to another location, which is not something I would like to do regularly if it is not necessary.)

An important factor might be what backend you’re using. Some handle this better than others.

OK. I am using SSH and rclone.
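To be specific, the repository is addressed through restic's rclone backend, roughly like this (the remote name and path are placeholders, not my real configuration):

  # Repository accessed via restic's rclone backend; "backupserver" is a
  # placeholder rclone remote of type sftp (i.e. the data goes over SSH).
  export RESTIC_REPOSITORY="rclone:backupserver:restic-repo"
  restic snapshots

  # restic's plain SFTP backend would be another way to go over SSH:
  #   restic -r sftp:user@backupserver:/srv/restic-repo snapshots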

While I understand that different backends may handle interruptions differently, I think restic should only produce the message “repository contains errors” when it has positively determined that the repository contains errors, not when some timeout happened or the TCP connection was unexpectedly closed. In that case, I would expect restic to report something like “unexpected connection termination” instead…

check always returns a Fatal: repository contains errors error if it cannot verify that the repository is intact.

In your case the errors are all variants of rclone stdio connection already closed, Copy: unexpected EOF and read |0: file already closed. All of those are different ways of saying that the connection to rclone was interrupted. That is, the errors only indicate that the network connection was interrupted, not an actual problem with the repository itself.

check deliberately also returns an error if reading from the repository failed, as a backup that cannot be read may be impossible to restore when needed. After all, if the network connection breaks during a restore, then that operation will fail too. Or to put it differently, check must not claim that a repository is intact if it is unable to verify that.

restic check --read-data-subset 10% performs the same checks as restic check but additionally verifies some of the pack files. Checking 10% of the repository each day is a lot; you might want to reduce the size of the subset a bit.
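If you want to keep full read coverage over time while reducing the per-run load, one option is the n/t form of --read-data-subset, which checks a fixed part of the pack files and can be rotated across runs. A rough sketch (the 30-way split and the date arithmetic are just an illustration):

  # Check one fixed 1/30th of the pack files per run; cycling through the
  # parts covers the whole repository over 30 runs.
  day=$(date +%j | sed 's/^0*//')        # day of year, leading zeros stripped
  part=$(( day % 30 + 1 ))               # cycles through 1..30
  restic check --read-data-subset "$part/30"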

The usual setup is also to run prune only every week or even less frequently, although the ideal interval depends on how much your data changes.
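For example, the daily script could gate prune on the day of the week; a minimal sketch (Sunday and the retention value are arbitrary choices):

  # Run forget after every backup, but prune only once a week.
  restic backup /home /etc
  restic forget --keep-last 30
  if [ "$(date +%u)" -eq 7 ]; then    # 7 = Sunday
      restic prune
  fi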

What exactly does that mean? Are you using a setup similar to “Append-only backups with restic and rclone”?

I also back up some stuff to a remote server over ssh, but I do the “check” on the other side (i.e., log in, run check locally there, and type in the password when prompted).
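Roughly like this, assuming restic is installed on the server (host name and repository path are made up):

  # Run the check on the machine that hosts the repository, so no repository
  # data has to cross the flaky link. Host and path are placeholders.
  ssh -t backupserver 'restic -r /srv/restic-repo check --read-data'
  # restic prompts for the repository password on the remote side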

If I had to restore from that repo over an unreliable connection, I’d just use rsync to pull the entire repo down and then restore locally. A flaky network can delay my restore but cannot totally deny it. (I do have enough disk space locally…)
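Something along these lines (host and paths are made up; rsync's --partial keeps partially transferred pack files so a dropped connection doesn't force a full retransfer):

  # Pull the whole repository down first; re-running the same rsync after a
  # dropped connection continues from what has already arrived.
  rsync -a --partial --progress backupserver:/srv/restic-repo/ /data/restic-copy/

  # Then restore from the local copy, with no network involved.
  restic -r /data/restic-copy restore latest --target /data/restore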

Just tossing out an idea…


This is indeed a good idea, and I also planned to do something like this in the event that I have to do a full recovery. But I rarely have to do a full recovery from scratch; most of the time it is a few critical files that need to be restored… However, I am paranoid about whether the repo is intact.

In your case the errors are all variants of rclone stdio connection already closed, Copy: unexpected EOF and read |0: file already closed. All of those are different ways of saying that the connection to rclone was interrupted. That is, the errors only indicate that the network connection was interrupted, not an actual problem with the repository itself.

Thanks for clearing this up. So I guess this means that there is no evidence of the repo being corrupt. Phew…

check deliberately also returns an error if reading from the repository failed, as a backup that cannot be read may be impossible to restore when needed. After all, if the network connection breaks during a restore, then that operation will fail too. Or to put it differently, check must not claim that a repository is intact if it is unable to verify that.

I agree 100% that check should return an error if reading the repo failed. However, I do not agree that it should return the specific error saying that the repo contains errors. Instead, it should return an error saying that the check failed. It might say that the repo MAY contain errors, but saying that the repo contains errors is factually incorrect, and I believe this is not the optimal behaviour.

So, I absolutely agree that “check must not claim that a repository is intact if it is unable to verify that”. But instead of saying “repository contains errors”, it would be better if it said something like “Fatal: check aborted unexpectedly. Could not verify that the repository is intact”.

Anyway, I had another machine with a better connection to the backup host and I ran a full check --read-data from that host. The repository did not contain errors - phew :sweat_smile: