Partial repository recovery with truncated/corrupted packs

I have a slightly damaged backup repository that I would like to recover rather than start over from scratch. The backup source is a remote machine with almost 2 TB of data, backed up at a maximum rate of 1 MB/s (2/3 of its maximum outgoing bandwidth), so the initial backup took more than a month. (The data doesn’t change much, so incremental backups are quick.)

The target is minio on a server local to me. The server shut down uncleanly, and now restic check --read-data (running directly on the server) reports a few packs that failed to download with the error “StreamPack: ReadFull: unexpected EOF”. I have addressed the underlying problem on the server and now want to repair the repository.

The backup source does not need to be restored right now. I just want the next snapshot to fill in the now-missing files so that I again have a consistent repository.

I don’t see an option to have restic automatically delete damaged packs as part of either restic check or restic repair. It looks like restic can remove references to packs that are entirely missing, but I have to find and delete the damaged packs manually in the underlying storage before restic repair index and restic repair snapshots --forget will give me a consistent repository. The next backup snapshot should then fill in all the current files.

So far I’ve been deleting truncated packs manually. Is there an option I’ve missed that would allow a one-shot “delete all corrupted packs” to speed up this kind of recovery?

Thanks!

In your case, the easiest way to detect all truncated pack files is to run restic check (using a recent version), which should report the IDs of all truncated pack files soon after startup.

The other option is to verify SHA-256 hashes directly on the S3 repository. Mount it, e.g. using rclone, and then use sha256sum (or shasum -a 256) to verify that each file’s SHA-256 hash matches its filename. Truncated files will show a mismatch.
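A minimal Python sketch of that hash check, assuming the repository is mounted at a local path (e.g. via rclone mount) and uses restic’s default layout, where each pack file lives under data/<first two hex characters>/<pack ID> and the pack ID is the SHA-256 of the file’s content. The function names and the script invocation are hypothetical.

```python
import hashlib
import os
import sys


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_packs(repo_root):
    """Yield (path, actual_hash) for every file under data/ whose content
    hash does not match its filename (e.g. truncated or zero-length packs)."""
    data_dir = os.path.join(repo_root, "data")
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            actual = sha256_of_file(path)
            if actual != name:
                yield path, actual


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_packs.py /mnt/restic-repo
    for path, actual in find_mismatched_packs(sys.argv[1]):
        print(f"MISMATCH {path} (content hash {actual})")
```

This reads every pack in full, so on a repository of about 2 TB behind a network mount it will take a long time. restic check only needs a short time after startup to report truncated files.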

That’s orthogonal to my question. As my question shows, I was already running restic check. The question was not about detecting corrupt packs; it was about deleting them automatically.

I had to read the output from restic check and manually delete the corrupt packs before running restic repair to recover the repository.

I have successfully recovered, so this question is no longer for my benefit; it’s for the next person’s benefit.

It would have been more convenient if something like restic check --read-data --delete-corrupt-packs had been available, to avoid having to click manually through the minio administrative interface to find and delete the corrupt packs.
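Until restic automates this step, the manual deletion can be scripted against a mounted view of the repository. Below is a minimal Python sketch, assuming the bucket is mounted locally (e.g. via rclone mount), that it uses restic’s default data/<first two hex characters>/<pack ID> layout, and that you have the full 64-character pack IDs reported by restic check. The helper names are hypothetical, --delete-corrupt-packs is not an existing restic flag, and the script defaults to a dry run.

```python
import os
import re

# A full pack ID is 64 lowercase hex characters (a SHA-256 digest).
PACK_ID_RE = re.compile(r"^[0-9a-f]{64}$")


def pack_path(repo_root, pack_id):
    """Map a pack ID to its path in restic's default repository layout."""
    if not PACK_ID_RE.match(pack_id):
        raise ValueError(f"not a full pack ID: {pack_id!r}")
    return os.path.join(repo_root, "data", pack_id[:2], pack_id)


def delete_packs(repo_root, pack_ids, dry_run=True):
    """Delete the given pack files; with dry_run=True only report them.

    Returns the paths that were (or, in a dry run, would have been) deleted.
    Packs that are already missing are skipped.
    """
    removed = []
    for pack_id in pack_ids:
        path = pack_path(repo_root, pack_id)
        if not os.path.exists(path):
            print(f"skip (already missing): {path}")
            continue
        if dry_run:
            print(f"would delete: {path}")
        else:
            os.remove(path)
            print(f"deleted: {path}")
        removed.append(path)
    return removed
```

After the deletion, restic repair index followed by restic repair snapshots --forget, as described in the question, should leave a consistent repository for the next backup to fill in.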

We’ll add more automation over time to make repairing a repository easier. From what I remember, truncated files aren’t particularly common and it’s possible to manually repair the repository. Thus, there hasn’t been a pressing need to fully automate this step.

That said, the problem with truncated files on Minio shouldn’t even exist. restic expects its backends to either store a file durably and completely or not at all. Anything else will sooner or later cause problems.

Trust me, I didn’t like it any more than restic did. I apologize for having a BIOS version installed that caused the storage server to freeze. :stuck_out_tongue:

I’m very happy that I was able to use restic check to find and resolve the problem. I was amazed at how much restic repair could fix up old snapshots in which all the affected files that still mattered were still in place. This was awesome. I had expected that old snapshots would just be entirely broken. Instead, I’m in good shape.

Zero-length files are, generally speaking, not an unusual symptom of system failures during heavy file-creation I/O, as might happen during a restic backup. But I’m guessing that I’m relatively unusual in having a large backup over a slow link to personally-managed storage, where there was meaningful value in recovering vs. starting over.

My expectation would have been that Minio is able to guarantee that a file cannot be lost once the upload has been completed successfully (but maybe the BIOS broke some of the underlying layers). Somewhat recent versions of restic and rest-server always write files to the repository in the following way: write to a temporary file, fsync, rename and fsync the directory. This guarantees that the file is really written to disk before confirming that an upload was successful. I’d expect that Minio does something similar, but maybe the fsync part didn’t work…
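The write sequence described above can be sketched in Python as follows. This illustrates the general POSIX pattern only, not the actual Go code in restic or rest-server, and the function name write_file_durably is hypothetical. The directory fsync assumes a POSIX system such as Linux.

```python
import os
import tempfile


def write_file_durably(path, data):
    """Write data so that, after a crash, path holds either its previous
    state or the complete new content: temp file, fsync, rename, fsync dir."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory so the rename stays
    # on one filesystem and is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # file contents reach stable storage
        os.replace(tmp_path, path)  # atomically move the file into place
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Persist the directory entry created by the rename (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the storage honors fsync, a crash at any point leaves either no file (or the old file) under the final name, or the complete new file, which is the “durably and completely or not at all” property restic relies on. A truncated file under the final name suggests that one of these steps was skipped or not honored by a lower layer.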