Automating repo recovery?

I have a bunch of machines backing up to a single Restic repository. Sometimes a machine gets shut down in the middle of writing a backup. When this occurs, the next restic check usually fails with a bunch of unindexed blobs, ending with Fatal: repository contains errors. This is always recoverable. First I note the trees with blobs that are not indexed, and use restic find --tree ... to find the snapshots that refer to those trees. Then I restic forget those snapshots. Finally I rebuild-index and prune.

This has always worked flawlessly, and I end up with a restored repo. On the other hand, it's a lot of manual work. My question is: is there an official restic workflow for doing this? A standard script, or even a built-in command?
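The manual steps described above could be wrapped in a small script. What follows is only a dry-run sketch in Python, not an official restic workflow: the helper names (find_tree_cmd, cleanup_cmds, run_all) are made up for illustration, and it assumes RESTIC_REPOSITORY and RESTIC_PASSWORD are already set in the environment. It uses only the restic subcommands named in the post.

```python
import subprocess


def find_tree_cmd(tree_id):
    """Command that searches the repository for a given tree ID."""
    return ["restic", "find", "--tree", tree_id]


def cleanup_cmds(snapshot_ids):
    """Cleanup commands, in the order described in the post."""
    return [
        ["restic", "forget", *snapshot_ids],
        ["restic", "rebuild-index"],
        ["restic", "prune"],
    ]


def run_all(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises at the first failing step, so later steps are skipped.
            subprocess.run(cmd, check=True)
```

With the default dry_run=True, run_all(cleanup_cmds([...])) only prints the commands, so a person still reviews which snapshots are about to be forgotten before anything is deleted.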

This should not happen due to an aborted backup! Are you sure that this is the reason? The point is, as long as restic backup does not save the snapshot (which it does at the very end), check may complain about unreferenced blobs, but should not report a broken repo...

No, I'm not sure that aborted backups are the cause. The unindexed blobs are the only symptom that restic reports. This is restic 0.11. That aborted backups are the cause is just my guess. The OS (Ubuntu 20.04) logs no errors, SMART claims that each disk in the array is fine, etc. The disk array is Linux software RAID1, and mdstat reports that it's up and good.

Anything you think I should check?

I'm not sure it's relevant, but I have a question about this: which machine? It's ambiguous whether it's the one hosting the repository or the 'client'. (Although my guess would be the client.)

Sorry . . . yes, the clients are laptops and desktops. The server which hosts the repository is always on (and on a UPS etc.).

IIRC, if it is just unindexed packs, restic check does not print repository contains errors. Can you give the full output of one of your failing check runs (and also which options you used to run it)?

About your original question:

No, there is no way to automate such things yet. The point is, if something goes wrong, you have to expect data loss, and the root cause might be failing hardware. I think both are reasons to make sure sysadmins are informed :wink:

You are right that the - still manually called - repair could be easier. There is this experimental PR:

This not only removes "defect" snapshots, but also creates "repaired" snapshots which try to salvage as much data as possible from the defect ones.

BTW: No need to run rebuild-index if removing the snapshots results in a sane check run. If check doesn't report an error, prune should run and clean up everything.

Thanks for all the info. The output of a failing restic check is at this Dropbox link. This is the end of the output. The entire output is quite long but is just more of the same: blob not found in index, all referring to a small number of trees.

The file EXP.sif cited in the output is a rather large Singularity container (~1.5GB).

There are no other specific options (other than the RESTIC_REPOSITORY and RESTIC_PASSWORD environment variables).
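Since the output is long but refers to only a few trees, the distinct tree IDs can be collected mechanically. The sketch below assumes each relevant line contains the word "tree" followed by a hex ID; that wording is an assumption, not taken from the linked log, and the helper name affected_trees is hypothetical.

```python
import re

# Assumed pattern: the word "tree" followed by a hex ID of 8 to 64 characters.
TREE_ID = re.compile(r"\btree ([0-9a-f]{8,64})\b")


def affected_trees(check_output):
    """Return the distinct tree IDs mentioned in the text, in first-seen order."""
    seen = []
    for match in TREE_ID.finditer(check_output):
        tree_id = match.group(1)
        if tree_id not in seen:
            seen.append(tree_id)
    return seen
```

Each returned ID can then be passed to restic find --tree to locate the snapshots that reference it.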

Thanks for your logs. This indicates that restic was able to save a snapshot file and tree blobs but the contents the trees are referring to are not contained in the index (and maybe not saved at all).

Do you still have this "defect" repo? If yes (or if this occurs another time), can you please only run rebuild-index followed by another check? This tests if it is only the index files that are wrong or if the real data is also missing.

For debugging purposes in general it would also be helpful to create a backup of the index/ folder inside a repository first, before running rebuild-index. In some cases the index still contains useful information which could be dropped by rebuild-index.
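A minimal sketch of that precaution, assuming the repository is a plain local directory and that the copy goes somewhere outside the repository. The function name backup_index and the timestamped folder name are hypothetical choices, not restic conventions.

```python
import pathlib
import shutil
import time


def backup_index(repo_dir, dest_root):
    """Copy <repo_dir>/index to a timestamped folder under dest_root."""
    src = pathlib.Path(repo_dir) / "index"
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = pathlib.Path(dest_root) / f"index-{stamp}"
    # copytree creates dest (and missing parents) and fails if dest already exists.
    shutil.copytree(src, dest)
    return dest


# Commands to run after the copy, as suggested in the post.
DIAGNOSTIC_CMDS = [["restic", "rebuild-index"], ["restic", "check"]]
```

Keeping the copy outside the repository means rebuild-index cannot touch it, and the old index files stay available for comparison.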

The number of missing blobs would require more than 100 pack files to be lost at once, which sounds unlikely. A missing index file would be much more likely. In that case check should also complain about lots of pack files which are not contained in the index.

Thanks. I'll back up index/ next time; experience suggests that there will be a next time.

Do you want me to follow up with a post when that happens, with some particular information?

You are probably right about the index file; what would cause that?

All of this said, restic has been a welcome addition to my workflow, despite these occasional hiccups, and I'm grateful for all the work.

It would be interesting to know when the damaged snapshot was created and when the last successful check/prune run was. That should allow for an educated guess as to whether the damage is caused by a backup run itself or whether some other operation breaks things.

In case you use the rest-server, the race condition fixed by "Atomic file upload and directory sync" (Pull Request #142 in restic/rest-server on GitHub, by MichaelEischer) would be a possible cause. However, in that case restic must have logged a retry for the missing file. Besides that, I'm not aware of a way to lose uploaded files in restic 0.11.0.

This problem HAS NOT reoccurred since upgrading to 0.12.0. Previously, it was approx. monthly. Still monitoring.