How to handle missing data

Hello,
I completely messed up not only my filesystem (which contained both the backups and some of the backed-up data), but also the mirrored backups.

What happened is that before the disaster occurred, I interrupted the transfer mid-sync (rclone sync --delete-before). I would have resumed it at a later time, but oops. So the remote repos have most of the data from previous syncs, but they’re missing any new files, plus any files that were deleted when the latest snapshots were created (because of --delete-before).

What I’m trying to do now is recover as much data as possible. I downloaded all the repositories from my mirrors to a local filesystem, and now my idea is that I can restore the last good snapshot for each repository and then try to copy as much data as possible from the following bad repositories.

The problem is that most restic commands will do this:

Load(<data/b7106512f7>, 0, 0) returned error, retrying after 11.222385136s: open repo/data/b7/b7106512f7e80c2b66c273d8de890fe1087f15fb12c8e7762b08cf4b5920fcd9: no such file or directory
Load(<data/c01717c427>, 0, 0) returned error, retrying after 1.807980427s: open repo/data/c0/c01717c4277e85b1f48feb3dbdeba81569cdfc591851e408941936a325c5d308: no such file or directory
Load(<data/c01717c427>, 0, 0) returned error, retrying after 3.343745266s: open repo/data/c0/c01717c4277e85b1f48feb3dbdeba81569cdfc591851e408941936a325c5d308: no such file or directory
Load(<data/c01717c427>, 0, 0) returned error, retrying after 4.659096946s: open repo/data/c0/c01717c4277e85b1f48feb3dbdeba81569cdfc591851e408941936a325c5d308: no such file or directory

There’s no need to retry. The files are lost forever; they’re not gonna show up no matter how many times restic asks.

For most of my repos, I can just mount them, copy an old snapshot out with cp -a, and if I see it retrying, try again with a smaller set of data. However, that’s way too inconvenient.
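
For reference, that manual approach looks roughly like this (repo path, mount point and target directory are placeholders, and the mount needs FUSE):

# in one terminal: expose the repository as a filesystem
restic -r /path/to/repo mount /mnt/restic
# in another terminal: copy an old snapshot out; "latest" points at the newest one
cp -a /mnt/restic/snapshots/latest/. /recovery/target/
# unmount when done
fusermount -u /mnt/restic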

I’ve run rebuild-index on a copy of the repository and now I’m running check on it. If that works I’ll update this post, hoping it might be helpful to somebody who experiences the same issue.

But in the meantime I’d like to see if anybody has any suggestions on good ways to handle this kind of situation.

Actually, I have another question:

Since some of the repositories are for data that I have not lost, I was wondering what would be the best way to clean them up too and prepare them for new snapshots with good data, while also keeping as many old snapshots as possible.

Thank you

rebuild-index actually did help on one repository, the most damaged one. It looks like it stopped retrying on missing files, and I was able to rsync most of the data out of it. Luckily all the lost files were in ~/.cache.

Wish me good luck for when I restore my postgres db :slight_smile:

I also read this comment; I’ll make sure I mirror my backups this way.

Am I right that you have two mirrors of exactly the same repository (copied one to another using rclone)?
First, if accessing them directly with restic, it’s better to always remove the local cache ($HOME/.cache/restic), because the cache directory is named using the repository ID, which will be the same for both.
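
For example (the repo path is a placeholder), either clear the cache or bypass it for a one-off command:

# drop the shared cache so the two mirrors don't reuse each other's metadata
rm -rf "$HOME/.cache/restic"
# or skip the cache entirely for a single operation
restic -r /path/to/mirror check --no-cache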

If they are two mirrors of the same repo, the following scenario SHOULD work (in any case it’s better to do this on copies of the repositories; a rough command sketch follows below):

  1. Just ‘merge’ these repositories by copying all files from one repo to the other: data to data, snapshots to snapshots, etc. Files with the same name should have the same content. (Do not use Finder or similar tools that just ‘replace’ one directory with another.)
  2. Remove the restic cache directory and run restic rebuild-index on the ‘merged’ repo.
  3. Run restic check, then try restoring the data.
  4. If restic check still complains about missing blobs/trees, it makes sense to back up the remaining part of the source data again with restic backup --force (even if it’s in a different directory now). This will add the missing blobs.
  5. If restic check passes after this, prune should be able to delete the duplicate data.

This will ONLY work if the repositories are just ‘mirrors’ of the same repo (basically the config file should be EXACTLY the same).
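
If it helps, the steps above look roughly like this as shell commands (repoA/repoB and the source data path are placeholders; work on copies):

# work on a copy of the first mirror
cp -a repoA/ repoA-merged/
# 1. merge the second mirror in, never overwriting files that already exist
rsync -a --ignore-existing repoB/ repoA-merged/
# 2. drop the cache and rebuild the index of the merged repo
rm -rf "$HOME/.cache/restic"
restic -r repoA-merged rebuild-index
# 3. check, then try restoring
restic -r repoA-merged check
# 4. only if blobs/trees are still missing: re-read the surviving source data
restic -r repoA-merged backup --force /path/to/source-data
# 5. once check passes, prune removes the duplicates
restic -r repoA-merged prune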

Hi, thanks for the suggestion.
I actually managed to solve all my issues eventually. I didn’t know the cache could cause issues; I actually left it there on purpose, hoping it had cached some data that went missing in the repos.

I’ll explain the setup a little better:

  • I (used to) have the main restic repos on a local filesystem on a hard disk. This is what I would normally use for daily operations such as normal backups, since the disk is faster.
  • Then, every night I had a cronjob that would mirror them (with rclone sync) to two OneDrive accounts.
  • I lost all the content of the hard disk (which had both some of the data and all of the backups), and some of the data on both OneDrives because of the interrupted transfer.
    • The disk is good, it’s just that I like to live on the edge so everything was stored on a filesystem cached with bcache with writeback caching for performance. At some point it decided to detach the cache and the filesystem got badly corrupted.

Here’s what I did eventually to solve the issue:

  • I downloaded the repos from OneDrive to a local fs. First I got the first one with rclone sync, then I used rclone copy to lay the second one on top of it, to recover as much data as possible.
  • I ran check --read-data on every repo, noted all the repos that failed and the reasons.
  • On all the repos that failed, I ran rebuild-index, then check again (without --read-data).
  • I noted all the trees that were reported missing, and ran find --tree TREE on each to find which snapshots they belonged to.
  • I deleted the affected snapshots
  • Ran check again, then prune (a rough command sketch follows this list).
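
For the record, the per-repo commands were roughly these (repo path, tree IDs and snapshot IDs are placeholders):

restic -r /path/to/repo check --read-data     # find the damaged repos
restic -r /path/to/repo rebuild-index         # on each damaged repo
restic -r /path/to/repo check                 # note the missing tree IDs
restic -r /path/to/repo find --tree TREE_ID   # map a missing tree to its snapshot(s)
restic -r /path/to/repo forget SNAPSHOT_ID    # drop the affected snapshots
restic -r /path/to/repo check
restic -r /path/to/repo prune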

This was effective on most repos. I was actually super lucky because it was effective on all repos that contained data I had lost.

I had at least 3 repos, however, that were completely messed up. check would print out ~100 lines of errors. Those I just removed/renamed, created a new repo and moved on with life. That only happened on repos for which I didn’t also lose the data, so I just lost the old snapshots.

Now I’m in the process of re-running check --read-data on all of them. Once that’s done I’ll tag the last snapshot so I know it’s the last one before the disaster, and run a new backup.
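
The tagging will probably be something as simple as this ("pre-disaster" is just the name I’ll pick):

# mark the newest surviving snapshot as the last good one before the disaster
restic -r /path/to/repo tag --add pre-disaster latest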

I already ordered a new hard disk, so this is what I will do from now on:

  • Backups will normally go to the same machine
  • Another machine, with the new hard disk, will pull the repos nightly
    • They will be stored on btrfs, and before pulling them, I will create a snapshot.
    • I will then run integrity checks on the updated repos and, if they’re okay, delete the old snapshot.
  • Once I’m sure the data is safe, I will push it to OneDrive from there, still with --delete-before, because otherwise it fills up my OneDrive account and I’ll have to contact the AD admin to get it fixed. (The nightly job is roughly sketched below.)
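
Roughly what I have in mind for the nightly job on the second machine (all paths, the host and the remote name are placeholders; it assumes the mirror directory is a btrfs subvolume and that RESTIC_PASSWORD_FILE is set):

# keep a read-only safety snapshot of the current state
btrfs subvolume snapshot -r /srv/restic-mirror /srv/restic-mirror-prev
# pull the repos from the main machine (rsync over ssh, or whatever ends up being lightest)
rsync -a --delete main-server:/srv/restic-repos/ /srv/restic-mirror/
# verify every repo; bail out (keeping the safety snapshot) on any failure
for repo in /srv/restic-mirror/*; do
    restic -r "$repo" check || exit 1
done
# everything checks out: drop the safety snapshot and push to the cloud
btrfs subvolume delete /srv/restic-mirror-prev
rclone sync --delete-before /srv/restic-mirror onedrive:restic-mirror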

Thanks for sharing this!

I’m using a similar scheme, with a local NAS with a dedicated HDD for backups plus a remote mirror. So your experience is useful for understanding the weak points and being ready :slight_smile:

It sounds like you were lucky :).

Btw why two OneDrive accounts? I understand when somebody mirrors to two different clouds for an extra copy.

Two accounts just because I have them and I don’t pay for them :slight_smile: My previous company had Google Drive instead, which I would recommend since it’s unlimited (vs. 1TB or 5TB) and it has higher bandwidth limits.

To those wanting to try OneDrive for Business, please beware that if the AD admin has set up a retention policy and you fill up your quota, you can’t delete any files. The official Microsoft solution is to ask the admin to upgrade your account to 5TB, delete the files and then shrink it back. That happened to me, and that’s why I mirror with --delete-before. Now that I’ve found out that restic does not modify existing files, I’ll also find a way to sneak in --immutable, since modified files also count towards the storage retention policy reserve.
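
Something along these lines is what I’m thinking of (paths and remote name are placeholders):

# restic doesn't rewrite existing files, so --immutable should be safe here:
# rclone will refuse to modify anything that already exists on the remote
rclone sync --immutable --delete-before /srv/restic-repos onedrive:restic-repos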

My main problem with this setup, however, is that rclone uses a shitload of CPU to perform uploads, and I use the backup machine for other stuff too (it’s an old mini desktop that I use as my home server). That is why I had killed the mirrors temporarily before the disaster occurred.

So the idea is that I’ll keep one additional local mirror on a hard disk attached to an SBC, likely a Rock64 or a HardRock64 if they release it soon enough. That board is gonna pull the repos from the main server over some CPU-friendly protocol, then upload them to the cloud without killing the main server’s performance in the process.