This is both an idea and something I’m attempting at the moment…
I had a restic repository stored in a Dropbox folder, and I let the Dropbox app handle the syncing. That was, to say the least, a mistake, but rclone was getting limited to about 1-2 MB/s, so the repository was unusable through that backend (roughly the setup sketched below). So I thought I’d be “clever”: keep the folder on an external volume and let the Dropbox app do the syncing. It worked… until it didn’t. I moved the external volume to a new computer and set Dropbox up again. It probably would have worked if the two copies had already been in sync. They weren’t; the cloud copy was out of date. When I set Dropbox up again, it treated the cloud data as the most recent, so it started deleting things while also adding the new snapshots. Then it ran out of space. So now I have a very mangled repository. Luckily it’s all backups of backups, and nothing important was lost. This was always just a failsafe; all my users have their own Time Machine volumes.
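For context, this is roughly what accessing the repository through restic’s rclone backend looked like; the remote name, folder, and connection count are placeholder guesses for illustration, not my actual config:

```
# Hypothetical rclone-backend setup: "dropbox" is an rclone remote, "restic-repo" a folder in it.
# Assumes RESTIC_PASSWORD is set. Raising rclone.connections can help with throughput,
# but in my case it still topped out around 1-2 MB/s.
restic -r rclone:dropbox:restic-repo -o rclone.connections=16 snapshots
restic -r rclone:dropbox:restic-repo -o rclone.connections=16 backup /some/data
```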
Anyway, I created a new repository on SharePoint and have been using that instead. I had given up on the Dropbox repository, but then I had an idea: what would happen if I did a “copy” operation from the broken repository to the new one?
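For reference, the attempt looked roughly like this; the repository paths, remote name, and password file are placeholders, and this is the restic 0.14 --from-repo syntax (older versions use --repo2 instead):

```
# Copy every snapshot from the damaged repo into the new one.
# Target repo password is assumed to come from RESTIC_PASSWORD.
restic -r rclone:onedrive:restic-repo copy \
    --from-repo /Volumes/External/Dropbox/restic-repo \
    --from-password-file /path/to/old-repo-password.txt
```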
Well, pretty quickly I started getting a ton of these:
```
Load(<data/2ae013c2b9>, 0, 0) returned error, retrying after 728.983862ms: <data/2ae013c2b9> does not exist
```
My idea is… what if there were a --skip-missing or --skip-damaged switch, so that when “restic copy” hits missing data it just skips that snapshot and tries copying the next one? That way, if there are any salvageable snapshots, you’d still be able to recover them. Just a thought!
So trying “restic copy” actually caused restic to full-on crash.
I think I’ll just try copying each snapshot individually, something like the loop sketched below. That way, if one copy crashes, the rest will still go through. I did manage to recover a few this way already!
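A minimal sketch of what I mean, assuming the broken and new repositories are local paths and jq is available to pull IDs out of the JSON output; the paths and password file are placeholders:

```
# List every snapshot ID in the broken repo, then copy them one at a time.
# The "|| echo" keeps the loop going when a snapshot fails due to missing data.
# Assumes RESTIC_PASSWORD is set for the target repo.
for id in $(restic -r /broken/repo --password-file broken-repo.txt snapshots --json \
            | jq -r '.[].short_id'); do
    restic -r /new/repo copy \
        --from-repo /broken/repo \
        --from-password-file broken-repo.txt \
        "$id" || echo "skipping snapshot $id"
done
```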
So I originally had 3.4 TB of raw data. Since Dropbox can’t “roll back” to a date, I was forced to restore ALL deleted files, which ballooned the repository to 6.7 TB. I couldn’t operate on it in the cloud at that size, so I had to sync it down to an 8 TB RAID-5 array. With that complete, I was able to rebuild-index and run the repair (roughly the commands below). It worked! I only lost 5 out of 333 snapshots.
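Roughly the sequence I ran once the repository was on local disk; the path is a placeholder, and since the snapshot repair came from the PR build, its exact invocation may have differed from what eventually shipped:

```
# Rebuild the index so it matches the pack files that actually survived the sync mess.
restic -r /mnt/raid/restic-repo rebuild-index

# Rewrite snapshots so they no longer reference missing data; in later restic
# releases the PR's functionality landed as "repair snapshots".
restic -r /mnt/raid/restic-repo repair snapshots --forget
```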
But, speaking of “a bit excessive”… how’s this prune run look to you? haha
Apparently it was serious about the 16777215.999 TiB; at least, it filled up the 8 TB RAID-5 array. I deleted the “repaired” snapshots, correctly suspecting they were the issue, then ran prune with --max-repack-size 50M (see the sketch below). That got things down to a more manageable 5 TB.
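For anyone following along, the cleanup amounted to something like this; the snapshot IDs and repository path are made up:

```
# Drop the snapshots written by the repair run (IDs here are just examples),
# then prune while capping how much data gets repacked in a single pass.
restic -r /mnt/raid/restic-repo forget 1a2b3c4d 5e6f7a8b
restic -r /mnt/raid/restic-repo prune --max-repack-size 50M
```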
Hmm, I only used the PR to repair; I’m using 0.14.0 to prune. I’m guessing the repair command hasn’t made it to a stable release yet. I’ll try one of the beta builds if I run into any actual trouble with it. Thanks!