Data seems to be missing after repack-uncompressed

Dear restic forum,

I migrated one of my restic repos [3] to the new repository format v2 and did multiple prune runs with --repack-uncompressed and --max-repack-size [1]. The last run looked like it completed successfully, however the integrity check now fails with a »Data seems to be missing« message. And indeed, the repo seems to be about 160 GB smaller than expected.

[1] restic prune --cleanup-cache --repack-uncompressed --max-repack-size 150G

Unfortunately I don’t have the log output from the last prune --repack-uncompressed run. Here is the current prune and check output [2].

I would like to ask for help to
a.) reproduce what happened and
b.) restore repo integrity from the data that still exists, if possible.

[2] Prune and check log (~200 MB uncompressed)
[3] Backend: sftp (hetzner storage box)
Repo size: ~730 GB with 109 Snapshots

Greetings
meise

Stupid me: I had another repack-uncompressed cron job running in parallel on a different machine:

Jun 16 04:32:03 fungus backup.sh[22675]: Create backup.
Jun 16 04:32:07 fungus backup.sh[22684]: unable to create lock in backend: repository is already locked exclusively by PID 200026 on oglarun by root (UID 0, GID 0)
Jun 16 04:32:07 fungus backup.sh[22684]: lock was created at 2022-06-16 04:27:08 (4m59.087478651s ago)
Jun 16 04:32:07 fungus backup.sh[22684]: storage ID 25231edf
Jun 16 04:32:07 fungus backup.sh[22684]: the `unlock` command can be used to remove stale locks
Jun 16 04:32:07 fungus backup.sh[22675]: Created backup. Exit status 1.

Jun 16 04:32:07 fungus backup.sh[22675]: Check backup 132/256.
Jun 16 04:32:10 fungus backup.sh[22702]: create exclusive lock for repository
Jun 16 04:32:10 fungus backup.sh[22702]: unable to create lock in backend: repository is already locked exclusively by PID 200026 on oglarun by root (UID 0, GID 0)
Jun 16 04:32:10 fungus backup.sh[22702]: lock was created at 2022-06-16 04:27:08 (5m1.77231585s ago)
Jun 16 04:32:10 fungus backup.sh[22702]: storage ID 25231edf
Jun 16 04:32:10 fungus backup.sh[22702]: the `unlock` command can be used to remove stale locks

Jun 16 04:32:10 fungus backup.sh[22675]: Checked backup. Exit status 1.
Jun 16 04:32:10 fungus backup.sh[22675]: Repack uncompressed data .
Jun 16 04:32:13 fungus backup.sh[22712]: loading indexes...
Jun 16 04:32:21 fungus backup.sh[22712]: loading all snapshots...
Jun 16 04:32:22 fungus backup.sh[22712]: finding data that is still in use for 109 snapshots
Jun 16 04:33:46 fungus backup.sh[22712]: [1:24] 100.00%  109 / 109 snapshots
Jun 16 04:33:46 fungus backup.sh[22712]: searching used packs...
Jun 16 04:33:57 fungus backup.sh[22712]: collecting packs for deletion and repacking
Jun 16 04:39:01 fungus backup.sh[22712]: [5:04] 100.00%  154249 / 154249 packs processed
Jun 16 04:39:01 fungus backup.sh[22712]: to repack:         57586 blobs / 20.000 GiB
Jun 16 04:39:01 fungus backup.sh[22712]: this removes:          0 blobs / 0 B
Jun 16 04:39:01 fungus backup.sh[22712]: to delete:             0 blobs / 167.346 GiB
Jun 16 04:39:01 fungus backup.sh[22712]: total prune:           0 blobs / 167.346 GiB
Jun 16 04:39:01 fungus backup.sh[22712]: remaining:       2717553 blobs / 740.914 GiB
Jun 16 04:39:01 fungus backup.sh[22712]: unused size after prune: 3.746 MiB (0.00% of remaining size)
Jun 16 04:39:01 fungus backup.sh[22712]: deleting unreferenced packs
Jun 16 05:42:48 fungus backup.sh[22712]: [1:03:46] 100.00%  34697 / 34697 files deleted
Jun 16 05:42:48 fungus backup.sh[22712]: repacking packs
Jun 16 05:59:20 fungus backup.sh[22712]: Load(<data/e0488c7d33>, 5689892, 0) returned error, retrying after 552.330144ms: file does not exist
Jun 16 05:59:24 fungus backup.sh[22712]: Load(<data/e0488c7d33>, 5689892, 0) returned error, retrying after 1.080381816s: file does not exist
Jun 16 05:59:29 fungus backup.sh[22712]: Load(<data/e0488c7d33>, 5689892, 0) returned error, retrying after 1.31013006s: file does not exist
Jun 16 05:59:32 fungus backup.sh[22712]: Load(<data/e0488c7d33>, 5689892, 0) returned error, retrying after 1.582392691s: file does not exist
Jun 16 05:59:35 fungus backup.sh[22712]: Load(<data/e0488c7d33>, 5689892, 0) returned error, retrying after 2.340488664s: file does not exist
Jun 16 05:59:37 fungus backup.sh[22712]: Load(<data/e0488c7d33>, 5689892, 0) returned error, retrying after 4.506218855s: file does not exist
Jun 16 05:59:42 fungus backup.sh[22712]: Load(<data/e0488c7d33>, 5689892, 0) returned error, retrying after 3.221479586s: file does not exist
Jun 16 05:59:47 fungus backup.sh[22712]: Load(<data/e0488c7d33>, 5689892, 0) returned error, retrying after 5.608623477s: file does not exist
Jun 16 05:59:53 fungus backup.sh[22712]: Load(<data/e0488c7d33>, 5689892, 0) returned error, retrying after 7.649837917s: file does not exist
Jun 16 06:00:01 fungus backup.sh[22712]: Load(<data/e0488c7d33>, 5689892, 0) returned error, retrying after 15.394871241s: file does not exist
Jun 16 06:00:21 fungus backup.sh[22712]: [17:33] 9.32%  390 / 4186 packs repacked
Jun 16 06:00:21 fungus backup.sh[22712]: Fatal: StreamPack: file does not exist
Jun 16 06:00:21 fungus backup.sh[22675]: Repacked backup. Exit status 1.

»[1:03:46] 100.00% 34697 / 34697 files deleted« does not look good. Why was the repo not locked?

Indeed, that is a problem. It roughly matches what is to be expected from two concurrent prune runs: while one prune is working, the data it has newly written is not yet added to the index, which causes the second prune run to classify that new data as unnecessary and remove it. The first prune run then removes the original data, and taken together this causes data loss. Which parameters did the prune run on fungus use? It looks like it is set to repack only up to 20 GB?
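To make the sequence concrete, a rough timeline (simplified, and assuming neither run saw the other's lock):

1. Prune run A repacks still-used blobs from old pack files into new packs; the new packs are not yet listed in the index.
2. Prune run B loads the index, sees those new packs as unreferenced and deletes them (the »deleting unreferenced packs … 34697 files deleted« step in your log is where this would happen).
3. Prune run A finishes and deletes the old packs it just repacked.
4. The affected blobs now exist in neither the old nor the new packs, so the snapshots referencing them are missing data.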

To repair the repository you can try to follow the instructions at Recover from broken pack file · Issue #828 · restic/restic · GitHub. However, the repository is currently missing lots of data, and probably every single snapshot is affected. If you just want to avoid uploading everything again, it would be sufficient to run rebuild-index, drop the snapshots and then create new ones. If you want to keep the snapshots as far as the data still exists, you can try the experimental PR referenced in the issue.
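A minimal sketch of that path (assuming you are fine with discarding the old snapshots; the snapshot IDs and backup paths are placeholders for your own):

restic rebuild-index
restic forget <snapshot-ID> <snapshot-ID> ...
restic backup /path/to/your/data

rebuild-index makes the index match the pack files that actually still exist, so the new backup run only uploads the blobs that are really missing instead of the whole data set again.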

Are your scripts calling unlock somewhere? The prune command always creates an exclusive lock, so for some reason restic did not see the lock file of the other prune run. However, with SFTP all clients should see the same content of the lock directory. And restic checks twice that no other lock exists, which should prevent quite a few possible race conditions related to locking. I currently have no real idea what exactly caused the lock mechanism to fail.
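For reference, the pattern to look for in backup.sh would be something like this (a hypothetical fragment, not taken from your logs):

restic unlock --remove-all
restic prune --cleanup-cache --repack-uncompressed --max-repack-size 20G

A plain restic unlock only removes stale locks and would not have touched a lock that was created a few minutes earlier on another host, but unlock --remove-all deletes every lock, including the exclusive one held by the prune on oglarun.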