Frequently corrupt snapshots, prune errors, missing clues

Hi,

after neglecting to prune and consistency-check backups for a year, I was unhappy to find out that many, if not all, server repositories have missing data.

As I tried to catch up on the forgetting and pruning, I caught error messages that somehow could have a common reason to why the corruption happened in the first place.

They go like this (prune subcommand):

Apr 04 05:30:48 myhostname resticprune[14692]: using temporary cache in /tmp/restic-check-cache-117577419
Apr 04 05:30:51 myhostname resticprune[14692]: created new cache in /tmp/restic-check-cache-117577419
Apr 04 05:30:51 myhostname resticprune[14692]: create exclusive lock for repository
Apr 04 05:30:51 myhostname resticprune[14692]: load indexes
Apr 04 05:30:58 myhostname resticprune[14692]: check all packs
Apr 04 05:57:04 myhostname resticprune[14692]: check snapshots, trees and blobs
Apr 04 05:58:17 myhostname resticprune[14692]: no errors were found
Apr 04 05:58:17 myhostname resticprune[14692]: [1:12] 100.00%  28 / 28 snapshots
Apr 04 05:58:17 myhostname resticprune[14692]: Finished checking consistency. Exit code: 0
Apr 04 05:58:17 myhostname resticprune[14692]: Removing old data...
Apr 04 05:58:19 myhostname resticprune[14692]: loading indexes...
Apr 04 05:58:21 myhostname resticprune[14692]: loading all snapshots...
Apr 04 05:58:21 myhostname resticprune[14692]: finding data that is still in use for 28 snapshots
Apr 04 05:58:26 myhostname resticprune[14692]: [0:05] 100.00%  28 / 28 snapshots
Apr 04 05:58:26 myhostname resticprune[14692]: searching used packs...
Apr 04 05:58:28 myhostname resticprune[14692]: collecting packs for deletion and repacking
Apr 04 06:27:33 myhostname resticprune[14692]: [29:05] 100.00%  28970 / 28970 packs processed
Apr 04 06:27:33 myhostname resticprune[14692]: used:            245453 blobs / 124.931 GiB
Apr 04 06:27:33 myhostname resticprune[14692]: duplicates:       10644 blobs / 5.558 GiB
Apr 04 06:27:33 myhostname resticprune[14692]: unused:           16845 blobs / 1.819 GiB
Apr 04 06:27:33 myhostname resticprune[14692]: total:           272942 blobs / 132.308 GiB
Apr 04 06:27:33 myhostname resticprune[14692]: unused size: 5.58% of total size
Apr 04 06:27:33 myhostname resticprune[14692]: to repack:        38521 blobs / 11.315 GiB
Apr 04 06:27:33 myhostname resticprune[14692]: this removes      18415 blobs / 5.696 GiB
Apr 04 06:27:33 myhostname resticprune[14692]: to delete:         6192 blobs / 783.231 MiB
Apr 04 06:27:33 myhostname resticprune[14692]: total prune:      24607 blobs / 6.461 GiB
Apr 04 06:27:33 myhostname resticprune[14692]: remaining:       248335 blobs / 125.847 GiB
Apr 04 06:27:33 myhostname resticprune[14692]: unused size after prune: 937.819 MiB (0.73% of remaining size)
Apr 04 06:27:33 myhostname resticprune[14692]: totally used packs:      27758
Apr 04 06:27:33 myhostname resticprune[14692]: partly used packs:         630
Apr 04 06:27:33 myhostname resticprune[14692]: unused packs:              582
Apr 04 06:27:33 myhostname resticprune[14692]: to keep:        25814 packs
Apr 04 06:27:33 myhostname resticprune[14692]: to repack:       2574 packs
Apr 04 06:27:33 myhostname resticprune[14692]: to delete:        582 packs
Apr 04 06:27:33 myhostname resticprune[14692]: repacking packs
Apr 04 07:04:21 myhostname resticprune[14692]: Save(<data/5647688fd6>) returned error, retrying after 626.286518ms: wrote 0 bytes instead of the expected 7420018 bytes
Apr 04 07:06:55 myhostname resticprune[14692]: [39:21] 100.00%  2574 / 2574 packs repacked
Apr 04 07:06:55 myhostname resticprune[14692]: rebuilding index
Apr 04 07:06:58 myhostname resticprune[14692]: Save(<index/449ad00f90>) returned error, retrying after 289.726811ms: wrote 0 bytes instead of the expected 6503207 bytes
Apr 04 07:07:10 myhostname resticprune[14692]: [0:14] 100.00%  27017 / 27017 packs processed
Apr 04 07:07:10 myhostname resticprune[14692]: deleting obsolete index files
Apr 04 07:07:10 myhostname resticprune[14692]: Remove(<index/ac232ebfcd>) returned error, retrying after 547.404299ms: client.RemoveObject: S3 API Request made to Console port. S3 Requests should be sent to API port.
Apr 04 07:07:11 myhostname resticprune[14692]: [0:01] 100.00%  13 / 13 files deleted
Apr 04 07:07:11 myhostname resticprune[14692]: removing 3156 old packs
Apr 04 07:08:06 myhostname resticprune[14692]: Remove(<data/4630239f44>) returned error, retrying after 389.253811ms: client.RemoveObject: S3 API Request made to Console port. S3 Requests should be sent to API port.
Apr 04 07:08:06 myhostname resticprune[14692]: Remove(<data/fd01989b89>) returned error, retrying after 690.271562ms: client.RemoveObject: S3 API Request made to Console port. S3 Requests should be sent to API port.
Apr 04 07:12:25 myhostname resticprune[14692]: [5:13] 100.00%  3156 / 3156 files deleted
Apr 04 07:12:25 myhostname resticprune[14692]: done
Apr 04 07:12:25 myhostname resticprune[14692]: Finished restic prune. Exit code: 0
Apr 04 07:12:25 myhostname resticprune[14692]: Exiting with summary exit code 0

Another one from a different host:

Apr 04 18:06:17 myotherhost systemd[1]: Started Restic prune.
Apr 04 18:06:17 myotherhost resticprune[2368346]: Restic prune started.
Apr 04 18:06:17 myotherhost resticprune[2368346]: Pruning backup sets...
Apr 04 18:06:21 myotherhost resticprune[2368347]: loading indexes...
Apr 04 18:06:22 myotherhost resticprune[2368347]: loading all snapshots...
Apr 04 18:06:23 myotherhost resticprune[2368347]: finding data that is still in use for 4 snapshots
Apr 04 18:06:28 myotherhost resticprune[2368347]: [0:04] 100.00%  4 / 4 snapshots
Apr 04 18:06:28 myotherhost resticprune[2368347]: searching used packs...
Apr 04 18:06:32 myotherhost resticprune[2368347]: collecting packs for deletion and repacking
Apr 04 18:08:05 myotherhost resticprune[2368347]: List(data) returned error, retrying after 582.280027ms: Client disconnected before response was ready
Apr 04 18:11:22 myotherhost resticprune[2368347]: Save(<lock/508d4d1d5d>) returned error, retrying after 400.45593ms: wrote 0 bytes instead of the expected 176 bytes
Apr 04 18:21:22 myotherhost resticprune[2368347]: Save(<lock/b504e93526>) returned error, retrying after 391.517075ms: wrote 0 bytes instead of the expected 177 bytes
Apr 04 18:26:22 myotherhost resticprune[2368347]: Save(<lock/1664c76374>) returned error, retrying after 430.435708ms: wrote 0 bytes instead of the expected 177 bytes

The storage endpoint is a self-hosted Minio S3 implementation, which shows no errors in the log. I don’t think that Minio per se is not suitable, as they are referencing Restic as example use case.

Now the obvious questions is: What could be the reasons for these errors? How could they be tracked down and prevented?

Furthermore: Why is Restic returning 0 as return code after several errors leading to data loss have been occured? Is this a bug?

Kind regards

You don’t have errors here. You seem to have connection problems (maybe a unstable connection? or temporary issues with your minio?) and then restic is retrying and successfully finishing the operation.

2 Likes

Please always include information which restic version you’re using. Based on the log output I can tell, that it is 0.12.0, which means that you’re using an outdated version. Please upgrade to at least 0.12.1 to get the actual error message instead of the wrote 0 bytes instead ... output.

client.RemoveObject: S3 API Request made to Console port. S3 Requests should be sent to API port. that error indicates that something is seriously wrong with the minio setup. You can try whether a more recent restic version changes something about that, but it looks like a problem of the minio setup.

2 Likes

Oh and please provide an example of the errors reported by check. Without it we cannot investigate the cause of the damaged snapshots.

2 Likes

Hello,
thanks to your thoughtful reply, I reviewed the storage setup and was able to find a misconfiguration. I am pretty confident that this was causing errors. Rebuilding the backup sets is still ongoing, but it’s looking good. Thanks!