Restic prune crashing regularly

After getting an error from restic check that recommends running restic prune, the prune itself regularly crashes. I am probably doing something incorrectly, but what?

Here is the scenario: I have multiple clients pushing to a single server using the ssh backend with restic 0.9.6. I perform a weekly restic check and roughly once a month see something like:

successfully removed locks
using temporary cache in /tmp/restic-check-cache-181103040
created new cache in /tmp/restic-check-cache-181103040
create exclusive lock for repository
load indexes
check all packs
pack 29d6ed70: not referenced in any index
pack 0f8ccb44: not referenced in any index
pack 1b995fb7: not referenced in any index
pack 2e917135: not referenced in any index
pack 05c9b990: not referenced in any index
5 additional files were found in the repo, which likely contain duplicate data.
You can run restic prune to correct this.
check snapshots, trees and blobs
error for tree 9f4af876:
tree 9f4af876cd81452b78841e5a4998b9b26aa80e2e2c498da3e558e461fd247956 not found in repository
Fatal: repository contains errors

I then, probably incorrectly, run a restic prune and get the following:

repository f913a463 opened successfully, password is correct
counting files in repo
building new index for repo
[16:09] 100.00% 80258 / 80258 packs
repository contains 80258 packs (2692991 blobs) with 379.562 GiB
processed 2692991 blobs: 11 duplicate blobs, 5.412 MiB duplicate
load all snapshots
find data that is still in use for 102 snapshots
tree 9f4af876cd81452b78841e5a4998b9b26aa80e2e2c498da3e558e461fd247956 not found in repository
github.com/restic/restic/internal/repository.(*Repository).LoadTree
/restic/internal/repository/repository.go:713
github.com/restic/restic/internal/restic.FindUsedBlobs
/restic/internal/restic/find.go:11
main.pruneRepository
/restic/cmd/restic/cmd_prune.go:191
main.runPrune
/restic/cmd/restic/cmd_prune.go:85
main.glob…func18
/restic/cmd/restic/cmd_prune.go:25
github.com/spf13/cobra.(*Command).execute
/restic/vendor/github.com/spf13/cobra/command.go:762
github.com/spf13/cobra.(*Command).ExecuteC
/restic/vendor/github.com/spf13/cobra/command.go:852
github.com/spf13/cobra.(*Command).Execute
/restic/vendor/github.com/spf13/cobra/command.go:800
main.main
/restic/cmd/restic/main.go:86
runtime.main
/usr/local/go/src/runtime/proc.go:203
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357

This situation can be corrected by a restic rebuild-index, after which restic prune and restic check complete successfully.

Is there some way of preventing this from happening in the first place?

Thanks.

You could have underlying storage or RAM problems which are presenting as repo corruptions. When you run a restic check are you using --read-data? This is very important if there is even suspicion of corruption in the repo.

Thanks, yes. The first few times this happened, I ran both restic check --read-data and an offline memtest successfully. But I don’t do this every time since it is a 400GB repo and the read-data check takes many hours. I’ll run one now for good measure.

[4 hours later] Yes, the restic check --read-data completed without error.

The check output is actually a bit confusing: the reference to restic prune refers to the “not referenced in any index” lines. The tree error later on indicates a separate problem. prune is designed not to remove anything in case some data seems to be missing. And yes, a more reasonable error message would be much better.

I guess the server stores the backup repository (a single one or one per client?) on a local disk? Did any errors / unusual things happen during the backup run?

“not referenced in any index” should only be caused by interrupted backups, but in that case there would be no snapshot referring to those packs, no tree error in check and no crash in prune. Restic by design (should) never add a reference to a tree while the corresponding pack is not yet included in the index. So this is either a hardware problem, unexpected behavior of the repository storage, or a bug in restic.

Could you take a look at the creation date of the packs that were not referenced in the index (e.g. pack 29d6ed70 would be at data/29/29d6ed70...)? My guess would be that these were created at roughly the same time. Working backwards from that it would be rather helpful to know which restic command (backup/prune/…) was running at that time.
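
For example, something along these lines would show when those five packs were written (with /path/to/repo standing in for the actual repository path):

# /path/to/repo is a placeholder for the actual repository path
for p in 29d6ed70 0f8ccb44 1b995fb7 2e917135 05c9b990; do
    ls -l /path/to/repo/data/${p:0:2}/"$p"*
done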

Yes, the restic backups get run every four hours on the client machines to a single server. There are three clients. One of the clients is a server that’s only powered when needed. The other two are laptops which may be suspended. The server is always on.

There is no data/29/29d6ed70... or any of the other packs listed in the error message from restic check. Could it be that the subsequent prune removed them? I could correlate the timestamps on the logs generated by systemd, but I don’t have a pack to reference.

It is possible that one of the clients was suspended during a backup. Other than that, there were no ‘incidents’; e.g. the server has been stable for months.

The good news with all of this is that restic has been recoverable each time this has happened. But it makes me nervous, and takes time to fix. From reading the forum, I thought that I understood that restic could handle the interrupted backup (i.e. from a suspend). Perhaps not? Should I be checking for an active restic process before suspending?

Of course, I can’t rule out a hardware problem, although, as far as I can tell, only restic is affected. I’d think that restic check --read-data is a fairly hard workout for 4 hours and would make (e.g.) memory fail if it were prone to that.

That would be possible. Could you try to locate the snapshot for tree 9f4af876[...] using restic find --tree 9f4af876cd81452b78841e5a4998b9b26aa80e2e2c498da3e558e461fd247956? Maybe that tree still exists and can tell us which client created it? I forgot to mention earlier that there’s one more interesting thing about that tree: it’s the root tree node, that is, the last tree which restic creates during a backup run.

As your clients back up over ssh (using sftp?), suspending a client should cleanly cut off a backup run at a certain point, as the sftp backend is currently unable to reconnect. However, in that case the client would also leave a lock file behind, which you’d notice on the next check / prune run.

In theory (that is, excluding potential bugs) it’s safe to interrupt any restic command at an arbitrary point without losing data or having to rebuild the index. The rebuild-index and prune commands could leave lots of stale files behind, but that should be the worst that can happen. In the case of a backup run, restic (hopefully) only writes the snapshot once all data included in that snapshot and the attached indexes have been safely stored. An aborted backup run can leave some data behind, which can be picked up by a subsequent backup run, but only if that data was stored properly.

As your repository just seems to miss some packs in the index, but is fully valid after rebuilding the index, this doesn’t look like a hardware problem to me right now. Did you run restic check --read-data on the server hosting the repository or on all clients? A hardware problem could show up on either side. However, the usual symptoms would be some sort of data corruption, but for your repository it just seems like an index is missing.

Thanks for the info. I did try to locate the tree using restic find --tree but it is not found. Yes, the clients are using the sftp backend.

Yes, I did run restic check --read-data for all clients: all clients are writing to the same repo to take advantage of dedup.

I also do not suspect hardware at this point, although anything is possible, I suppose. The problem, when it occurs, always has the same pattern: a failed check which rebuild-index corrects. I suppose, next time, I’ll rebuild-index first. Does that make sense?

Running rebuild-index before prune, after check has complained, should work as a workaround. I’d really like to investigate what’s happening here, if that’s possible for you, as having to run rebuild-index indicates a problem somewhere.
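
For reference, the workaround sequence would then look roughly like this (repository and password options omitted):

restic check            # complains about unreferenced packs and a missing tree
restic rebuild-index    # rebuild the index from the packs present in the repo
restic prune            # now completes instead of crashing
restic check            # verify that the repository is consistent again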

Yes, I did run restic check --read-data for all clients: all clients are writing to the same repo to take advantage of dedup.

I’m not sure whether we’re talking about the same thing: your reply sounds to me like you ran the restic check [...] command once on the server and passed it the repository that contains the data of all clients. That would only stress test the server. With “on all clients” I had in mind running the command once on every single client and letting each of them check the full repository. That would stress test one client after the other. That doesn’t mean that I’m asking you to run check on all clients: I currently don’t think that such a stress test would provide many insights.

Yes, that is correct. I ran restic check --read-data on the server. I can try a few of the clients. It will take some time . . . I’ll report back.

Results: restic check --read-data completed without errors on all three clients and the server.

So I guess we just have to wait until the error shows up again and then try to find out what happened. Sorry for the trouble with your backup.

I’m very thankful for restic. It’s a great concept and despite these issues, it has served me well for 18 months.

If there is a bug, I’d like to help as best I can. I’m not a Go programmer (although I’ve done the Go tutorial just to get the flavor), but I can provide info.

So, my question now is: what should I do to provide helpful information when this happens again? This is the 3rd or 4th time that this same issue has occurred (especially as my repo has grown; it’s up to 400GB), so I’m guessing it’s just a matter of waiting.

Of course, if there is a problem with some of my hardware, I’d like to find that too.

Some of these suggestions may not be appropriate depending on the server’s use, when you have access, etc. To check that it’s not a RAM issue, do a MemTest+ run for as long as is convenient. I have, on rare occasions, only found RAM errors after three successful passes.

Check the vendor-specific and SMART (if available) attributes of your storage devices and run their entire suite of diagnostic checks. System logs may show storage faults: filesystem checks being required, reports of failed writes, or controller timeouts. Look for evidence of other I/O-intensive tasks failing or reporting unusual faults.
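
As a sketch of the kind of storage checks meant here, assuming smartmontools is installed and /dev/sda stands in for your device:

smartctl -a /dev/sda           # print SMART health status and attributes
smartctl -t long /dev/sda      # start an extended offline self-test
smartctl -l selftest /dev/sda  # review the self-test log once it completes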

Locks, as discussed, are a good way of finding out whether backup operations have indeed been interrupted. Perhaps run a regular script to check for client connections as well as the existence of lock files. Any locks present when there are no client connections could be the result of an interruption.
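
A minimal sketch of such a check, assuming the repository lives at /path/to/repo, the clients connect via sshd on port 22, and the restic credentials are available in the environment:

#!/bin/sh
# Report lock files that exist while no client is connected over ssh.
locks=$(restic -r /path/to/repo list locks 2>/dev/null)
connections=$(ss -tn state established '( sport = :22 )' | tail -n +2)
if [ -n "$locks" ] && [ -z "$connections" ]; then
    echo "lock files present with no active client connections:"
    echo "$locks"
fi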

Somewhat more drastic/thorough methods could be disabling caching on storage devices, swapping out a device for a replacement to see if the problem is resolved (which would also allow for write-tests on any removed devices). Depending on your hardware you could try swapping out data cables and using different ports for storage devices.

The first step would be to find out when the problem was caused. When check starts to complain again about a missing tree along with packs that are not referenced in any index, then please take a look at the following (a command sketch follows the list):

  • Look up the timestamps when the unreferenced packs were created. Or maybe just run ls -laR /path/to/repo on the whole repository and then pick out the interesting parts.
  • Then also locate the snapshots affected by the missing tree(s) using restic find --tree <missing-tree-id>. Restic should complain that it can’t load the missing tree for some snapshots. Look up the affected snapshot(s) and take note of the backup (start) time and the host. Also locate the snapshots in the backup repository (repo/snapshots/<id>) and check when these were created; this tells us when the backup run finished. Were the packs which are missing from the index created at the beginning or the end of the backup run, or at a completely different time?
  • Check the logs of the client that created the problematic snapshots for anything unusual. Was the client maybe suspended during the backup run? Or did restic complain about a backend error after the snapshot was already created?
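
Collected in one place, a rough sketch of the commands involved (paths, unit names, and IDs are placeholders to adapt):

ls -laR /path/to/repo > repo-listing.txt                     # timestamps of every file in the repo
restic -r /path/to/repo find --tree <missing-tree-id>        # which snapshots reference the missing tree
restic -r /path/to/repo snapshots                            # backup start times and hosts
ls -l /path/to/repo/snapshots/                               # when each snapshot file was written
journalctl -u <restic-backup-unit> --since "<backup start>"  # client-side logs around that backup run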

Are the disks on your server connected to a RAID controller card? Years ago, I ran into problems with a certain make of RAID controller that had a firmware bug and wrote data blocks to the wrong physical sectors on disk. This error occurred at random times and was reproduced on both Windows Server and Linux. This kind of “hardware” issue can also explain the problems you are seeing.

Wouldn’t such RAID controller bugs permanently damage files in a repository? As the error here is that some packs are just missing from the repository index, you’d need a hardware bug that manages to just let an index file vanish without causing any other filesystem or data corruption.

No, although I have had the same problems with hardware RAID cards. I am running software RAID 0. But I have checked all the SMART parameters and run offline disk diagnostics, and everything checks out. I have also run memtest86 overnight (6 passes) without any errors.

@MichaelEischer: I have not forgotten about this and have been monitoring regularly. There has not been another failure, but I did change one thing, and it leads to something I’d like to run by you:

  1. When I was just starting with restic, I copied and adapted some systemd scripts for running restic tasks (backup, clean, etc.). In those scripts, before the main restic command, there were restic unlock commands to prevent the unattended backups from failing unnecessarily (see the sketch after this list).
  2. Three weeks ago, I commented out those commands in my scripts on all clients. My thinking or hunch was: could it be that restic unlock is actually unlocking an active lock rather than a stale lock?
  3. Since doing that, I have not seen another failure; e.g. no problems running restic prune and no issues found by restic clean, which is run at least once per week.
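
A hypothetical sketch of the kind of script described above; the paths, repository address, and unlock placement here are illustrative, not the original setup:

#!/bin/sh
# Hypothetical backup script run by a systemd service/timer.
export RESTIC_REPOSITORY=sftp:backup@server:/path/to/repo
export RESTIC_PASSWORD_FILE=/etc/restic/password

# The line that has since been commented out on all clients:
#restic unlock

restic backup /home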

Maybe it is too early to say since the error is intermittent, but is it possible that restic unlock occasionally removes active locks? And if so, could it lead to the weird index corruption that I was experiencing?

Or maybe this is a crazy thought.

FYI I run a restic unlock (without --remove-all) and restic cache --cleanup before all of my restic operations that are automated or scripted, across many different systems and workloads. I’m yet to encounter any frequent or recurring problems.

The unlock theory reminds me of https://github.com/restic/restic/pull/2391 (currently only part of the master branch), without which it’s indeed possible to remove active locks via unlock. If an active lock is deleted you would, however, be able to tell from the logs. The active restic run would in that case print lots of Remove(<lock/0123456789>) returned error, retrying after 3.1415s.

If the lock of an active prune run is deleted, then a concurrent backup could add new packs and indexes, with the latter (only the indexes) possibly getting deleted at the end of the prune run (or, more precisely, the prune run might miss some concurrently added packs while rebuilding the index). The linked pull request would only be an explanation if the prune run takes longer than 30 minutes.

@ProactiveServices: if the prune and backup runs are scheduled not to overlap with each other, then you can’t run into problems. If they overlap, then you should see backup or prune fail from time to time, as the repository could still be locked when the command is started.
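
As an illustration, hypothetical cron entries that keep the two apart (script names are placeholders, and the comments state the timing assumptions):

# Backups at 00:00, 04:00, 08:00, ... on every client:
0 */4 * * *  /usr/local/bin/restic-backup.sh
# Weekly prune on the server at 00:30 on Sunday, assuming the 00:00 backup
# finishes within 30 minutes and the prune finishes before 04:00:
30 0 * * 0   /usr/local/bin/restic-prune.sh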
