Restic check/prune fault-tolerant?

tommy · June 16, 2021, 1:47am

Hi there,

I am thinking of running Restic check/prune for a number of Restic repos in S3 on ECS Fargate Spot. As a Fargate Spot Task may be interrupted at any time, I am wondering if it is safe for check and prune operations to run on it.

alexweiss · June 16, 2021, 4:33am

As check (mainly) only reads data from the repository, it is safe to interrupt it.

prune is designed to add new files first and only then delete the files that are no longer needed, so interrupting it will not leave a broken repository. However, rerunning prune after an interruption may need extra effort (CPU time, traffic, and requests to the storage) to complete. If you do this regularly, you should try out

github.com/restic/restic

prune: Handle duplicate blobs more efficiently

restic:master ← aawsome:prune-handle-duplicates (opened by aawsome, 04:06PM - 19 Feb 21 UTC, +90 -33)

What does this PR change? What problem does it solve?

Enhances treatment of duplicates during `prune`. Now an algorithm is implemented that marks each duplicate either as "used" or as "unused" (such that pack files containing only duplicates should be marked as completely unused and hence can be removed). As a side effect, `prune` now can keep duplicates if unused space is allowed by `--max-unused` and the statistics are accurate even if not all packs with duplicates are repacked.

Was the change discussed in an issue or in the forum before?

closes #3114 (at least all points except the treatment of duplicates where one of the duplicates is damaged)

which will be able to speed things up a bit.

That said, there is an issue with lock files, as both check and prune exclusively lock the repo. So you might have to manually remove the locks after an abort.
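For a task that may have been killed on its previous run, one option is to clear leftover locks before starting the maintenance command. The sketch below is only an illustration: `maintenance_commands` and `run_maintenance` are hypothetical helper names, and the repository URL in the usage note is a placeholder. It relies on `restic unlock`, which by default removes only locks that restic itself considers stale, so a lock left behind by a killed task may not be removed immediately.

```python
import subprocess


def maintenance_commands(repo, operation):
    """Build the restic invocations for one maintenance run (hypothetical helper).

    The first command removes stale locks, the second runs the operation itself.
    """
    if operation not in ("check", "prune"):
        raise ValueError("operation must be 'check' or 'prune'")
    base = ["restic", "-r", repo]
    return [base + ["unlock"], base + [operation]]


def run_maintenance(repo, operation):
    """Run the commands in order, stopping at the first failure."""
    for cmd in maintenance_commands(repo, operation):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_maintenance("s3:s3.amazonaws.com/my-bucket", "prune")` would run `restic unlock` followed by `restic prune` against that placeholder repository. `restic unlock --remove-all` removes every lock, but that is only safe when you are sure no other restic process is using the repository.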

cdhowie · June 16, 2021, 7:16pm

Generally speaking, I believe all restic operations are safe to interrupt, as they follow an add-first-delete-last policy such that interrupting at any point leaves the repository in a consistent state, though possibly with redundant data. Data is also added before anything that depends on it: for example, file leaf nodes are added before the tree they are contained in, the root tree for a snapshot is added before the snapshot that points to it, and pack indexes are not uploaded until the respective packs are uploaded.

Absent any bugs, the only effects of an interrupted command should be that there is possibly redundant data added to the repository which can later become referenced by backup or removed by prune.
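To make the ordering argument concrete, here is a toy in-memory model. It is not restic's actual code or data format; the names `planned_uploads`, `upload`, and `is_consistent` are made up for illustration. Objects are written dependencies first (data, then the tree, then the snapshot), and the "repository" is consistent as long as every stored reference resolves to a stored object.

```python
import hashlib


def object_id(data):
    """Content-derived ID, standing in for a real repository's hashes."""
    return hashlib.sha256(data).hexdigest()


def planned_uploads(files):
    """Return (kind, id, references) steps, ordered dependencies first."""
    steps = []
    leaf_ids = []
    for name in sorted(files):
        leaf = object_id(files[name])
        steps.append(("data", leaf, []))
        leaf_ids.append(leaf)
    tree = object_id("".join(leaf_ids).encode())
    steps.append(("tree", tree, leaf_ids))
    snapshot = object_id(b"snapshot:" + tree.encode())
    steps.append(("snapshot", snapshot, [tree]))
    return steps


def upload(steps, interrupt_after):
    """Apply only the first `interrupt_after` steps, simulating an interruption."""
    repo = {}
    for kind, oid, refs in steps[:interrupt_after]:
        repo[oid] = (kind, refs)
    return repo


def is_consistent(repo):
    """True if every reference held by a stored object points at a stored object."""
    return all(ref in repo for _kind, refs in repo.values() for ref in refs)
```

With this ordering, `is_consistent(upload(steps, k))` holds for every cut-off point `k`; the worst case is unreferenced data that a later run can reuse or clean up. Uploading the same steps in reverse order and stopping after the first one leaves a snapshot that points at a missing tree.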

The orphaned lock files, as pointed out by @alexweiss, are really the only concern as these affect the operation of the repository. According to the AWS ECS documentation:

When tasks using Fargate Spot capacity are stopped due to a Spot interruption, a two-minute warning is sent before a task is stopped. The warning is sent as a task state change event to Amazon EventBridge and a SIGTERM signal to the running task.

As long as the SIGTERM reaches restic, it will clean up the lock file before exiting. If restic is not running as PID 1 inside the container, whatever is running as PID 1 will need to forward the signal to restic.
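If the container entrypoint is a wrapper script rather than restic itself, the forwarding could look roughly like the sketch below. It is a minimal illustration, not a recommended production entrypoint: `run_with_signal_forwarding` is a hypothetical name, and the script assumes it runs as the container's main process with the restic command passed as arguments.

```python
import signal
import subprocess
import sys


def run_with_signal_forwarding(cmd):
    """Run cmd as a child process, relay SIGTERM/SIGINT to it, return its exit code."""
    child = subprocess.Popen(cmd)

    def forward(signum, _frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # Handlers stay installed afterwards; this is meant for a short-lived entrypoint.
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, forward)

    # wait() resumes after a handler has run, so the child gets time to clean up.
    return child.wait()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example invocation: python3 entrypoint.py restic -r <repo> prune
    sys.exit(run_with_signal_forwarding(sys.argv[1:]))
```

In a shell entrypoint, ending the script with `exec restic ...` has a similar effect, because restic then replaces the shell process and receives the signal directly.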

tommy · June 16, 2021, 9:05pm

Thank you guys, that’s very insightful.