Restic check/prune fault-tolerant?

Hi there,

I am thinking of running Restic check/prune for a number of Restic repos in S3 on ECS Fargate Spot. As a Fargate Spot Task may be interrupted at any time, I am wondering if it is safe for check and prune operations to run on it.

As check (mainly) only reads data from the repository, it is safe to interrupt it.

prune is designed to add new files first and only then delete files that are no longer needed, so interrupting it will not leave a broken repository. However, resuming an interrupted prune may need extra effort (CPU time, traffic, and requests to the storage) to complete. If you do this regularly, you should try out

which will be able to speed things up a bit.

That said, there is an issue with lock files, as both check and prune exclusively lock the repo. So you might have to manually remove the locks after an abort.
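One way to handle leftover locks without manual intervention might be to run `restic unlock` at the start of each task. Below is a minimal sketch in Python, assuming restic is on the PATH and that credentials and the repository password come from the environment; the function names and the bucket name in the usage note are made up for illustration.

```python
import subprocess


def maintenance_commands(repo, remove_all_locks=False):
    """Build the restic command lines for one maintenance run.

    Plain `restic unlock` only removes locks that restic considers stale.
    `--remove-all` removes every lock, including those of processes that
    are still running, so it is off by default here.
    """
    unlock = ["restic", "-r", repo, "unlock"]
    if remove_all_locks:
        unlock.append("--remove-all")
    return [
        unlock,
        ["restic", "-r", repo, "check"],
        ["restic", "-r", repo, "prune"],
    ]


def run_maintenance(repo, remove_all_locks=False):
    """Run unlock, check and prune in order, stopping at the first failure."""
    for argv in maintenance_commands(repo, remove_all_locks):
        subprocess.run(argv, check=True)
```

Usage would be something like `run_maintenance("s3:s3.amazonaws.com/example-bucket")`. Only pass `remove_all_locks=True` if you are sure nothing else is working on that repository at the same time.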

Generally speaking, I believe all restic operations are safe to interrupt because they follow an add-first-delete-last policy: interrupting at any point leaves the repository in a consistent state, though possibly with redundant data. Data that other data depends on is also added first. For example, file leaf nodes are added before the tree that contains them, the root tree for a snapshot is added before the snapshot that points to it, and pack indexes are not uploaded until the respective packs are uploaded.

Absent any bugs, the only effect of an interrupted command should be redundant data left in the repository, which a later backup can end up referencing or a later prune can remove.
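To make the ordering argument concrete, here is a toy model in Python. It is not restic's code or data format; the dict layout and all names are invented for illustration. It writes dependencies first and checks that no prefix of the write sequence contains a reference to a missing object.

```python
def dangling_references(repo):
    """Return (referrer, target) pairs whose target is missing from the repo."""
    missing = []
    for snap, tree in repo["snapshots"].items():
        if tree not in repo["trees"]:
            missing.append((snap, tree))
    for tree, blobs in repo["trees"].items():
        missing += [(tree, blob) for blob in blobs if blob not in repo["blobs"]]
    return missing


def backup_steps():
    """Writes for one tiny backup, dependencies first."""
    return [
        ("blobs", "blob-a", b"file contents"),  # file data
        ("trees", "tree-1", ["blob-a"]),        # tree referencing the data
        ("snapshots", "snap-1", "tree-1"),      # snapshot referencing the tree
    ]


def interrupted_states(steps):
    """Yield a copy of the repo as it would look after each interruption point."""
    repo = {"blobs": {}, "trees": {}, "snapshots": {}}
    yield {kind: dict(objs) for kind, objs in repo.items()}
    for kind, name, value in steps:
        repo[kind][name] = value
        yield {kind: dict(objs) for kind, objs in repo.items()}


# Every prefix of the dependency-first order is free of dangling references.
consistent = all(not dangling_references(s) for s in interrupted_states(backup_steps()))
```

Feeding the steps in reverse order (snapshot first) makes `dangling_references` report a missing tree after the very first write, which is exactly the situation the add-first ordering avoids.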

The orphaned lock files, as pointed out by @alexweiss, are really the only concern as these affect the operation of the repository. According to the AWS ECS documentation:

When tasks using Fargate Spot capacity are stopped due to a Spot interruption, a two-minute warning is sent before a task is stopped. The warning is sent as a task state change event to Amazon EventBridge and a SIGTERM signal to the running task.

As long as the SIGTERM reaches restic, it will clean up the lock file before exiting. If restic is not running as PID 1 inside the container, whatever is running as PID 1 will need to forward the signal to restic.
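If the container entrypoint is a wrapper script, the forwarding could look roughly like the sketch below. It is written in Python and assumes the wrapper runs in the main thread of PID 1; the function name is made up, and the command to run (for example `["restic", "prune"]`) is passed in by the caller.

```python
import signal
import subprocess


def run_forwarding_signals(argv):
    """Run argv as a child process and relay SIGTERM/SIGINT to it.

    Returns the child's exit code (negative if it was killed by a signal).
    """
    child = None

    def forward(signum, _frame):
        # Pass the signal on so the child can clean up before exiting
        # (for restic, that means removing its lock).
        if child is not None:
            child.send_signal(signum)

    # Install the handlers before starting the child and remember the old ones.
    previous = {
        sig: signal.signal(sig, forward)
        for sig in (signal.SIGTERM, signal.SIGINT)
    }
    try:
        child = subprocess.Popen(argv)
        # wait() resumes after a handler has run, so this blocks
        # until the child has actually exited.
        return child.wait()
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

An entrypoint would then end with something like `sys.exit(run_forwarding_signals(sys.argv[1:]))`. If the wrapper is a shell script instead, starting restic with `exec` so that it replaces the shell avoids the need for forwarding altogether.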

Thank you guys, that’s very insightful.