Risks/consequences of prune when backup in progress or vice versa

jeetsukumaran · June 16, 2018, 5:12pm

I would like to schedule the backup snapshots independently from the forget/prune operations. E.g., the backups to take place hourly, but the prunes to take place once a day. Or, backups every four hours and prunes once a week. Etc.

From the scheduler standpoint, the simplest would be to consider these two separate services to be scheduled independently. However, if set up naively this does entail the risk of one operation starting while the other is ongoing. I could try and space them out, but there is no guarantee that, e.g., a particularly heavy backup or prune won't take much longer than expected (or that the internet connection won't be slow, etc.).

My question is: will restic handle this possible situation robustly/safely/gracefully, or will I have to take special measures to ensure that this never happens? I know that restic creates a lock file, but following numerous examples on the web, my backup script routinely clears the lock file before backing up or pruning, so this failsafe will not be available to restic. I can, however, have the script itself create its own lock so that it knows not to call prune/backup if another instance of it is already running. Alternatively, maybe I should not clear the locks in the first place? Or, I suppose, if the prune/backup clash is handled robustly (in the way that multiple independent backups to the same repos are), I would not need to worry about it?
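
The script-level lock mentioned above can be sketched with flock(1) from util-linux. This is a minimal sketch and not part of restic: the lock file path and the run_exclusive function name are hypothetical, and it assumes that backup and prune are both launched through the same wrapper on the same machine.

```shell
#!/usr/bin/env bash
# Sketch: serialize backup and prune runs with a script-level lock.
# LOCKFILE and run_exclusive are hypothetical names; flock(1) is from util-linux.

LOCKFILE="${LOCKFILE:-/tmp/restic-driver.lock}"

# run_exclusive CMD [ARGS...]
# Runs CMD while holding an exclusive lock on $LOCKFILE.
# Returns 75 without running CMD if another instance already holds the lock;
# otherwise returns CMD's own exit status.
run_exclusive() {
    (
        # Non-blocking attempt to lock file descriptor 9.
        flock -n 9 || exit 75
        "$@"
    ) 9>"$LOCKFILE"
}
```

A driver script would then call, for example, `run_exclusive restic backup /home` every few hours and `run_exclusive restic forget --prune --keep-daily 7` once a day; whichever starts second is skipped instead of touching the repository.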

matt · June 16, 2018, 5:18pm

Restic requires an exclusive lock on the repository to perform a prune, and it can’t get that while a backup is running. So, schedule away: restic will not perform unsafe operations on the repo.

Why? That sounds unsafe. If you have a stale lock, sure, go ahead and remove it; but don’t clear the locks while an operation is running on the repo.

jeetsukumaran · June 16, 2018, 5:31pm

Thanks, @matt.

How does my driver script know if a lock is stale or not?

Both the backup and prune scripts incorporate logic like this:

https://github.com/erikw/restic-systemd-automatic-backup/blob/master/usr/local/sbin/restic_backup.sh#L49-L51


# Remove locks from other stale processes to keep the automated backup running.
restic unlock &
wait $!

i.e., they both clear locks automatically before beginning.

Is this a bad idea?

If so, how do I handle stale locks?

jeetsukumaran · June 16, 2018, 5:35pm

For reference, here is the script that I am running:

gist.github.com

https://gist.github.com/jeetsukumaran/61ff0033360174cda99ed3b444ba6dac

bu

#!/usr/bin/env bash
#
# bu: Backup data to repository.
#
# Usage:
#
#   bu [OPTIONS] <PATH/TO/BACKUP/CONFIGURATION>
#
# Type 'bu --help' for help on 'bu' options.
#

My plan is to schedule it to run once every 4 hours with the “--no-trim” (=backup, but with no forget + no prune) option, and once a day with the default forget and prune. As you can see, ‘restic unlock’ is called unconditionally in either case. Not advisable?

fd0 · June 16, 2018, 7:10pm

For the reference: restic unlock will only remove locks it considers stale, which are locks that are older than ~5 minutes. When a process (e.g. backup or prune) is running, the lock is replaced every 5 minutes, so this is still safe.

However, I advise people to rather find the source of the lock files in the repo, so that unlock isn’t needed in a script.

jeetsukumaran · June 16, 2018, 7:46pm

Thanks, @fd0.

Would it be safer, then, to replace

if [[ $IS_UNLOCK || $IS_BACKUP || $IS_FORGET_AND_PRUNE ]]
then
    unlock
fi

with

if [[ $IS_UNLOCK ]]
then
    unlock
fi

in the script? This way, the script will only call unlock when explicitly requested by the user, and dealing with stale locks becomes a user responsibility? The danger here is the possibility that, in the event of a stale lock, multiple backups might be missed until the user intervenes (which may be a while).

matt · June 17, 2018, 6:42am

Oh! This is awesome, I didn’t know this! Thanks for this tid-bit.

fd0 · June 17, 2018, 7:29am

Ah, I was wrong, the docs state what’s happening here:

https://restic.readthedocs.io/en/latest/100_references.html#locks

The field exclusive defines the type of lock. When a new lock is to be created, restic checks all locks in the repository. When a lock is found, it is tested if the lock is stale, which is the case for locks with timestamps older than 30 minutes. If the lock was created on the same machine, even for younger locks it is tested whether the process is still alive by sending a signal to it. If that fails, restic assumes that the process is dead and considers the lock to be stale.
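
The rule quoted above can be written out as a small decision function. This only illustrates the documented rule and is not restic's code: the function name is_stale, its argument order, and the use of `uname -n` for the host name are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Illustration of the staleness rule from the quoted documentation.
# Not restic's implementation; names and arguments are hypothetical.

STALE_AFTER=$((30 * 60))   # 30 minutes, in seconds

# is_stale LOCK_TIME_EPOCH LOCK_HOST LOCK_PID [NOW_EPOCH]
# Returns 0 if the lock would count as stale, 1 otherwise.
is_stale() {
    local lock_time="$1" lock_host="$2" lock_pid="$3"
    local now="${4:-$(date +%s)}"

    # Rule 1: the lock's timestamp is older than 30 minutes.
    if [ $(( now - lock_time )) -gt "$STALE_AFTER" ]; then
        return 0
    fi

    # Rule 2: a younger lock created on this machine whose process is gone.
    # kill -0 delivers no signal; it only checks that the PID can be signalled
    # (so it also fails for a live process owned by another user).
    if [ "$lock_host" = "$(uname -n)" ] && ! kill -0 "$lock_pid" 2>/dev/null; then
        return 0
    fi

    return 1
}
```

Under this rule, a fresh lock held by a live local process is never stale, which is why a prune started during a running backup is refused rather than allowed to proceed.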

jeetsukumaran · June 17, 2018, 2:12pm

Ah, so this implies that the process of clearing locks should not or at least rarely be needed at all?

If a new operation begins that potentially conflicts with an ongoing one (e.g., a prune vs backup), then the lock will not be considered stale, because it is recent or because the process that created it is still alive, and the new operation will fail.

Conversely, if a new operation starts and finds a lock that is genuinely stale, given its age or the process not being found or both, then it will clear/override the lock and proceed?

So, either way, things Just Work (correctly) without the client needing to call unlock?

fd0 · June 17, 2018, 6:39pm

That’s correct, normal (non-interrupted) operations should not leave locks in the repo.

For safety reasons, that’s not done automatically. But it could be, someday. Running restic unlock without further parameters won’t remove locks it does not consider stale.

In most cases, yes.

matt · June 17, 2018, 7:20pm

This has been really helpful. So, as I understand it, stale locks should usually only occur when an operation is interrupted and restic can’t clean up the lock. Unlock is thus mostly useful after an interrupted operation or a network failure or something.

jeetsukumaran · June 17, 2018, 7:57pm

Makes sense.

So, perhaps the client script logic should be:

restic ....
if [[ $? == 1 ]]
then
       restic unlock
fi

so that failed operations clear the lock? Similarly, should an unlock be issued from the driving script if it catches a signal interrupt?

jeetsukumaran · June 19, 2018, 12:09am

It occurs to me that one problem with the “if [[ $? == 1 ]]; then; restic unlock; fi;” logic is that if the restic process fails because of a valid lock from some other ongoing (live) process, then this will clear it. Which we don’t want to do.
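
One way to guard against that, sketched under the assumption that this machine is the only one using the repository, is to skip the unlock whenever another restic process is still alive locally. The function names run_restic and other_restic_running are hypothetical, and `pgrep -x restic` assumes the binary's process name is exactly `restic`. Since restic unlock without further parameters is described in this thread as removing only stale locks, the extra check is belt and braces rather than a requirement.

```shell
#!/usr/bin/env bash
# Sketch only: assumes a single machine uses the repository.
# run_restic and other_restic_running are hypothetical helper names.

# True if any restic process is currently alive on this host (pgrep is from procps).
other_restic_running() {
    pgrep -x restic > /dev/null
}

# run_restic ARGS...
# Runs restic with the given arguments. After a failure, stale locks are
# cleared only when no other restic process is alive on this machine.
# Always returns restic's own exit status.
run_restic() {
    local rc=0
    restic "$@" || rc=$?
    if [ "$rc" -ne 0 ] && ! other_restic_running; then
        restic unlock
    fi
    return "$rc"
}
```

With this wrapper, a backup that fails because a prune holds the repository leaves that prune's lock alone, while a failure with no restic process left running still triggers the cleanup.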

Dj0k3 · September 1, 2018, 9:55pm

I’m trying to improve my own script, and this question (whether or not to run unlock in the script) has been on my mind. So I made a testing repo to watch restic run, and, if I’m right, at the end of a restic command there are not supposed to be any locks left in the repo. That’s what I saw, but I don’t know whether I’m totally right about this. Assuming that is true and that your repo (in my case, my testing repo) contains just one host (just one machine backing up to that repo), then we can do something like this at the end of the script:

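# Run unlock only if lock files are left in the repository (local repos only).
# compgen -G succeeds when the glob matches at least one path; unlike
# [ -e glob ], it does not break when several lock files exist.
if compgen -G "$RESTIC_REPO/locks/*" > /dev/null; then
  restic unlock > /dev/null 2>&1
fi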

Am I right about this? What I’m trying to tell the script is: if there is any lock left at the end of all processes, then run unlock. I was just running restic unlock at the beginning of my script, but if every process creates a lock and then removes it at the end, I don’t see why I should unlock at the beginning anymore. So, logic tells me that if my script is the only thing using that repo and nothing else is running, then at the end there should not be any lock files, and it should be safe to run unlock if any lock is present in the repo. I would really appreciate it if someone could tell me whether this is right or not.

fd0 · September 2, 2018, 9:11am

Exactly, leftover lock files should only occur when restic was aborted forcefully (i.e. not by pressing ^C, which will clean up locks).

You can do that, but you can also just run restic unlock which will do nothing if no locks are present.

That’s correct.

Usually you shouldn’t need to run unlock at all…

Dj0k3 · September 5, 2018, 1:17am

Thank you @fd0 for your answers. I was worried that I was doing something wrong by running unlock, but in my case, with just one machine in one repo, I think it can’t hurt to use it in case my system kills the process for some reason or my laptop powers off because it isn’t charged. Thanks again, I’m really enjoying using Restic with all my machines, and so far it is the best backup program I’ve tested.

Fun fact: I used to have an 80GB repo with Borg, and the same amount of data is now 60GB using Restic. Again, the same amount of data, same “keep snapshots” setup, and I really can’t understand why, because as far as I know Borg uses compression and Restic doesn’t. Thanks for your work and for making it free and open.

fd0 · September 6, 2018, 7:13am

So, the restic repo is smaller and contains the same data? That’s a bit odd, restic indeed does not support compression (yet), and borg does it by default as far as I know. Strange.

764287 · September 6, 2018, 7:47am

AFAIK borgbackup uses different default values for the chunker. According to this text the chunker parameters can have quite some impact on the repository size. Maybe borgbackup’s default values are somehow unfitting for your data set. Even so, the 33% difference in your case really is a lot.

Dj0k3 · September 7, 2018, 5:31pm

Yes, it is a bit odd indeed. I have a little more data now than I had at that time. The only “big difference” could be a VM (I didn’t remember that until now) that I had at the time for testing, but it was about 10GB and it was deleted anyway, so the total amount of data should have remained the same, because the VM was deleted and I’m pretty sure it was also deleted from the repo.

It could be. I was doing backups with the default parameters. I didn’t think about that, though. I have a lot of directories and small files. The tree output says I have 6,930 directories and 57,476 files (that’s for everything); excluding hidden stuff and including just the directories I work with every day, there are 1,531 directories and 7,591 files. Most files are documents smaller than 3MB, except for some books and backup files (SQLite databases and compressed databases) that are not more than 60MB each.

The total amount for backups with Restic (48 snapshots) is 2.32TiB using restic stats and 65.46GiB using restic stats --mode raw-data.

Quick question: when using rsync to replicate a Restic repo to another HDD, is it enough to use rsync -rvah, or is there a better way to sync? I’m asking because I saw an error 23 in the rsync output and I don’t really know what that is.

Torslanda · September 17, 2018, 11:37am

I have the same issue as the OP. I have two scripts scheduled: one daily which runs backup and one weekly which runs a forget/prune and check. My backup script began to run when first scheduled, but because there is a lot of data it will take days/weeks for this to upload to the repository. In the meantime, my first scheduled weekly prune script ran last night but appears to have cleared the lock. Based on comments in this thread I thought it should be safe to leave the ‘unlock’ command in as part of my prune script; the reasoning for this is that if there’s still a backup in progress there ought to be an active lock and therefore, as indicated above, the unlock command should not really unlock it. If on the other hand the backup process had been interrupted for whatever reason, I wanted the stale lock to be removed so that scripted operations could continue.

However, it looks like it has removed my active lock whilst the backup was running, based on the output of my prune script (see below) and my script is now complaining that there are packs not referenced in any index and that my repository contains errors. I therefore fear that it has removed the lock, begun to prune whilst my backup was in progress, and corrupted my backup, meaning that I’ll need to start it over again. Can anyone tell me what’s happening here and whether my interpretation is correct?

Thank you!

The output from my prune script is as follows:

    backup-purge.service - Purge backup snapshots
       Loaded: loaded (/etc/systemd/system/backup-purge.service; disabled; vendor preset: enabled)
       Active: failed (Result: exit-code) since Mon 2018-09-17 00:01:41 BST; 14ms ago
         Docs: https://restic.readthedocs.io/en/latest/
      Process: 20767 ExecStart=/usr/local/bin/restic check (code=exited, status=1/FAILURE)
      Process: 20643 ExecStart=/usr/local/bin/restic forget --prune --keep-daily 7 --keep-weekly 4 --keep-monthly 12 (code=exited, status=0/SUCCESS)
      Process: 19911 ExecStartPre=/usr/local/bin/restic unlock (code=exited, status=0/SUCCESS)
     Main PID: 20767 (code=exited, status=1/FAILURE)

    Sep 17 00:01:40 excalibur restic[20767]: pack 7d84f01110fdd5aa1c870b566df824be95f0e9a90d85868d4fd420ad8c3ba44a: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 381c899ce3ec93df101695927a733915f59ad68fd534daaa25f6dd9f5e38dbc2: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack c4ab339ae3811f36226eb8d85ee852e6ce2bc44f9eca7c381f94fe25c963ebd4: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack df9e32435edda75a7860d4864ab75ff4770839b8c2c4cde2de39fef9f12780ff: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 6277660112318d3c556340e5cd64b5ae22ab8ab35b4c64bd93957643f5e0877c: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 1a06eb07f82e360e939c77ae93ba458a21e0daaf8b49ddbd3668e7237e5da1cb: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack e66a7abdc8feb268a764e83cdca175d8c8a5e73431cd2b2ef28400232c5720b1: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack cdbeff1ff35a6c3e44f84e985518af33bd3c13b0adf9946fb5e3e1bc51176035: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack de013ac4ebbf370dd4988ad8d70e25a0c7b0447108ff7257abf1d40b1551516f: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 9d66847e58888825270e9899017bbe9bd583778b306f8fa91bc2318cab62706f: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 2c5fcaf6b0123e1608a21764810d116621f9579dca9acb15ca81d93497aa28d7: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 2d14752028d9337e36fda338e96f474c57c7598df7e1d0fd2db000632bc9377a: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 5d9fd87dcc11371a0134fefbd552c198953125200e716ca81df167a791b3bc6a: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack df4a5d5828295be965191be2f465cf05b3a952a469fec58f3105f059b569f849: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack b9b0963b513e6f5ccf3ca64e657dd0c1f21140a42167ad7d9d999124bcd56ea5: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 4122eb29f8ba85e468d299bf3245c4073729f974bbae2d4f75fd062610fe95c6: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 5b2f78e2b4ba70a86dce6cc8b3c5b9066fa49d3e1e17677fe201b86181231f3e: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack be56b8b98e9e0c646d119439d650fe378a8dabb3b6020ca70640faba7bfcaa45: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 0e2d9dc5c06e0b788d72324eeb5b40a3ecc72560fca5640e33d7bc99936d47e4: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack e9737f843c56d4440d4e1bb0419d5eed41f001dd73c63a9f3a9e8c8a7e049281: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack e425398dad99926889fc08bff245acb3dde22f64cee2111446bcbeb3435e1a87: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 4703928d0ee4dd36635d847c8d7d43df8587ac872de744580dd44a838e0c3873: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 1e0a39162500dff6c8598f309ba4e0a1c057e0185e88c1a6a534990a3f3ff0ee: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: pack 329b891e7e906745de56624c66963bfeb9a2b7fe8bca1494f79e449061182196: not referenced in any index
    Sep 17 00:01:40 excalibur restic[20767]: check snapshots, trees and blobs
    Sep 17 00:01:41 excalibur restic[20767]: Fatal: repository contains errors
    Sep 17 00:01:41 excalibur systemd[1]: backup-purge.service: Main process exited, code=exited, status=1/FAILURE
    Sep 17 00:01:41 excalibur systemd[1]: backup-purge.service: Failed with result 'exit-code'.
    Sep 17 00:01:41 excalibur systemd[1]: Failed to start Purge backup snapshots.
    Sep 17 00:01:41 excalibur systemd[1]: backup-purge.service: Triggering OnFailure= dependencies