Restic backup behaviour when interrupted - S3 backend

Hi,

I’m using restic to back up our databases to an S3-compatible object storage bucket in the cloud. In most cases, the database is one single file, size > 10 GB.

As I’m backing up directly to the cloud over the internet, I’m wondering what happens if the internet connection suddenly drops or is interrupted during a backup. What would you recommend as the surest way to check whether the backup was successful?

To start the backup, I’m running a “restic backup …” command: will this always return an error in such a case? Or do you recommend adding a “restic check …” command after each backup to be sure everything went fine?

Btw, my experience with restic has been really great so far! I’m using it on Windows Server machines from PowerShell scripts.

Thanks in advance!

If a backup is interrupted you can just restart it - restic will only upload the parts of the data that haven’t already been uploaded.

When restic finishes a backup, it will tell you that it created a snapshot and what the new snapshot’s ID is. I would scan the output for this information and only conclude that the backup finished when I find it. That should be enough; there’s no need to run a check just to know whether a backup finished, as that would get expensive in the long run.
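A minimal sketch of scanning the output for that line, assuming restic prints “snapshot <id> saved” on success (check the exact wording your restic version uses):

```python
import re

# restic prints a line like "snapshot 69b0e0ec saved" on success;
# the exact wording is an assumption, verify it against your version.
SNAPSHOT_RE = re.compile(r"snapshot ([0-9a-f]{8,}) saved")

def find_snapshot_id(output: str):
    """Return the snapshot ID from restic backup output, or None."""
    match = SNAPSHOT_RE.search(output)
    return match.group(1) if match else None

print(find_snapshot_id("...\nsnapshot 69b0e0ec saved\n"))  # 69b0e0ec
```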

@rawtaz Thanks for the info - didn’t know you can just restart it!

ATM I’m checking restic’s exit code: if it’s non-zero, I consider the backup failed. Would that be enough? I can add a regex match on the output to find the snapshot ID too if needed.

Yes, if it’s not zero you can restart.
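If you drive restic from a script, an exit-code-based restart loop might be sketched like this (the command line, attempt count, and delay are placeholders, not restic-specific values):

```python
import subprocess
import time

def backup_with_retries(cmd, attempts=3, delay=60):
    """Run a backup command, restarting it on a non-zero exit code.

    `cmd` is the full command line, e.g. ["restic", "backup", "/data"]
    (paths and flags here are placeholders for your own setup)."""
    for attempt in range(attempts):
        if subprocess.run(cmd).returncode == 0:
            return True   # restic exited cleanly, a snapshot was created
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before restarting the interrupted run
    return False
```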


Thanks for the replies! I did more testing yesterday and my findings about backup resiliency are very good:

  • if your internet connection suddenly drops, restic keeps retrying and recovers automatically
  • if your backup source disappears during backup, restic stops; restarting the backup after reconnecting the source didn’t continue where it left off, but it nicely backed everything up again; duplicate data blobs that had already been written were removed with prune

All together a very good and resilient backup tool!

Btw, adding a --json option to the restic backup command would be very handy (e.g. to run it non-interactively from a script). The --json option already works for the snapshots and stats commands, which is very powerful for scripting.

Since restic 0.9.5, the --json flag is actually supported by the backup command and produces a stream of JSON messages.
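A sketch of pulling the snapshot ID out of that stream, assuming the final summary message carries message_type "summary" and a snapshot_id field (verify the field names against the output of your restic version):

```python
import json

def snapshot_id_from_json_stream(lines):
    """Extract the snapshot ID from `restic backup --json` output.

    Assumes the final summary message has message_type "summary" and a
    snapshot_id field; check your restic version's actual output."""
    for line in lines:
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines in the stream
        if msg.get("message_type") == "summary":
            return msg.get("snapshot_id")
    return None
```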

It’s generally recommended to run the check command from time to time to ensure that no data corruption (e.g. due to hardware problems) has sneaked in.

Just trying to understand the behavior here. Is there any confirmation when you restart a backup (say, after a Ctrl-C) that it is resuming an aborted one? When I do that, it just looks like it’s starting from the beginning. Also, does it hold a lock on the repo for an aborted backup? When I try to run a check on it, it fails with a fatal error saying there is a previous lock on it.

It is not re-starting or continuing the previous backup - it has no knowledge of it.

It does a new backup run, but it will find that some of the data was already uploaded, and not upload that again. It will scan files the same way, but simply not upload the blocks that were already uploaded.
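As a toy illustration of why the restarted run skips data (this is not restic’s actual implementation, which uses content-defined chunking and a repository index): blocks are identified by their content hash, and any block whose hash is already in the repository isn’t uploaded again.

```python
import hashlib

def upload_new_blocks(blocks, stored_hashes):
    """Upload only blocks whose content hash isn't in the repo yet.

    `stored_hashes` stands in for the repository index; restic's real
    chunking and indexing are far more sophisticated than this."""
    uploaded = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in stored_hashes:
            stored_hashes.add(digest)
            uploaded.append(block)   # only genuinely new data is sent
    return uploaded
```

On a restart, the second call sees the hashes from the first run and only uploads what is new.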

It’s possible that there’s an old lock that was never deleted. To remove it, you can simply run the restic unlock command (with appropriate flags for your repository, of course).


It is not re-starting or continuing the previous backup - it has no knowledge of it.

Ok, I see. Perhaps that would be a nice feature to add: ideally, some sort of progress file would be written that lets a future backup start from where the last one left off. Duplicacy has this feature. An interrupted backup doesn’t have to restart from the beginning.

I can see it being pretty difficult and slow if you were trying to back up something large to a remote repository, the connection kept failing, and it had to redo everything. It’s nice that it won’t have to actually send data it already sent, though.

For example I’m doing around 1 TB backup just to an external drive and it was taking 24+ hrs when I had to cancel (computer crashed…) and now it starts back at the beginning.

This can also impact the speed of other applications while it’s running, so I was just doing a shell suspend (Ctrl-Z), but that might not always be possible (and it might still hold a lot in memory?).

Is there a workaround or strategy that helps with this?

Also regarding the locks. A new backup has no problem with obtaining the lock (without manual intervention), so somehow it does have some knowledge of the previous backup attempt.

All that has to be “done again” is the scanning of the files. This will just load your disk a bit, which can be a problem or not (on most systems it’s not an issue, but if you have a slow disk or filesystem, sure).

Is this an actual problem for you? What are you backing up from (the disk/filesystem where the files are scanned)? The absolute majority of the backup process will normally be spent sending the data, the scanning isn’t that heavy.

Your 24+ hour backup will just scan the files and start sending data where it was cancelled.

Well, restic would be able to see the lock file, that’s true. Won’t do much with it though.

It is my whole $HOME folder and some files changed since the previous scanning. Perhaps that is the problem?

I don’t know what to expect here in terms of how long scanning should take, but it was around 10 minutes and the % indicator was only around 4%. Previously around 650 GB of 1 TB had been transferred.

I was just backing up an internal HDD to an external HDD via USB.

The progress percentage is calculated from processed_bytes/total_bytes. Until the first backup is finished, restic has to read every file again. 4% in 10 minutes equals roughly 70 MB/s, which seems somewhat reasonable for an internal HDD. What surprises me a bit is that your backup of 1 TB took more than a day to back up to an external HDD (I would have expected something around 5-10 hours).
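The arithmetic behind that estimate, assuming a dataset of roughly 1 TB (the exact size in this thread is a bit larger):

```python
# Rough throughput estimate: 4% of a ~1 TB dataset read in 10 minutes.
total_bytes = 1_000_000_000_000  # ~1 TB, an assumption for this estimate
seconds = 10 * 60

mb_per_s = total_bytes * 0.04 / seconds / 1_000_000
print(round(mb_per_s))  # 67, i.e. roughly 70 MB/s
```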

There’s already an issue in the bug tracker for restarting interrupted backups, see #2280.

Old lock files are mostly relevant for the prune and check commands. These require exclusive access to the repository which is checked using the locks. Multiple backup operations don’t conflict with each other, so these locks don’t interfere.


What surprises me a bit is that your backup of 1 TB took more than a day to back up to an external HDD (I would have expected something around 5-10 hours).

Not sure what I can say there. I’m going to try again and I’ll report back. Does the type of file have any drastic effect on slowing it down? Of the 1TB most of it is just a few very large files (big HDF5 files of lots of arrays).

For large files, restic should be able to back up as fast as it can read/write from the HDDs. If a backup is slow, it’s usually because it contains lots of small files or because “uploading” to the backup repository is slow. You could try setting the environment variable GOMAXPROCS=2 to see whether it helps. That limits restic to 2 CPU cores, but for any somewhat modern CPU that should be more than enough to keep the HDDs busy.
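One way to do that from a script, sketched in Python (the helper name is made up for this sketch; only the GOMAXPROCS variable itself comes from the suggestion above):

```python
import os

def restic_env(max_procs=2):
    """Copy the current environment with GOMAXPROCS capped.

    Pass the result as env= to e.g.
    subprocess.run(["restic", "backup", "/data"], env=restic_env())."""
    env = dict(os.environ)
    env["GOMAXPROCS"] = str(max_procs)  # limit Go's CPU usage
    return env
```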

After my restart I got this report:

Files:       1656403 new,     0 changed,     0 unmodified
Dirs:            1 new,     0 changed,     0 unmodified
Data Blobs:  114624 new
Tree Blobs:      2 new
Added to the repo: 82.489 GiB

processed 1656403 files, 1.295 TiB in 8:26:46
snapshot 69b0e0ec saved

Like I said, 650 GB had already been copied. Not sure how the math adds up here though… I think I need to prune it.

You could check whether the target drive uses SMR, which can have a huge impact on write performance. Many external drives use SMR because the manufacturers know that users don’t expect external drives to be fast.