Large backup not completing -- no output

Hi everyone,

I am trying to do a fairly large backup (55 TiB) overnight in stages, but it’s not completing. I am using systemd timers to run the backup command (via resticprofile) every night at 20:00 and to kill the service at 06:00, as this is a production system that needs to be usable throughout the day.

Over the past few weeks, S3 has come to report 55 TiB in the bucket, but each morning I check the logs and restic hasn’t finished; it just gets terminated at 06:00. If I run restic snapshots there are no snapshots, and restic stats reports that there is no data there.
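For reference, the checks I mean are along these lines (the repository URL is made up here):

restic -r s3:s3.example.com/backup-bucket snapshots
restic -r s3:s3.example.com/backup-bucket stats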

It’s been a while since I last did a backup that worked successfully, so I can’t even remember what the output is meant to look like – but it appears it’s doing… nothing?

If it helps, this is a Gluster volume, so originally I suspected slow metadata operations, but I don’t think that’s the issue: one night I ran an ls over the whole directory first and it completed within an hour.
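The metadata walk I mean was roughly this (the mount point is made up here):

# ls -lR calls lstat() on every entry, which is roughly the metadata access
# pattern restic's change detection needs
time ls -lR /mnt/gluster/media > /dev/null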

Does anyone have any ideas on how to diagnose the issue or get it going? Any suggestions would be greatly appreciated!

I am not familiar with resticprofile, but your setup looks OK-ish. As a small thing, I’d suggest adding --no-scan to it, which helped me in slow-access situations.
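On the plain restic CLI this is the --no-scan flag; a sketch, with a made-up repository and path, assuming a restic version recent enough to have it:

# --no-scan skips the upfront pass that only estimates the total backup size
restic -r s3:s3.example.com/backup-bucket backup /mnt/gluster/media --no-scan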

I did have that at one point. I had a lot more parameters, but stripped them all back in case any caused the issue. I’ll give it another go, as it can’t do any harm (hopefully), ty.

55 TiB in 10 hours works out to roughly 1.6 GiB/s, so I suspect a 10-hour window is simply not enough for restic to get through all the data.
The data chunk uploads will be taken care of, since restic will not re-upload any data it has already transferred in previous runs.

But as long as the first backup (including the scan of all files) has not completed, no snapshot will be created, and restic will rescan all files the next time you run it.

If you run stock restic with the standard scan enabled, it will show you an estimate of how far along it is and how much still needs to be done.
That should give you a good idea of whether restic can complete in 10 hours. Example:
[screenshot of restic's progress output showing the completion estimate]

If the answer to that is no, you have a few options, in no particular order:

  1. Run restic once during the day and let it finish; you could pick a weekend for this.
  2. Break the entire backup down into smaller chunks that can each be completed in 10 hours, then use the
     --parent <previous_snapshot_id>
     option to chain each backup to the previous one, which prevents rescanning of all data (see the sketch below this list). Once the final part is done you can continue with normal backups without that option.
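A rough sketch of the chaining, with a made-up repository, paths and snapshot ID:

# first part; note the snapshot ID restic prints at the end (say 1a2b3c4d)
restic -r s3:s3.example.com/backup-bucket backup /data/part1
# chain the next part to the previous snapshot; repeat for the remaining parts
restic -r s3:s3.example.com/backup-bucket backup /data/part2 --parent 1a2b3c4d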

Hi,

Thanks for the detailed response! You’re right that it can’t complete in one go. My understanding was that if killed, restic will resume where it left off, without scanning all the files all over again. Is this not the case? Perhaps that only applies once it has done an initial backup?

My strategy has been to just kill the backup at 06:00 and continue again at 20:00.

We are running it over weekends too, the whole time, and it still hasn’t finished. It does the same thing in the output, where just nothing happens.

Option #2 you mentioned does sound interesting. How would I go about doing this? I assume it’s all one repository; I’d split the files into, say, 10 directories, but then how would I back those up so that once all 10 are done they can be merged?

Thanks!
Chris

P.S. If it’s indeed the case that restic must scan all files every time until the first backup completes, and only after that can it be killed and resumed, then that’s probably where the problem lies. The weekend schedule for the initial backup is Friday 20:00 to Monday 06:00, which is 58 hours.

(55 000 GB / (58 h × 3600 s/h)) × 8 ≈ 2.1 Gbit/s. We achieve almost that to S3, but clearly not quite haha. So perhaps S3 is reporting 55 TiB, but it’s actually NOT finished, just so, so close…

Is this possible?

Not sure what requirements you have, but is it an option for you to create a filesystem snapshot (I think GlusterFS can do that via LVM) and run a restic backup of that snapshot without interruption?
The snapshot ensures consistency of the source data.
If CPU and network bandwidth are your concern, you could throttle the restic CLI with nice and ionice to an acceptable degree and let it run until it completes.
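For example, something along these lines (repository and path are placeholders):

# nice lowers CPU priority, ionice -c 3 puts the disk reads in the idle I/O class
nice -n 19 ionice -c 3 restic -r s3:s3.example.com/backup-bucket backup /mnt/gluster/media
# restic can also cap its own upload bandwidth (value is in KiB/s),
# e.g. add --limit-upload 102400 for roughly 100 MiB/s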


@Chris1,
relevant manual section:
# Backing up - File change detection

relevant cmdline help info:

restic backup --help|grep "\-g"
  -g, --group-by group                         group snapshots by host, paths and/or tags, separated by comma (disable grouping with '') (default host,paths)
      --parent snapshot                        use this parent snapshot (default: latest snapshot in the group determined by --group-by and not newer than the timestamp determined by --time)

Splitting goes like this, in pseudo code:

restic -r <your-repo> backup <data-root>/subfolderN --group-by host
# repeat for all subfolders, then back up <data-root>/ itself as the last step
restic -r <your-repo> backup <data-root>/ --group-by host

Running your regular data-root/ path as the last step is important. Once that is done, all backup paths are known and incrementally included in the latest snapshot. You can verify it with
restic -r <your-repo> snapshots
Then you can run a subsequent ‘normal’ backup without --group-by host and enjoy a drastically faster backup.

P.S. You could also use --parent <snapshot-ID>, but --group-by host is simpler.


Hi all,

Thanks for the suggestions. In the future, doing an LVM snapshot (or equivalent) for the first backup and running it all in one go would be a good idea for data consistency. Very good idea, wish I’d thought of it haha.

Equally, doing it in chunks and merging later would be a very good way if volume snapshots aren’t possible.

But… somehow this weekend it finished? This makes me think it was indeed the case that, for the first backup, restic doesn’t resume but starts scanning all over again, and this time we happened to hit the magic 2.1 Gbit/s to S3 that we needed. Can someone confirm this is restic’s behaviour?

Thanks all for your brilliant ideas!

Oh and also, it still had no output? Perhaps it was doing some sort of consistency or file check scan thing before finishing the first backup?

@Chris1 good that you got it to finish. Curious to hear what the speed of your next backup run is; it should go a lot faster.

You mention 2.1 Gbit/s to S3, but that is not S3 bandwidth; it is the local I/O speed required to read and process all the data that has to be backed up. Of course, if you have pending data it will add to this time, as those uploads still have to be completed. This is likely why it could not complete before: only now was almost all (or all) of the data already transferred into your bucket.

So in a way restic does not resume

  • restic will read all data from the start until a (parent) snapshot is established.

and in a way it does resume

  • restic will not re-upload any data to the backend repository that was already uploaded in previous sessions (regardless of whether they finished or not).

I cannot help with the lack of output, since you are using resticprofile and not native restic.

Hi @GuitarBilly

Last night it once again didn’t finish (I just logged in before 09:00 and had to kill it). I believe restic checks metadata first to decide whether a file may have changed. Looking at iftop, it’s doing about 4 Gbit/s from the storage, so it is reading many files. If that’s the case it’ll never finish overnight, as it would need to sustain about 10 Gbit/s, which, on this Gluster volume with hundreds of small files per directory, is unlikely lol.

Thanks for the clarification on the re-uploading behaviour, i.e. it will not upload any data that is already there, but it may still read all the data to establish which files need to go and which don’t.

It appears restic uses the following to decide whether it needs to scan (I assume that means read) a file:

  • Modification timestamp (mtime)
  • Metadata change timestamp (ctime)
  • File size
  • Inode number (internal number used to reference a file in a filesystem)

The files in this share are all managed by a media asset manager. I’ll see if I can find out what it actually does to the files. I assume the inode and file size remain the same. If I use --ignore-ctime then only the mtime would have to match. Let’s just hope the software isn’t changing the mtime every time someone accesses a file…
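If I go down that route, I guess the command would look something like this (repository and path are placeholders):

# --ignore-ctime makes the change detection rely on mtime, size and inode only
restic -r s3:s3.example.com/backup-bucket backup /mnt/gluster/media --ignore-ctime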

A bit of a crude hack to finish the first backup:

  • Run restic in tmux
  • Each morning, press Ctrl-Z to pause the process
  • In the evening, type fg to resume
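If you want to script that instead of doing it by hand, the same pause/resume can be done with signals from another shell; the pgrep pattern below is an assumption, adjust it to your actual command line:

# roughly what Ctrl-Z / fg do, but scriptable from cron or a systemd timer
kill -STOP "$(pgrep -f 'restic backup')"   # morning: pause the backup process
kill -CONT "$(pgrep -f 'restic backup')"   # evening: resume it

Whether the repository locks and S3 connections survive a pause that long is another question, though.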

Sorry I’m in a hurry and can’t elaborate more at the moment, but maybe this helps