Can I change pack-size after first backup?

Hello guys,

I’m wondering whether we can change the pack size after a backup has already been started.
Currently, we are trying to back up 85 TB of image files to an S3 bucket and have started to face some challenges. The backup got extremely slow (0.5G per day) after uploading 23 TB.
After some digging in the forum and docs, I saw that restic has introduced a pack-size parameter. The backup has been interrupted a few times so far (the first snapshot never completed). Therefore, I’m wondering whether we can adjust the pack size to 128 MiB (this would be a good value for us, since our average file size is about 60 MiB) and resume the backup without any problem. Or is it better to restart from scratch (clean the bucket)?

Thank you in advance.

Restic will work correctly even if you mix multiple pack sizes. However, this might result in more files being stored in the repository than necessary. (The pack size parameter is currently only available in beta builds, not in a released version.) Which restic version are you currently using?
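For reference, in builds that already include the feature, the pack size is set per backup run, roughly like this (flag name and value are per the beta feature; check restic backup --help on your build):

# Assumed invocation on a build that ships the pack size option (size in MiB):
restic backup /mnt/digi-backup-ref-digi --pack-size 128 --verbose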

I don’t see the connection to small pack files here.

For the expected repository size, the large pack size is definitely useful. However, this is mostly independent of the average file size: restic cuts all files into smaller pieces of approx. 1 MB, checks for duplicate data and then combines many of those pieces into a pack file.
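To put rough numbers on the repository-size argument (the average pack sizes below are assumptions for illustration, not exact restic defaults):

# Approximate number of pack files for ~85 TiB of packed data at two assumed average pack sizes:
echo $(( 85 * 1024 * 1024 / 16 ))    # ~16 MiB packs  -> ~5.6 million pack files
echo $(( 85 * 1024 * 1024 / 128 ))   # ~128 MiB packs -> ~0.7 million pack files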

Thank you for the clarification. Yes, I know the pack-size parameter is in beta. Currently, we are using version 0.13.1. That is why I was wondering whether we should wait for the next release or build the new version from source to tune some of the new parameters that will be introduced.

I agree. I’m not sure why it got so slow, but it started after we had to interrupt the backup. Since then, restic would run for about 3 days without uploading much data to S3 and then exit with an error, e.g.:

18.08.2022 20:00:02: Backup /mnt/digi-backup-ref-digi to REF_DIGI started.

21.08.2022 02:28:15: open repository
lock repository
load index files
no parent snapshot found, will read all files
start scan on [/mnt/digi-backup-ref-digi]
start backup on [/mnt/digi-backup-ref-digi]
scan finished in 12919.004s: 6993679 files, 81.543 TiB

21.08.2022 02:28:16: unable to create lock in backend: repository is already locked by PID 144384 on PMFS-BLHA-BA01V by root (UID 0, GID 0)
lock was created at 2022-08-21 02:25:03 (3m12.6896949s ago)
storage ID c2672cbc
the `unlock` command can be used to remove stale locks

The backup in this example started on the 18th and ended on the 21st with the given output. Before the backup started, I removed all locks by running restic unlock, so I don’t understand why it says the repository was locked 3 minutes ago. So far we have been using restic without any problems (it works wonderfully) for backing up databases and file systems in the GiB range; only with those in the TiB range have we started to face problems.
My guess is that something went wrong during the interruption and restic can’t resume it.

In that case, restic first has to re-read, re-chunk and re-hash all already saved data, just to see that it is already present.

There is an experimental PR which allows “resuming” backups. “Resume” means that, in addition to using already saved data, trees which are fully saved are also used as parent trees to prevent the re-reading of files mentioned above.

But note that this PR saves additional data in the cache to be able to resume, so it doesn’t help if you try to resume a backup that was made without this PR.

That log snippet looks strange. We’re seeing the output of multiple commands.

21.08.2022 02:28:16: unable to create lock in backend: repository is already locked by PID 144384 on PMFS-BLHA-BA01V by root (UID 0, GID 0)

I’m not sure which command that was exactly, but it is definitely not backup. In addition, with no output after scan finished, it looks like the backup command might still be running?

There is also something odd about the log timestamps: how can 21.08.2022 02:28:15: open repository and 21.08.2022 02:28:16: unable to create lock in backend [...] have essentially the same timestamps when they appear to have run several hours apart?

Depending on how fast restic is able to read the data and upload it, it can take multiple days for a single backup run to complete; e.g. at 100 MB/s I arrive at roughly 10 days for 90 TB. You might want to split the backup into smaller tasks (on the order of 10 TB) if that’s possible without too much effort.
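If the directory layout allows it, such a split could look roughly like this (assuming the data is already organized into subdirectories of manageable size):

# One restic run per top-level subdirectory instead of a single 85 TB run.
for dir in /mnt/digi-backup-ref-digi/*/ ; do
  restic backup "$dir" --tag REF_DIGI --verbose
done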

Yes, that was our expectation. With the initial rate we had, we were expecting the backup to finish in 15 days. Unfortunately, we had to interrupt it and haven’t been able to resume it since. Breaking it into 10 TB chunks is not really an option for us :frowning:

Hmm, maybe the lock error is coming from restic forget. However, this should run only after restic backup is done. Below is the bash script that we are running.

Get-Date () {
  local func_result=$(/bin/date "+%d.%m.%Y %H:%M:%S")
  echo "$func_result"
}

Write-Status () {
  local status=$1
  local msg=$2

  if [ $status -ne 0 ]
  then
        echo "$(Get-Date): ${msg}"$'\n'>> $log_error_file
        echo "$(Get-Date): Backup Aborted with errors. See ${log_error_file} for more details." >> $log_out_file
  else
        echo $'\n'"${msg}"$'\n' >> $log_out_file

  fi

}

Backup-Dir () {

  local backup_path=$1
  local repo_folder=$2
  local tag=$3
  local restic_repository="${s3_bucket}${repo_folder}"

  echo "$(Get-Date): Backup ${backup_path} to ${repo_folder} started." >> $log_out_file
  export RESTIC_REPOSITORY=$restic_repository

  msg=`restic backup $backup_path --tag $tag --verbose  2>&1`
  Write-Status $? "${msg}"
  msg=`restic forget --tag $tag --keep-daily $keep_daily --keep-weekly $keep_weekly 2>&1`
  Write-Status $? "${msg}"

  echo "$(Get-Date): Backup ${backup_path} to ${repo_folder} ended."$'\n' >> $log_out_file
}

#Add backups here with the following parameters:
# folder_to_backup, restic_repo_folder, tag

Backup-Dir "/mnt/digi-backup-ref-digi" "REF_DIGI" "REF_DIGI"

I’d strongly recommend first logging which command is about to be executed and only then running it.
It might also be a good idea to append restic’s output directly to the logfile instead of capturing it first. My feeling is that this could somehow be related to the behavior we’re seeing.
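As a sketch of what I mean, the backup call in Backup-Dir could be changed roughly like this (variable names taken from your script):

  echo "$(Get-Date): running: restic backup ${backup_path} --tag ${tag} --verbose" >> $log_out_file
  # Append restic's output directly to the log instead of capturing it into a variable.
  restic backup $backup_path --tag $tag --verbose >> $log_out_file 2>&1
  rc=$?
  Write-Status $rc "restic backup for ${backup_path} exited with status ${rc}"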

To create the snapshot incrementally, you could also try the following: first run restic for a part of the data set and let it create a (partial) snapshot. Then run it for a larger part of the data set (all previous files plus a bunch more) and manually specify the id of the previous snapshot as --parent $snapshotId. Repeat until the full data set is backed up.
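A sketch of that approach (the part names are placeholders for subsets of your data):

# Pass 1: back up only a part of the tree; note the snapshot ID restic prints at the end.
restic backup /mnt/digi-backup-ref-digi/partA --tag REF_DIGI

# Pass 2: a larger part, with the previous partial snapshot as explicit parent,
# so files already contained in it are skipped instead of being re-read.
restic backup /mnt/digi-backup-ref-digi/partA /mnt/digi-backup-ref-digi/partB \
  --tag REF_DIGI --parent $snapshotId

# Repeat with ever larger subsets until the full data set is covered.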

@MichaelEischer Thank you for your support so far; it has been very helpful.
I would like to give you a follow-up on the problem. Basically, it happens when resuming a big backup (on the TiB scale).

Restic ran for about 3 days and then exited with no error message. It turned out that the process was killed due to an out-of-memory error. Currently, the backup server has 8 GB of RAM and 2 CPU cores.

It seems the error was fixed by setting a more aggressive garbage collection target for the Go runtime with export GOGC=10 (the default is 100).
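Concretely, it is just set in the environment before restic is invoked, i.e. something along these lines:

# Lower the Go garbage collection target so the runtime collects more aggressively
# (trades some CPU time for a smaller peak heap); the Go default is GOGC=100.
export GOGC=10
restic backup /mnt/digi-backup-ref-digi --tag REF_DIGI --verbose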

When the system runs out of memory, the kernel usually sends a SIGKILL to a program that uses lots of memory. It is not possible to handle that signal in restic. It might be possible for the bash script to collect more information, but there’s nothing we can do in restic (except optimizing the memory usage).
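As a rough sketch, the wrapper script could at least detect the kill and record the kernel’s OOM messages right after the restic call (assuming it is allowed to read the kernel log):

rc=$?
if [ $rc -eq 137 ]; then   # 128 + 9: the restic process was terminated by SIGKILL
  dmesg | grep -i -E 'out of memory|oom|killed process' | tail -n 20 >> $log_error_file
fi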