Can I change pack-size after first backup?

Hello guys,

I’m wondering whether we can change the pack size after a backup has already been started.
Currently, we are trying to back up 85 TB of image files to an S3 bucket and have started to face some challenges. The backup got extremely slow (0.5G per day) after uploading 23 TB.
After some digging in the forum and docs, I saw that restic has introduced a pack-size parameter. The backup has been interrupted a few times so far (the first snapshot never completed). Therefore, I’m wondering whether we can adjust the pack size to 128 MiB (this would be a good value for us, since our average file size is about 60 MiB) and resume the backup without any problem. Or is it better to restart from scratch (clean the bucket)?

Thank you in advance.

Restic will work correctly even if you mix multiple pack sizes. However, this might result in more files being stored in the repository than necessary. (The pack size parameter is currently only available in beta builds, not in a released version.) Which restic version are you currently using?
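For reference, in builds that already include the feature, the pack size is set per backup run, roughly like this (flag name and value are per the beta feature; check restic backup --help on your build):

# Assumed invocation on a build that ships the pack size option (size in MiB):
restic backup /mnt/digi-backup-ref-digi --pack-size 128 --verbose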

I don’t see the connection to small pack files here.

For the expected repository size, the large pack size is definitely useful. However, this is mostly independent of the average file size: restic cuts all files into smaller pieces of approx. 1 MB, checks for duplicate data and then combines many of those pieces into a pack file.
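To put rough numbers on the repository-size argument (the average pack sizes below are assumptions for illustration, not exact restic defaults):

# Approximate number of pack files for ~85 TiB of packed data at two assumed average pack sizes:
echo $(( 85 * 1024 * 1024 / 16 ))    # ~16 MiB packs  -> ~5.6 million pack files
echo $(( 85 * 1024 * 1024 / 128 ))   # ~128 MiB packs -> ~0.7 million pack files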

Thank you for the clarification. Yes, I know the pack-size parameter is in beta. Currently, we are using version 0.13.1. That is why I was wondering whether we should wait for the next release or build the new version from source to tune some of the new parameters that will be introduced.

I agree. I’m not sure why it got so slow, but it started after we had to interrupt the backup. Since then, restic would run for about 3 days without uploading much data to S3 and then exit with an error, e.g.:

18.08.2022 20:00:02: Backup /mnt/digi-backup-ref-digi to REF_DIGI started.

21.08.2022 02:28:15: open repository
lock repository
load index files
no parent snapshot found, will read all files
start scan on [/mnt/digi-backup-ref-digi]
start backup on [/mnt/digi-backup-ref-digi]
scan finished in 12919.004s: 6993679 files, 81.543 TiB

21.08.2022 02:28:16: unable to create lock in backend: repository is already locked by PID 144384 on PMFS-BLHA-BA01V by root (UID 0, GID 0)
lock was created at 2022-08-21 02:25:03 (3m12.6896949s ago)
storage ID c2672cbc
the `unlock` command can be used to remove stale locks

The backup in this example started on the 18th and ended on the 21st with the given output. Before the backup started, I removed all locks by running restic unlock, so I don’t understand why it says the repository was locked 3 minutes ago. So far we have been using restic without any problems (it works wonderfully) for backing up databases and file systems in the GiB range; only with those in the TiB range have we started to face problems.
My guess is that something went wrong during the interruption and restic can’t resume it.

In that case, restic first has to re-read, re-chunk and re-hash all already saved data, just to see that it is already present.

There is an experimental PR which allows “resuming” backups. “Resume” means that, in addition to using already saved data, trees which are fully saved are also used as parent trees to prevent the re-reading of files mentioned above.

But note that this PR saves additional data in the cache to be able to resume, so it doesn’t help if you try to resume a backup that was made without this PR.

That log snippet looks strange. We’re seeing the output of multiple commands.

21.08.2022 02:28:16: unable to create lock in backend: repository is already locked by PID 144384 on PMFS-BLHA-BA01V by root (UID 0, GID 0)

I’m not sure which command that was exactly, but it is definitely not backup. In addition, with no output after scan finished, it looks like the backup command might still be running?

There is also something odd about the log timestamps: how can 21.08.2022 02:28:15: open repository and 21.08.2022 02:28:16: unable to create lock in backend [...] have essentially the same timestamps when they appear to have run several hours apart?

Depending on how fast restic is able to read the data and upload it, it can take multiple days for a single backup run to complete; e.g. at 100 MB/s I arrive at roughly 10 days for 90 TB. You might want to split the backup into smaller tasks (on the order of 10 TB) if that’s possible without too much effort.
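If the directory layout allows it, such a split could look roughly like this (assuming the data is already organized into subdirectories of manageable size):

# One restic run per top-level subdirectory instead of a single 85 TB run.
for dir in /mnt/digi-backup-ref-digi/*/ ; do
  restic backup "$dir" --tag REF_DIGI --verbose
done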

Yes, that was our expectation. With the initial rate we had, we were expecting the backup to finish in 15 days. Unfortunately, we had to interrupt it and haven’t been able to resume it since. Breaking it into 10 TB chunks is not really an option for us :frowning:

Hmm, maybe the lock error is coming from restic forget. However, this should run only after restic backup is done. Below is the bash script that we are running.

Get-Date () {
  local func_result=$(/bin/date "+%d.%m.%Y %H:%M:%S")
  echo "$func_result"
}

Write-Status () {
  local status=$1
  local msg=$2

  if [ $status -ne 0 ]
  then
        echo "$(Get-Date): ${msg}"$'\n'>> $log_error_file
        echo "$(Get-Date): Backup Aborted with errors. See ${log_error_file} for more details." >> $log_out_file
  else
        echo $'\n'"${msg}"$'\n' >> $log_out_file

  fi

}

Backup-Dir () {

  local backup_path=$1
  local repo_folder=$2
  local tag=$3
  local restic_repository="${s3_bucket}${repo_folder}"

  echo "$(Get-Date): Backup ${backup_path} to ${repo_folder} started." >> $log_out_file
  export RESTIC_REPOSITORY=$restic_repository

  msg=`restic backup $backup_path --tag $tag --verbose  2>&1`
  Write-Status $? "${msg}"
  msg=`restic forget --tag $tag --keep-daily $keep_daily --keep-weekly $keep_weekly 2>&1`
  Write-Status $? "${msg}"

  echo "$(Get-Date): Backup ${backup_path} to ${repo_folder} ended."$'\n' >> $log_out_file
}

#Add backups here with the following parameters:
# folder_to_backup, restic_repo_folder, tag

Backup-Dir "/mnt/digi-backup-ref-digi" "REF_DIGI" "REF_DIGI"

I’d strongly recommend first logging which command is about to be executed and only then running it.
It might also be a good idea to append restic’s output directly to the logfile instead of capturing it first. My feeling is that this could somehow be related to the behavior we’re seeing.
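As a sketch of what I mean, the backup call in Backup-Dir could be changed roughly like this (variable names taken from your script):

  echo "$(Get-Date): running: restic backup ${backup_path} --tag ${tag} --verbose" >> $log_out_file
  # Append restic's output directly to the log instead of capturing it into a variable.
  restic backup $backup_path --tag $tag --verbose >> $log_out_file 2>&1
  rc=$?
  Write-Status $rc "restic backup for ${backup_path} exited with status ${rc}"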

To create the snapshot incrementally, you could also try the following: first run restic for a part of the data set and let it create a (partial) snapshot. Then run it for a larger part of the data set (all previous files plus a bunch more) and manually specify the id of the previous snapshot as --parent $snapshotId. Repeat until the full data set is backed up.
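A sketch of that approach (the part names are placeholders for subsets of your data):

# Pass 1: back up only a part of the tree; note the snapshot ID restic prints at the end.
restic backup /mnt/digi-backup-ref-digi/partA --tag REF_DIGI

# Pass 2: a larger part, with the previous partial snapshot as explicit parent,
# so files already contained in it are skipped instead of being re-read.
restic backup /mnt/digi-backup-ref-digi/partA /mnt/digi-backup-ref-digi/partB \
  --tag REF_DIGI --parent $snapshotId

# Repeat with ever larger subsets until the full data set is covered.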

@MichaelEischer Thank you for your support so far; it has been very helpful.
I would like to give you a follow-up on the problem. Basically, it happens when resuming a big backup (on the TiB scale).

Restic ran for about 3 days and then exited with no error message. It turned out that the process was killed due to an out-of-memory error. Currently, the backup server has 8 GB of RAM and 2 CPU cores.

It seems the error was fixed by setting a more aggressive garbage collection target for the Go runtime with export GOGC=10 (the default is 100).
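Concretely, it is just set in the environment before restic is invoked, i.e. something along these lines:

# Lower the Go garbage collection target so the runtime collects more aggressively
# (trades some CPU time for a smaller peak heap); the Go default is GOGC=100.
export GOGC=10
restic backup /mnt/digi-backup-ref-digi --tag REF_DIGI --verbose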

When the system runs out of memory, the kernel usually sends a SIGKILL to a program that uses lots of memory. It is not possible to handle that signal in restic. It might be possible for the bash script to collect more information, but there’s nothing we can do in restic (except optimizing the memory usage).
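As a rough sketch, the wrapper script could at least detect the kill and record the kernel’s OOM messages right after the restic call (assuming it is allowed to read the kernel log):

rc=$?
if [ $rc -eq 137 ]; then   # 128 + 9: the restic process was terminated by SIGKILL
  dmesg | grep -i -E 'out of memory|oom|killed process' | tail -n 20 >> $log_error_file
fi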