Restic Cache - Best Practices?

jimp · April 29, 2019, 4:45pm

What is the best way to manage Restic’s cache? On the servers I’ve been using with Restic, it seems the cache grows indefinitely. For example, I have one VM with 20 GB of data, and while tracking down disk usage I found Restic’s cache had 85 GB in it after a few months of backups running 4x per day. The retention policy is set to keep daily backups for a week, then monthly for a year. However, I rarely run prune on the repository. A similar usage pattern exists for the other servers, and their caches grow unbounded as well.

I eventually delete the cache folder manually, but it’s not like Restic restores it on the next run. If I delete 30GB it doesn’t return on the next run or even dozens later. I know something in the cache is needed, but overall it feels like there’s a bug with it growing indefinitely. So what is the best way to manage this? Am I doing something wrong?

(Using restic 0.9.5 on linux/amd64)

Dj0k3 · April 29, 2019, 6:34pm

I don’t know a lot, but I think restic’s cache depends on what’s actually in the repository. If you run prune, I think the cache will clean itself up. Using forget alone removes snapshots but does not remove the data from the repository; that’s why prune exists. If you are just backing up without forget and prune, the cache will increase in size because you are only adding data to the repository, so the cache has to keep everything from when you started making backups until now. You could run restic cache --cleanup; if there are old cache directories, this command will remove them. It seems weird, though. I’m also using 0.9.5, my repository contains approx. 200 GiB of raw data (including a couple of VMs that take up 80 GiB between them) and my cache dir is only 536M.
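As a rough sketch of the forget-then-prune routine described above: the retention flags below are my reading of "daily for a week, then monthly for a year", and the `run` and `maintain_repo` helpers are hypothetical names made up for illustration. It assumes the repository location and password are already provided through the environment.

```shell
#!/bin/sh
# Sketch only: assumes RESTIC_REPOSITORY and a password source are already exported.

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the two maintenance steps.
maintain_repo() {
  # Keep 7 daily and 12 monthly snapshots (assumed policy);
  # --prune also deletes the data that no remaining snapshot references.
  run restic forget --keep-daily 7 --keep-monthly 12 --prune
  # Remove cache directories that restic considers old.
  run restic cache --cleanup
}
```

Calling `maintain_repo` with `DRY_RUN=1` set only prints the two commands, which is a cheap way to check the policy before letting cron run it for real.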

Keep in mind that if you run restic while the VMs are running, restic will still do its work, but since running machines keep changing, the backup will take more time and the repository will always grow, so I assume the cache will grow too. If you want more reliable snapshots, it is better to stop the VMs, take a snapshot, and then resume the machines.

A little “trick” I use is to run a script from cron, and in that script I use an if statement to check whether any VM is running; if so, the script exits, so I don’t end up with a bunch of useless data in my repository (at least for my use case it is useless; I just use these machines for testing and some work that requires another OS).

I share what I use in my script in case it is something useful for you.

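# Skip this backup run if any VirtualBox VM is currently running.
if [[ "$(VBoxManage list runningvms | wc -l)" -gt 0 ]]; then
  # Strip the {uuid}, quotes, and spaces so only the VM names are printed on one line.
  echo "[$(VBoxManage list runningvms | sed -e 's/{[0-9a-zA-Z-]*}//g' -e 's/"//g' -e 's/ //g' | paste -sd' ' -)] VM is running..."
  exit
fi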

cdhowie · April 29, 2019, 6:48pm

There’s no need to stop the VMs if they live in a logical volume and the volume group has free space. You can take an LVM snapshot and use that as the basis for the backup – which you should do anyway, since restic backups are not atomic.

For example, if your servers run a database service and you are backing up the datadir with restic (as opposed to dumping the database using the engine’s dump tool), then you must use LVM snapshots or the backed-up copy of the datadir can end up corrupt.
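A minimal sketch of the LVM snapshot approach, to make the steps concrete. The volume group `vg0`, logical volume `data`, snapshot name, size, and mount point are all placeholder assumptions, and `run` and `snapshot_backup` are hypothetical helpers. It needs root and enough free space in the volume group to hold the changes made while the backup runs.

```shell
#!/bin/sh
# Sketch only: all names below are placeholders, adjust them to your system.
VG="${VG:-vg0}"
LV="${LV:-data}"
SNAP_NAME="${SNAP_NAME:-restic-snap}"
SNAP_SIZE="${SNAP_SIZE:-5G}"
MNT="${MNT:-/mnt/restic-snap}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

snapshot_backup() {
  # Freeze a point-in-time view of the volume.
  run lvcreate --snapshot --size "$SNAP_SIZE" --name "$SNAP_NAME" "/dev/$VG/$LV" || return 1
  run mkdir -p "$MNT"
  # Mount read-only (XFS would additionally need the nouuid option).
  if run mount -o ro "/dev/$VG/$SNAP_NAME" "$MNT"; then
    run restic backup "$MNT"
    rc=$?
    run umount "$MNT"
  else
    rc=1
  fi
  # Always drop the snapshot so it does not fill up and slow down writes.
  run lvremove -f "/dev/$VG/$SNAP_NAME"
  return "$rc"
}
```

With `DRY_RUN=1` set, `snapshot_backup` only prints the commands it would run. Note that the files end up in the restic snapshot under the mount point path rather than their original location.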

jimp · April 29, 2019, 9:16pm

Good tips, thanks. I am backing up from snapshots whenever possible.

When I run restic cache --cleanup it basically tells me there’s nothing to clean up. The docs say it will clean up “old” cache directories, but I always have just the one from the beginning of time. It never gets past “0 days” since it was last used, so it is never eligible for deletion. Should there be multiple directories appearing in there?
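To see how many cache directories there actually are and how big each one is, something like the sketch below might help. It assumes the cache lives in the default Linux location (`$XDG_CACHE_HOME/restic` or `~/.cache/restic`) and, as far as I understand, holds one subdirectory per repository; `cache_usage` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: list the size of each subdirectory of the restic cache.

# Hypothetical helper; pass a directory to override the assumed default location.
cache_usage() {
  dir="${1:-${XDG_CACHE_HOME:-$HOME/.cache}/restic}"
  # One line per cache subdirectory, size in KiB, largest last.
  du -sk "$dir"/*/ 2>/dev/null | sort -n
}
```

Running `cache_usage` with no argument inspects the default location; `cache_usage /some/other/dir` checks a cache that was moved elsewhere. A single line of output would match a setup where the machine only ever backs up to one repository.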