VirtualBox machine backup

I’m trying to back up a guest Ubuntu server on VirtualBox, hosted on OS X. As it’s a small team server, it can afford to be down for a few minutes (even an hour) during the night.

I’ve learned from experience that backing up a virtual machine while it’s running makes no sense, as the files inside the guest machine would be inconsistent, particularly MySQL or PostgreSQL databases. To get a consistent backup you have to stop (or pause) the virtual machine (that’s the case for both VirtualBox and VMware).

So I’m stopping the machine, exporting it as an .ova file (a tgz whose content is everything you need to rebuild a VM) and rebooting the machine. The whole process takes about ten minutes, which I can afford.
The problem is that the tgz is created every day during the night. Restic sees it as a brand new file, certainly because its creation date is new. What’s more, even its content differs from the previous one due to gzip compression. So backing up this ova/tgz file is not a good solution; it won’t benefit from deduplication.

I tried extracting the ova/tgz file (which gives vmdk files) to benefit from dedup and speed up restic. But again, that is not the solution, as every day the vmdk files are seen by restic as brand new files (certainly due to creation date and inode).

I took a step back and thought I could just stop the VM and do my backup with restic, but restic is giving me an ETA of 20 hours.

I don’t know what VirtualBox does when you shut down the guest system. The running machine was already backed up, so these 20 hours mean there’s a huge difference between the running and stopped instances.

In short
1- Backing up a running VM makes no sense
2- Exporting the VM as ova makes a new huge (15GB) file every day (which means many hours)
3- Extracting the ova makes new vmdk files every day (same many hours)
4- Stopping the machine and doing a restic backup asks for 20 hours even if the running machine is already in the repo.

I also thought that if I had a first backup, the fourth solution would be much faster on the second run, so I waited for 20 hours, then

  • Booted up the virtual machine,
  • Ran a simple sudo apt-get update followed by an apt-get upgrade, which told me there was only one 300KB package to upgrade, using no additional disk space.
  • Stopped the virtual machine
  • Launched the same restic backup command …
    This was two hours ago and restic is telling me to wait for 11 more.

So I’m stuck, if anyone has an idea…

Can you share some details about system usage (CPU, memory, hard disk) during backup? Which backend are you using? Are you saving to local disk?

I think you are overcomplicating this very much. Just back up the VM from inside it, using restic.

To be clear: inside your VM, schedule a backup using restic, to back up the data and other relevant files in the VM that you want to be able to restore if needed.

If you ever need to restore, you can just create a new VM, install Ubuntu in it, then restore the files from your restic backup.

No need to shut down the VM, export it, and whatnot. Just back it up as if it were a regular physical machine.

Where are you trying to send this backup? What’s the size of your VM? Are you backing up something else at the same time, or just this VM? I use VirtualBox and have two VMs (54.1GB) plus a lot of small documents; the total amount of data I’m backing up is 68.4GB. The first run (including the two VMs) took about 2 hours via sftps to a small home server on my local network (an RPi). However, backing up 6GB to a remote sftps server took the same amount of time (without the VMs), and snapshots take between 20 and 30 minutes (including check, forget and prune).

If your VM is around 15GB, something is weird. The only time it takes longer for me to back up my VMs is when I’m working on them, and that’s because they’re not off. You could also use restic from inside the VM, like @rawtaz said. It’s easier.

Sure, sorry…
So the host is a
Core i7 at 3.5 GHz with
16GB RAM.
The hard disk is a 1TB CoreStorage volume (meaning it’s a combination of a small SSD and a big HD). I looked into this and, as far as I understand, the CoreStorage layer doesn’t change anything about the filesystem; it sits between the filesystem and the hardware.

The guest is Ubuntu 16.04
4GB Ram. 20 GB disk.
Disk is vmdk because it’s easier to resize.
Inside the vmdk partition is ext4.

Destination is S3 at Wasabi.
And for example, an hourly backup during the night, even though my uplink is very slow, gives:
Files: 0 new, 34 changed, 286200 unmodified
Dirs: 0 new, 1 changed, 0 unmodified
Added to the repo: 3.942 MiB

processed 286234 files, 204.496 GiB in 4:10
snapshot 4c2e3e8d saved

That is when I exclude the vmdk files.

Backing up from inside the VM means a lot more work at restore time than restoring a single file which contains everything, even more so when you have some special server running on it.
I also maintain another backup which is only the dump of the PostgreSQL database, but I want to minimize the restoration time.
In case of catastrophe, rebuilding the machine could cost me at least half a day, if not a full day; on the other hand, restoring the whole virtual server at once takes about an hour or so.

I’ve had a lot of success doing exactly what you attempted in your step 4, and I agree it’s by far the simplest method if a restore is ever needed.

Now, since it works so well for me (a few minutes to back up a full 50GB VM, and that is going to a remote rest server in another country), you should try to understand what is so slow in your use case.

Restic WILL need to reread the whole VM image on every backup, but it will only save/send changed blocks, thanks to deduplication.
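
To illustrate why only the changed parts end up in the repo, here is a rough sketch. It is a simplification under stated assumptions: restic itself splits files with content-defined chunking, whereas this toy version uses fixed 4 KiB blocks, and the names `block_ids` and `new_blocks` are made up for the example.

```python
import hashlib


def block_ids(data, block_size=4096):
    """Split data into fixed-size blocks and return the SHA-256 id of each block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def new_blocks(repo_ids, data, block_size=4096):
    """Return the ids of blocks in data that the repo (a set of ids) does not hold yet."""
    return [b for b in block_ids(data, block_size) if b not in repo_ids]


# A toy "disk image" made of 16 distinct blocks (hypothetical data).
image = b"".join(bytes([i]) * 4096 for i in range(16))
repo = set(block_ids(image))          # first backup: every block is stored

# Overwrite a single byte in place, as a guest write to the disk would.
changed = bytearray(image)
changed[5000] = 0xFF
to_upload = new_blocks(repo, bytes(changed))
print(len(to_upload))                 # only the one block containing the change
```

The whole image still has to be read and hashed on every run, which is why the read speed of the host disk matters even when almost nothing is uploaded.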

Do some baseline tests: just read the whole file, send it to /dev/null, and see how long it takes. Do you use the cache? It’s a must for remote repos and immensely speeds up many operations.
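
A minimal sketch of such a baseline read test, assuming Python is available on the host. The function name `read_throughput` and the example path are hypothetical; the data is simply read and discarded, which is the same idea as sending it to /dev/null.

```python
import time


def read_throughput(path, chunk_size=4 * 1024 * 1024):
    """Read a file sequentially, discard the data, return (bytes_read, seconds)."""
    total = 0
    start = time.monotonic()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.monotonic() - start


def report(path):
    """Print the read rate in MiB/s for one pass over the file."""
    size, seconds = read_throughput(path)
    rate = size / (1024 * 1024) / seconds if seconds > 0 else float("inf")
    print(f"{size} bytes in {seconds:.1f}s ({rate:.1f} MiB/s)")


# Example call (hypothetical path to the disk image):
# report("/Users/me/VirtualBox VMs/team-server/disk.vmdk")
```

If the file was read recently, the operating system may serve part of it from its page cache, so a second run can look faster than the disk really is.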

That’s what I thought. Yeah, doing a full backup of a VM (and by that I mean backing up the VirtualBox folder) is really the best way of backing up a VM. Backup programs I used in the past didn’t do a great job with VMs and I had to build the machines again. I had backups of the data, but it is really too much work. I prefer to back up the whole VM folder now, and restic works great with VMs.

I don’t include my VMs in the backup to my remote server for that same reason. You could do this: back up the VM to a local disk or another local computer, maybe even on the host if it’s always on. Then you can sync your repo using rclone. That way your backup will be much faster, and you can leave rclone running in the background doing the sync and keep working on the server once the backup has finished. I don’t know if this sounds good to you, or maybe there are other solutions that I can’t think of right now, but you could try it and see how it works. It may even upload faster, because restic will be working with the repository locally, which is much faster, and rclone will handle the sync, something rclone does really well.

As this week is coming to an end, I actually ran my VM backup right now and these are the results:

Files: 0 new, 1 changed, 0 unmodified
Dirs: 0 new, 2 changed, 0 unmodified
Added to the repo: 10.030 GiB

processed 1 files, 48.000 GiB in 6:19

So, it took slightly more than 6 minutes to read a 48 GiB VM image, find that 10 GiB of data had changed in the past week, and send those changes to another country for backup purposes (to a very cute €4/mo storage VPS dedicated to all my backups).

As already mentioned, the remote side is a rest server with an nginx frontend (to provide SSL termination and authentication), and I find this setup works well and is quite performant. Disaster recovery of the VM is also a no-brainer, and I did it successfully once in the past. It took quite a long time, as the restic restore operation is still not well optimized, especially for restoring from remote repos (with additional latency) like this one. But it worked.

To summarize, @Vartkat, your step no. 4 is the sound strategy; you just need to find out what is making it so slow in your case. Is it a slow disk, a slow network, or something else… FTR, restic 0.9.2 here, official Debian buster package.

Ideally, you do the following sequence:

  • Pause the VM
  • Take a filesystem snapshot on the host
  • Resume the VM
  • Run the restic backup from the snapshot filesystem
  • Delete the snapshot

No idea if your setup supports this.
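
The sequence above could be sketched roughly like this. It is only an outline under assumptions: the host snapshot commands depend entirely on the host filesystem, so they are passed in as parameters, and the LVM volume names, mount point and the function name `snapshot_backup` are hypothetical. The restic repository and password are assumed to come from the usual environment variables.

```python
import subprocess


def snapshot_backup(vm, create_cmds, backup_path, cleanup_cmds, run=subprocess.run):
    """Pause the VM, snapshot the host filesystem, resume, back up, clean up."""
    run(["VBoxManage", "controlvm", vm, "pause"], check=True)
    try:
        # Take (and mount) the host snapshot while the guest is not writing.
        for cmd in create_cmds:
            run(cmd, check=True)
    finally:
        # Resume the guest even if the snapshot failed.
        run(["VBoxManage", "controlvm", vm, "resume"], check=True)
    try:
        run(["restic", "backup", backup_path], check=True)
    finally:
        # Always remove the snapshot so it does not fill up.
        for cmd in cleanup_cmds:
            run(cmd, check=True)


# Hypothetical example for a Linux host using LVM (names are made up):
LVM_CREATE = [
    ["lvcreate", "--snapshot", "--size", "5G", "--name", "vmsnap", "/dev/vg0/vms"],
    ["mount", "-o", "ro", "/dev/vg0/vmsnap", "/mnt/vmsnap"],
]
LVM_CLEANUP = [
    ["umount", "/mnt/vmsnap"],
    ["lvremove", "-f", "/dev/vg0/vmsnap"],
]
# snapshot_backup("team-server", LVM_CREATE, "/mnt/vmsnap", LVM_CLEANUP)
```

The guest is only paused for the time it takes to create the snapshot; the slow part, the restic run, happens against the frozen copy.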

My uplink is very slow (about 150 Kbits max), but even at that speed I’m getting good backup times when I exclude the VM.
@zcalusic you were right, another process was eating up my bandwidth, so I stopped it.
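
For a sense of scale, here is a back-of-the-envelope calculation of what such an uplink means for a full upload. The assumptions are mine: "150 Kbits" is read as 150 kilobits per second, the 15GB export mentioned earlier is taken as decimal gigabytes, and protocol overhead is ignored. The function names are made up.

```python
def upload_seconds(size_bytes, rate_bits_per_second):
    """Time needed to push size_bytes through a link of the given speed."""
    return size_bytes * 8 / rate_bits_per_second


def as_hours(seconds):
    """Convert seconds to hours."""
    return seconds / 3600


full_image = 15 * 10**9   # the 15GB .ova, assumed to be decimal gigabytes
uplink = 150_000          # assumed meaning of "150 Kbits": 150 kbit/s

print(round(as_hours(upload_seconds(full_image, uplink))))  # about 222 hours
```

If the figure actually meant 150 kilobytes per second, the result would be eight times smaller, roughly 28 hours. Either way, anything that makes restic treat the image as new data costs many hours, while a run that uploads only a few MiB finishes quickly.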

I will try @wscott method and let you know.

Thanks for all these suggestions.

If you have a slow uplink and snapshotting the file system is not possible/wanted, you could as well do a fast backup to local storage and push this later (e.g. with rclone) to the remote storage.
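
A small sketch of that two-step approach, assuming a restic repository already initialised on a local disk and an rclone remote already configured. The repository path, the remote name `wasabi:my-bucket/restic` and the function name are hypothetical, and the restic password is assumed to come from the environment.

```python
import subprocess


def local_then_remote(source, local_repo, remote, run=subprocess.run):
    """Back up to a local restic repo first, then mirror the repo with rclone."""
    # Fast step: the VM only needs to stay down while this runs.
    run(["restic", "-r", local_repo, "backup", source], check=True)
    # Slow step: push the repository files over the uplink afterwards.
    run(["rclone", "sync", local_repo, remote], check=True)


# Example call (all names are made up):
# local_then_remote("/Users/me/VirtualBox VMs/team-server",
#                   "/Volumes/Backup/restic-repo",
#                   "wasabi:my-bucket/restic")
```

Since `rclone sync` makes the destination match the source, the remote copy should be treated as a mirror of the local repo and not written to directly.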

After some investigation I discovered that I had an old process kickstarting my VM at 3:30 in the morning. That resulted in an inconsistent vmdk file, as it was neither stopped nor running.
I believed I had a completely stopped machine, but that was never the case, as the virtual machine was changing state during the backup.

Things are getting better, as my 20GB machine now takes less than an hour to back up (even if restic’s initial ETA is longer).
I also tried @wscott’s solution, but the paused state seems to be a third case: restic announced more than 30 hours, close to what backing up the whole file would take.
So I’ve gone from

Files: 8 new, 0 changed, 0 unmodified
Dirs: 3 new, 0 changed, 0 unmodified
Data Blobs: 2425 new
Tree Blobs: 4 new
Added to the repo: 1.746 GiB

processed 8 files, 19.573 GiB in 9:01:49
snapshot 8d913f93 saved

to

Files: 0 new, 8 changed, 0 unmodified
Dirs: 0 new, 3 changed, 0 unmodified
Data Blobs: 422 new
Tree Blobs: 4 new
Added to the repo: 277.330 MiB

processed 8 files, 19.574 GiB in 40:00
snapshot 4434e5e9 saved

which makes more sense, even if these 277 MB are much more than what was really added inside the virtual machine (a few tens of MB).
As I read that you can see what’s inside a vmdk file with 7-Zip, I suppose vmdk is, in a manner, a compressed format.

Finally, my method is to stop the virtual machine, back up, then start it again.
I hope all these attempts will be useful to someone else.
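
For anyone wanting to script the stop / back up / start method, a minimal sketch follows. It rests on assumptions: the guest reacts to the ACPI power button, the restic repository and password come from the environment, and the function names and example VM name are hypothetical. Anything else on the host that can start the VM (such as a scheduled job) has to be disabled separately.

```python
import subprocess
import time


def vm_state(info_text):
    """Extract VMState from `VBoxManage showvminfo --machinereadable` output."""
    for line in info_text.splitlines():
        if line.startswith("VMState="):
            return line.split("=", 1)[1].strip().strip('"')
    return None


def wait_for_poweroff(vm, run=subprocess.run, sleep=time.sleep, timeout=600, poll=5):
    """Poll the VM state until it reports poweroff, or give up after timeout seconds."""
    waited = 0
    while waited <= timeout:
        out = run(["VBoxManage", "showvminfo", vm, "--machinereadable"],
                  check=True, capture_output=True, text=True).stdout
        if vm_state(out) == "poweroff":
            return
        sleep(poll)
        waited += poll
    raise TimeoutError(f"{vm} did not power off within {timeout}s")


def stop_backup_start(vm, vm_dir, run=subprocess.run, sleep=time.sleep):
    """Shut the guest down cleanly, back up its folder, then boot it again."""
    run(["VBoxManage", "controlvm", vm, "acpipowerbutton"], check=True)
    wait_for_poweroff(vm, run=run, sleep=sleep)
    try:
        run(["restic", "backup", vm_dir], check=True)
    finally:
        # Bring the server back up even if the backup failed.
        run(["VBoxManage", "startvm", vm, "--type", "headless"], check=True)


# Example call (hypothetical names):
# stop_backup_start("team-server", "/Users/me/VirtualBox VMs/team-server")
```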

FWIW, the pause/resume step is unnecessary if the filesystem snapshot is truly atomic (such as an LVM snapshot). Pausing the VM doesn’t actually accomplish anything.

I am sorry, but that just isn’t true. It is like saying that if you have a desktop machine running some program and you could magically copy the hard drive in an instant to a duplicate machine, then when you turn on that duplicate machine it would be running the same program at the same place.

It would behave as if you had yanked the power from the wall and then plugged it back in. Most OSs would recover from that and discard any files that were in the process of being written, but you do lose state.

That said, doing this is still a backup of the VM and would save most of the files in the VM, but you do run a risk of corruption, especially if that VM was running something like a database.

We all back up VMs without pausing them first. I do too, but it is important to understand that there is a risk.

While pausing the VM might be better than an LVM snapshot, it is still not safe. The only safe way to back up a VM is to shut it down.

OK, yeah, you’re right. But you can actually leave it running:

  1. Save a VM snapshot
  2. Run backups
  3. Remove the VM snapshot

Then if you need to restore you might need to resume the saved snapshot to get back all the data.
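
A rough sketch of those three steps with VBoxManage, to be taken as an outline and not a tested recipe. The snapshot name and function name are made up, the restic repository and password are assumed to come from the environment, and `--live` asks VirtualBox to take the snapshot without stopping the guest.

```python
import subprocess


def backup_with_vm_snapshot(vm, vm_dir, snap_name="pre-backup", run=subprocess.run):
    """Take a VirtualBox snapshot, back up the VM folder, then delete the snapshot."""
    run(["VBoxManage", "snapshot", vm, "take", snap_name, "--live"], check=True)
    try:
        run(["restic", "backup", vm_dir], check=True)
    finally:
        # Deleting the snapshot merges its changes back into the base disk.
        run(["VBoxManage", "snapshot", vm, "delete", snap_name], check=True)


# Example call (hypothetical names):
# backup_with_vm_snapshot("team-server", "/Users/me/VirtualBox VMs/team-server")
```

As far as I understand, while the snapshot exists the running guest writes to a separate differencing image, so the base disk and the saved state stay stable during the backup, but the newest files in the VM folder can still change under restic.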

What if the VM is running a database such as MySQL or Postgres with cached data?

And pausing does not fix that situation unless by “pause” you mean “suspend state to disk.” In VirtualBox, all pause does is temporarily halt execution, but the memory contents of the VM are still only in system RAM and haven’t been written to disk. This is useful if you aren’t taking a filesystem/block-level snapshot on the host, but are instead backing up the guest disk file(s) directly as it prevents writes during your backup (which can result in corruption that does not look like a power cut).

Pausing doesn’t assist host-based filesystem/block snapshot mechanisms in any way since they can already take atomic snapshots of the host disk state.

“Close and save state” might be the option you’re thinking of (write the machine state including RAM to disk and stop the VM), but this requires a more significant amount of downtime. As a side effect, this can also leak sensitive memory contents of the guest (cryptographic key material, for example) if you aren’t using encryption features of your hypervisor (e.g. VirtualBox VM encryption).

Indeed. If you’re running a production database, this might also be fine. I can’t think of a production-quality database that doesn’t do its own journaling specifically to recover from an unexpected power cut.

As far as “losing state” goes, that’s kind of a silly argument, IMO. In the context of a DB server, for example: data that the DB server has claimed to have stored (returned success to the client) is guaranteed to be persisted in the VM image on disk in some form (directly in the DB data files, or in some write-ahead log). In that case, the application perceives no data loss.

Yes, you throw out some amount of unprocessed data, but we understand that backups are point-in-time, yes? Backups are not meant to persist changes that effectively haven’t happened yet. We have to draw a line in time somewhere and say “changes made before this line are backed up” and it’s a bit unreasonable to say that the backup process is defective because changes that haven’t been committed yet will get rolled back. That’s the whole point, no?

Obviously, this depends on what software you are running in the VM. The general rule of thumb is: if the guest software can handle a power cut correctly, the VM does not need to be suspended at all if your host can take a filesystem-level snapshot (ZFS, btrfs, Windows VSS) or block-level snapshot (LVM).

And I’d argue that if your services can’t handle a power cut correctly, they shouldn’t be used in production to begin with.