Restic Disk Usage Has Me Tearing My Hair Out

I’m backing up a 2.2 TB folder with its 144 subfolders to a new remote server mounted as an NFS volume. Each subfolder is a separate customer database, so I am calling restic 144 times, each with a different customer-specific tag, so I can manage and restore each database individually if necessary. After the initial set of restic runs, the restic repo folder on the remote server is 4.4 TB in size, even though the source and all the subfolders combined are only 2.2 TB. After the second day, the remote restic repo is 9 TB in size. The remote disk, which is 15 TB, fills up after only a few days. What the heck is restic doing? Or maybe the better question is, what am I doing wrong?

Even if you were to back up all 144 database subfolders in one and the same backup run, you’d still be able to restore each of them individually if you wanted to. I don’t really see the point of creating one separate backup snapshot per database - just back all of them up in one go (and see Restoring from backup — restic 0.16.3 documentation for how to use the --include option of the restore command).
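Roughly like this, for illustration (the repo path, the databases path and customer042 are just placeholders, not taken from your setup):

restic -r /path/to/repo backup /path/to/databases
restic -r /path/to/repo restore latest --target /tmp/restoretest --include /path/to/databases/customer042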

That’s very odd. It would be interesting to know what an initial backup of all subfolders in one backup run yields in terms of used disk space on the repository server.

This sounds like restic is unable to partially or fully deduplicate the contents you are backing up. Since you haven’t told us anything useful in terms of what you are actually backing up and how that data is created, it’s hard to point to much more than e.g. the --rsyncable option of gzip, in case you’re using something like that to compress the database dumps (if that’s what they are).
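For example, if the dumps were compressed with gzip, something like the following keeps small changes in a dump from rewriting the entire compressed stream, which would otherwise defeat restic’s deduplication (this assumes your gzip build supports --rsyncable, and mydb is of course just a placeholder):

mysqldump --single-transaction mydb | gzip --rsyncable > mydb.sql.gz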

Hard to tell! Please specify the obviously relevant info, such as restic version, what backend you’re using, the exact command (including any environment variables) you use to run the backup(s), and what the data you’re backing up consists of and how it’s produced.

Thanks for the feedback.

restic version = restic 0.13.1-dev (compiled manually) compiled with go1.18 on linux/amd64

backend = not sure what you’re looking for here. The filesystems on the source and destination servers are ZFS with compression enabled. The underlying storage is NVMe.

exact command (including any environment variables) = restic -p $PASSWORD_FILE --tag $SITE -r /restic_remote_repo.nfs --verbose backup /zpool0/db_rsyncs/$SITE

The above command is executed 144 times, once for each MySQL folder. The only thing that changes is the $SITE variable.

…where $SITE is a 3-digit number.
There are no environment variables.
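The wrapper is essentially equivalent to this (a simplified sketch, not the actual script):

for d in /zpool0/db_rsyncs/*/; do
    SITE=$(basename "$d")
    restic -p $PASSWORD_FILE --tag $SITE -r /restic_remote_repo.nfs --verbose backup /zpool0/db_rsyncs/$SITE
done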

data you’re backing up = 144 folders, each containing a complete MySQL folder structure, which was placed there by an rsync operation from the DB servers. The backup goes…

DB server → rsync to → storage server 1 → restic to → storage server 2

What’s the uncompressed size of the source data?

Is the data the raw mysql databases or a sql dump?

As you’re using 0.13.1-dev, is the repository using repository format version 1 or 2?

The uncompressed source data on the DB server is 4.8 TB.

They are not SQL dumps. There are simple directory copies taken with the MySQL instances all in a down state.

That’s a fair question. How would I tell?

Then this is probably what’s going on. Restic sees the data as uncompressed, which means it sees 4.8 TB. It backs this up, and on the remote side it stores a bunch of binary encrypted stuff, which is hardly very compressible by ZFS. So you end up with 4.4 TB.

That said, restic’s compression should be able to deal with this a bit, I suppose; let’s see what Michael says.

The source data is 4.8 TB, but when it gets rsync’d to storage server 1, ZFS compresses it to 2.2 TB. If restic sends 4.8 TB of uncompressed data over the wire to storage server 2, I would expect ZFS on storage server 2 to do the same thing as storage server 1 did when it received the raw data from the DB server. The end result should still be 2.2 TB of used space on storage server 2.

It’s a further mystery to me that the Day 2 backup is twice the size of Day 1. It’s like there’s no incremental backup being done, just another full copy.

No, because the data is not the same when stored on server 2 as it is/was when stored on server 1.

  • On server 1, the ZFS sees compressible data and thereby successfully compresses it from 4.8 TB to 2.2 TB on disk.

  • On server 2, the ZFS there sees less compressible data (restic’s encrypted binary blobs et al), and thereby is not as successful in compressing it (but manages to do so from 4.8 TB to 4.4 TB on disk) - or it’s restic that compresses the data from 4.8 TB to 4.4 TB.

At least that’s my theory, but whether it’s correct or not depends on the answers to Michael’s questions 🙂

Recent restic dev builds print the repository version when running a command, e.g. restic snapshots prints:

repository c881945a opened (repository version 1) successfully, password is correct

Version 2 supports compression, but that feature is for now only available in dev builds, and since it is opt-in, your repository is probably still using version 1. Especially as the ZFS compression shows that the data compresses reasonably well.
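If you want to check directly rather than digging through the backup output, restic cat config dumps the repository config as JSON, and the version field in there tells you whether it is a version 1 or version 2 repository:

restic -p $PASSWORD_FILE -r /restic_remote_repo.nfs cat config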

That is indeed strange. That would essentially require the db to change data in like every MB of the database, which sounds a bit much. How much uploaded data does restic report when running the incremental backup?

Since there are 144 separate runs of restic each day, I will pick one that is reasonably large as an example. Here’s the log from Wednesday…

14:08:02: — site719 —
14:08:02: exec: restic --tag site719 -r /restic_remote_repo.nfs --verbose backup /zpool0/db_rsyncs/site719
14:26:44: open repository
14:26:44: lock repository
14:26:44: no parent snapshot found, will read all files
14:26:44: load index files
14:26:44: start scan on [/zpool0/db_rsyncs/site719]
14:26:44: start backup on [/zpool0/db_rsyncs/site719]
14:26:44: scan finished in 39.814s: 13124 files, 399.499 GiB
14:26:44: Files: 13124 new, 0 changed, 0 unmodified
14:26:44: Dirs: 12 new, 0 changed, 0 unmodified
14:26:44: Data Blobs: 754686 new
14:26:44: Tree Blobs: 13 new
14:26:44: Added to the repo: 383.849 GiB
14:26:44: processed 13124 files, 399.499 GiB in 18:41
14:26:44: snapshot ccb16896 saved

Note that it added about 400 GB to the repo because it was the first run on a fresh/empty repo.

And here’s the log from Thursday for the same backup…

09:03:33: exec: restic --tag site719 -r /restic_remote_repo.nfs --verbose backup /zpool0/db_rsyncs/site719
09:09:08: open repository
09:09:08: lock repository
09:09:08: using parent snapshot ccb16896
09:09:08: load index files
09:09:08: start scan on [/zpool0/db_rsyncs/site719]
09:09:08: start backup on [/zpool0/db_rsyncs/site719]
09:09:08: scan finished in 66.770s: 13138 files, 402.077 GiB
09:09:08: Files: 15 new, 514 changed, 12609 unmodified
09:09:08: Dirs: 0 new, 8 changed, 4 unmodified
09:09:08: Data Blobs: 43682 new
09:09:08: Tree Blobs: 9 new
09:09:08: Added to the repo: 22.987 GiB
09:09:08: processed 13138 files, 402.077 GiB in 5:34
09:09:08: snapshot f52a2f5b saved

That one added only about 23 GB.

And yet the restic_repo folder was 4.4 TB after the initial backup, and 9 TB after the second day.
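It might be worth cross-checking what restic itself thinks it has stored, independently of du. On reasonably recent restic builds, the raw-data mode of the stats command reports the total size of the blobs actually stored in the repository (as opposed to the restore size):

restic -p $PASSWORD_FILE -r /restic_remote_repo.nfs stats --mode raw-data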

What commands are you using when you inspect a) the compressed and uncompressed size of the source data on server 1, b) the compressed and uncompressed size of the restic_repo folder on server 2?

Also, how often do you run these 144 different backups?

I use df -h on the source database server. It has 144 separate filesystems, so I add up the “Used” amounts. Then I use du -hs on the db_rsyncs folder on storage server 1, and then du -hs on the restic_repo folder on storage server 2. I’m aware that there are differences in the way usage is calculated between df and du.

Every day.

So just once per day, or what?

Do you get any other results when you use zfs list instead?
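Comparing du with and without --apparent-size can also help separate the logical file sizes from what is actually allocated on disk after ZFS compression (assuming GNU du; on ZFS, plain du reflects the compressed/allocated blocks while --apparent-size reports the uncompressed file sizes):

du -sh /zpool0/db_rsyncs
du -sh --apparent-size /zpool0/db_rsyncs

…and the same for the restic_repo folder on server 2.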

zfs list has a lot of options. The basic command does not show much of use.

On storage server 1…

[root@store50b db_backup_stage2]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
zpool0 7.72T 6.64T 7.72T /zpool0

On storage server 2…

[root@store50a zpool0]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
zpool0 10.1T 4.26T 10.1T /zpool0

Are there any special options you’d like me to try?

Not in particular, I was mostly curious whether you are seeing different numbers when you use the ZFS commands instead. The examples you gave don’t seem to match the sizes you’ve written earlier, so I’m not sure what to make of it.

You could run zfs get all | grep compress to see what you get out of that though. But please do so for the relevant datasets; I presume you have more than just the root/pool filesystem?
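For example, something like this shows the compression setting, the achieved compression ratio, and (on reasonably recent ZFS versions) the logical versus allocated space per dataset:

zfs get compression,compressratio,used,logicalused zpool0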


Anyway, I think the entire discussion here is happening with only pieces of the information revealed. In order to fully understand what you see, we need to see the same thing, which means e.g. the full logs of all runs involving the problem and the repository, as well as some more specific output from the disk usage checks.

But a more important thing is that instead of dealing with this somewhat big source and repository, IMO the sane way to debug this further is to do the following:

  1. Create a new folder the same way as the others on server 1, but with just some test data as the source to back up. This can be a copy of one of your current databases if you want, but it should be reasonably big so that any anomaly is easy to spot (e.g. if its disk usage on the repository server were to double, that’s easier to see with a large size). It also makes sense to keep it to just one folder and one single backup job, so you don’t have to fiddle with multiple logs from different jobs and can focus on one minimal test case.

  2. Create a new repository on server 2, the same way you created the other ones, but for this test source data explicitly.

  3. Back up the source data to the new repository the same way you back up the others, and collect logs as usual. Try to reproduce the problem.

  4. If you can reproduce the problem with this test data and test repository, do the very same thing but with the source data and destination repository being on “regular” filesystems such as ext4, to see if that makes any difference to the outcome on the repository side.

If there is a difference when you don’t use ZFS, that’s a good starting point to dig further into.

Generally speaking, restic backs up the data it sees, deduplicates it (and compresses it, if you’re running a version of restic that supports compression), and then stores encrypted binary files on the repository side. It’s quite unlikely that it backs up more data than the changes it detects and reports in the output of the backup runs, and on the repository side it really doesn’t store more data than it sent. So this is why it’d be good to isolate the matter by removing the ZFS component (assuming you can reproduce the problem in a controlled manner), especially since you had a similar issue in this forum a while back which was never fully understood or figured out.

It’s all about systematic isolation and testing to verify and draw conclusions, until you find the root cause.

On the source server, I created a single subfolder containing a MySQL directory tree. As you can see, the folder occupies 130 GB on disk…

[root@store50b zpool0]# du -hs db_rsyncs_2/site092
130G db_rsyncs_2/site092

On the destination server, I created a new restic repo…

[root@store50a zpool0]# mkdir restic_repo_2
[root@store50a zpool0]# restic -r restic_repo_2 init
created restic repository 09f32ebfc1 at restic_repo_2

I added the new repo to the NFS exports and reloaded the NFS server service.

From the source server, I mounted the new repo as an NFS volume…

[root@store50b /]# mount -t nfs store50a:/zpool0/restic_repo_2 /restic_remote_repo_2.nfs

I then ran restic against the source subfolder…

[root@store50b zpool0]# restic --tag site092 -r /restic_remote_repo_2.nfs --verbose backup /zpool0/db_rsyncs_2/site092
open repository
repository 09f32ebf opened (repo version 1) successfully, password is correct
created new cache in /root/.cache/restic
found 1 old cache directories in /root/.cache/restic, run restic cache --cleanup to remove them
lock repository
no parent snapshot found, will read all files
load index files
start scan on [/zpool0/db_rsyncs_2/site092]
start backup on [/zpool0/db_rsyncs_2/site092]
scan finished in 1.502s: 12679 files, 293.054 GiB

Files: 12679 new, 0 changed, 0 unmodified
Dirs: 11 new, 0 changed, 0 unmodified
Data Blobs: 569363 new
Tree Blobs: 12 new
Added to the repo: 283.796 GiB

processed 12679 files, 293.054 GiB in 17:17
snapshot f4d3c9a7 saved

I note that restic is using repo version 1, and it saw 293.054 GiB of data.

On the destination server, the restic_repo_2 folder is now 285 GB in size…

[root@store50a zpool0]# du -hs restic_repo_2
285G restic_repo_2

For comparison purposes, I did a simple rsync of the source folder from the source to the destination server…

[root@store50b zpool0]# rsync -avh /zpool0/db_rsyncs_2/site092 store50a:/zpool0

On the destination server, that folder occupies 130 GB…

[root@store50a zpool0]# du -hs site092
130G site092

So… we find that the source folder is 130 GB, and when using rsync the destination folder is also 130 GB; however, when using restic, the destination folder is 285 GB.
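If I’m reading the units right, the 285G that du reports for the repo (du -h uses binary units, i.e. GiB) is almost exactly the 283.796 GiB restic said it added, so ZFS doesn’t appear to be compressing the restic pack files at all, whereas the plain rsync’d copy compresses from roughly 293 GiB of file data down to 130 GiB on disk.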

No, there’s just the zpool0 root filesystem on both servers.
