Single huge file incremental backup

I have a single-file database (much like sqlite3) that is rather large (12 GB and growing). It changes every day, but the changes are small, with the majority of the file data remaining unchanged. I need to back it up every day and upload it to a cloud drive, but my internet traffic allowance is very tight, so I need to keep the backup traffic as low as possible, applying the strongest compression to both the base backup and its increments. Can I achieve this with Restic? If so, what is the proper way to do it?

We do this with 6 GB FileMaker backup files.
They change by roughly ±20 MB a day.
We use the rest server in append-only mode.

Basically this:
#!/bin/bash
export RESTIC_PASSWORD_FILE="/restic-pw.txt"
/usr/bin/restic -r rest:https://User:Password@Servername.org/BackupFolder backup /FilemakerBackupLocation --json | tail -15 >> /var/log/BackupRSDC.log
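The append-only part is enforced on the server side. A minimal sketch of how rest-server is typically started for that (binary path, data path and port are placeholders, adjust to your setup):

#!/bin/bash
# Run rest-server in append-only mode so clients can add new backup data
# but cannot delete or overwrite existing snapshots.
/usr/local/bin/rest-server \
  --path /srv/restic-repos \
  --listen :8000 \
  --append-only \
  --private-repos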

If it is true that "the changes being made are very little, with the majority of the file data being unchanged", then the answer is yes: Restic is a very good choice and will be very efficient, transferring only what has changed every day.

Your use case is quite similar to backing up a virtual machine disk image: also quite big, but changing only a little every day, so it is backed up efficiently with Restic.

Of course, the 12 GB file will need to be fully read from disk every day so Restic can see what has changed since the last backup. Also make sure that nobody else writes to the file while the backup is running, so that the backup is consistent and restorable.
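If you cannot fully guarantee that, and the database happens to live on Windows, recent Restic releases can back up from a VSS snapshot so the file stays consistent for the duration of the run. A minimal sketch, with the repository and database path as placeholders:

# Windows only: read the file from a Volume Shadow Copy taken at the start
# of the backup, so concurrent writes by the DBMS don't produce an inconsistent copy.
restic -r rest:https://user:password@servername.org/BackupFolder backup C:\database\mydb.db --use-fs-snapshot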

Definitely give Restic a chance, and it won’t disappoint you. :sunglasses:

Thank you, I’ve given it a try. The DBMS performs its native backup every day, but it is very straightforward and primitive: it only closes all connections to the DB, copies the file to a subdir named backup-[current date and time] in the backup directory (say, D:\backup) and rotates these subdirs so that only the last three are kept, deleting the rest. No increments, no deduplication, no compression; as simple as that. The filename of the DB within these subdirs remains the same, exactly as it is in the working directory of the database. I’ve run the Restic backup on the main backup directory with the compression option set to max, and it ended up at an impressive 5 GB of disk usage compared to 12 GB × 3 = 36 GB of the native backups. But there are a few things left for me to figure out:

  1. Is there a way to instruct Restic to search for and back up a specific filename within all subdirs of a specific directory? These files are virtually identical apart from a few added or changed bits of data. Or does Restic handle this internally while deduplicating the data?
  2. What compression algorithm / library does Restic use? Is it possible to change it, beyond setting the compression level option?
  3. Is it possible to decrease the data chunk size, so that the incremental chunks take up less disk space when the DBMS changes a byte or two of database metadata (updates an index, recalculates a table hash, etc.)?
  4. Is it sufficient to upload the data to the cloud by simply adding the Restic repo as a sync folder in the cloud sync client (namely mega.nz), or is uploading through one of Restic’s own backends (say, the rclone backend) recommended?

Thanks.

I’ll answer only your first question. Generally you don’t need to worry at all. Since restic deduplicates your content, if it notices files it has already seen unchanged, it won’t even read them. It will also use the data from those old files to transfer only the changes from the new ones. Test it and you’ll see that restic does all the grunt work for you in the background.

So, if you have a folder with several such backups of your database, just let restic back it all up, and that’s all there is to it. :slight_smile:

If you’d really like to specify precisely what goes into the backup, restic backup --help will show you many useful flags such as --exclude, --files-from and similar, to help you pick and back up only what you need.
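For instance, here is a sketch (not something you strictly need here) that collects the database file from every rotated subdir and feeds the list to restic. The filename mydb.db, the paths and the repository are placeholders, and on Windows you would build the list with PowerShell instead of find:

# Build a list of just the database files inside the rotated backup-* subdirs,
# then back up only those paths via --files-from.
find /path/to/backup -maxdepth 2 -type f -name "mydb.db" > /tmp/restic-files.txt
restic -r /path/to/repo backup --files-from /tmp/restic-files.txt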

Many thanks!

restic uses zstd. The algorithm is not configurable.
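Only the level is tunable. A sketch assuming restic 0.14 or newer (repository format version 2), where the global --compression flag accepts auto, off and max; the repository and source paths are placeholders:

# Ask restic to use the strongest zstd level for everything it writes.
restic -r /path/to/repo backup /path/to/backup --compression max
# The same can also be set via an environment variable.
export RESTIC_COMPRESSION=max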

The average chunk size is hardcoded to 1 MB and that is unlikely to change any time soon. We’d need to completely rearchitect how the repository index works in order to reasonably support smaller chunk sizes. (The overhead of the current index wouldn’t really matter for a 36 GB backup, but much larger repositories are also supposed to work.)

Please use the rclone backend. Using a cloud sync client will probably also work most of the time; however, restic then has no way to determine whether the backup was actually uploaded correctly.
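A sketch of the rclone backend, assuming an rclone remote for mega.nz has already been configured under the name mega (remote, folder and source paths are placeholders):

# One-time: create the repository on the mega remote through rclone.
restic -r rclone:mega:restic-repo init

# Daily run: restic talks to the cloud through rclone, so upload errors
# are reported back to restic instead of failing silently.
restic -r rclone:mega:restic-repo backup /path/to/backup --compression max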

Got it, thank you very much indeed.

I’ll just add that changing the way you do the db dump might help. If, instead of dumping one big file, you dump each table into a separate file, you might find that some tables change while others stay the same from day to day.
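This isn’t possible with every engine, but for a database that does expose per-table dumps (sqlite3, for example), a hypothetical sketch could look like this; the paths are placeholders and simple table names are assumed:

#!/bin/bash
# Dump each table into its own .sql file, so restic only re-uploads
# the tables whose contents actually changed since the last backup.
DB=/path/to/mydb.sqlite      # placeholder database file
OUT=/path/to/dump-dir        # placeholder output directory
mkdir -p "$OUT"
for table in $(sqlite3 "$DB" "SELECT name FROM sqlite_master WHERE type='table';"); do
  sqlite3 "$DB" ".dump $table" > "$OUT/$table.sql"
done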

There is no intrinsic way to handle the tables within the db separately (it is a closed-source ERP with very limited means of managing data), but the DBMS itself dumps the db every day by simply closing all connections and copying the db file to a separate folder, so I needed a way to back up these native “backups” as efficiently as possible for upload to the cloud. Restic turned out to be the proper tool: I got 5 GB of data out of the three rotated daily copies (12+ GB each) and roughly 90 MB of daily increments (with the compression level set to max), which is quite acceptable in my case.