Side by Side with Duplicity for a Large MySQL Database

I have a MySQL database that has 10 years of posts from phpBB and its .sql export is currently at 167MB. I back it up daily “all by itself” to a single B2 Bucket. Bottom line: large text file with very small percentage of changes each day.

Because the Bucket seemed to be growing in size very quickly since I switched to restic from duplicity, I just did a side-by-side test with duplicity, specifying the single file. Here are the Bucket Sizes:

  • Day 0 (full backup) - restic 174.2MB; duplicity 47.1MB
  • Day 1 (incremental after file changed slightly) - restic 294MB; duplicity 48.9MB

I have read extensively on the efforts to add compression to restic, so I understand and “can live with” that problem, but I could not find anything (though I had trouble searching) on the incremental approach used by restic, other than a few mentions of detecting identical chunks.

Any thoughts on reducing B2 storage required? I couldn’t find any options that might help. The only options that I am using are --cache-dir on backup and b2.connections=7 on init.

Hi, and welcome to the forum!

I’m not sure what’s going on exactly. The larger repo size is probably duplicity compressing the easily compressible ASCII SQL text.

I’m wondering though why the restic repo grows so much. Could you check the differences between two subsequent snapshots with restic diff? Please post the result here, maybe we get a hint what’s going on.

password is correct
comparing snapshot 88c17760 to a06b274d:

M    /rwphpbb.sql

Files:           0 new,     0 removed,     1 changed
Dirs:            0 new,     0 removed
Others:          0 new,     0 removed
Data Blobs:     82 new,    82 removed
Tree Blobs:      0 new,     0 removed
  Added:   114.182 MiB
  Removed: 114.182 MiB

Please let me know if I should have added any flags to the restic diff to display more info.

I now have Day 2 info to add. To summarize:

  • Day 0 (full backup) - restic 174.2MB; duplicity 47.1MB
  • Day 1 (incremental after file changed slightly) - restic 294MB; duplicity 48.9MB
  • Day 2 - restic 421.2MB; duplicity 50.9MB

In case it helps, here is the standard output during the Day 1 Incremental backup from Duplicity:
Local and Remote metadata are synchronized, no sync needed.
Last full backup date: Sat Feb 3 14:21:56 2018
--------------[ Backup Statistics ]--------------
StartTime 1517759537.07 (Sun Feb 4 08:52:17 2018)
EndTime 1517759538.67 (Sun Feb 4 08:52:18 2018)
ElapsedTime 1.60 (1.60 seconds)
SourceFiles 1
SourceFileSize 174168612 (166 MB)
NewFiles 0
NewFileSize 0 (0 bytes)
DeletedFiles 0
ChangedFiles 1
ChangedFileSize 174168612 (166 MB)
ChangedDeltaSize 0 (0 bytes)
DeltaEntries 1
RawDeltaSize 3048707 (2.91 MB)
TotalDestinationSizeChange 785150 (767 KB)
Errors 0

Is there both deletes, changes and additions in the database? Perhaps there’s simply so much going on that restic can’t deduplicate it very well (assuming it has a larger block size than Duplicity).

Mostly adds, but it is a phpBB forum’s database, so there are user-related rows that keep track of session info, post counts, etc.

This is not the actual database, just the mysqldump export file, which would be the usual gazillions of lines of SQL INSERTs and such.

Yeah, the SQL dump. Do you know what block size Duplicity uses?

I don’t specify block size, so the default listed in man should be correct. Here is the complete description:
–max-blocksize number
determines the number of the blocks examined for changes during the diff
process. For files < 1MB the blocksize is a constant of 512. For files over
1MB the size is given by:

      file_blocksize = int((file_len / (2000 * 512)) * 512)
      return min(file_blocksize, globals.max_blocksize)

      where globals.max_blocksize defaults to 2048.  If you specify a larger
      max_blocksize, your difftar files will be larger, but your sigtar files will
      be smaller.  If you specify a smaller max_blocksize, the reverse occurs.  The
      --max-blocksize option should be in multiples of 512.

Hm, it looks like for that particular database, restic’s chunk size (~1MiB) is too large. The output of the diff command shows that the files in the two snapshots do not have a single blob in common. That’s unexpected, at least for me.

Thanks for the info. After a little thought, I think that I would rather stick with restic than run two different backup technologies. Since I already have my own Backup Server (a Ubuntu VPS), I’ll find a good “diff” tool and store the daily Deltas on the Backup Server, one directory per database, and simply backup those directories to B2 daily.

One thought I had today was that you can try compression with gzip --rsyncable and pipe that to restic…

I’ll give that a try for a couple of days (full, incremental 1, incremental 2) and report back on my results. Thanks for the great idea!

I did the three days in a row backups again, just as I did before, but on a gzip version of the database with --rsyncable specified. The results, with snapshot ID:

  • Day 0 86aeea0f - 46.2MB

  • Day 1 a73abe23 - 81.9MB

  • Day 2 b92c6a65 - 117.7MB
    comparing snapshot 86aeea0f to a73abe23:

    M /rwphpbb.sql.gz

    Files: 0 new, 0 removed, 1 changed
    Dirs: 0 new, 0 removed
    Others: 0 new, 0 removed
    Data Blobs: 25 new, 24 removed
    Tree Blobs: 0 new, 0 removed
    Added: 34.077 MiB
    Removed: 34.100 MiB
    comparing snapshot a73abe23 to b92c6a65:

    M /rwphpbb.sql.gz

    Files: 0 new, 0 removed, 1 changed
    Dirs: 0 new, 0 removed
    Others: 0 new, 0 removed
    Data Blobs: 24 new, 25 removed
    Tree Blobs: 0 new, 0 removed
    Added: 34.096 MiB
    Removed: 34.077 MiB

Same problem here, I make a backup of a full sql server hosting 33 databases.
Each DB is backed in sql.gz file dumped by mysqldump

Here is the result of a diff of the last two snapshots

Files:           0 new,     0 removed,    33 changed
Dirs:            0 new,     0 removed
Others:          0 new,     0 removed
Data Blobs:    198 new,   191 removed
Tree Blobs:      1 new,     1 removed
  Added:   240.140 MiB
  Removed: 243.095 MiB

For dynamicnet and anyone else reading this thread, a little background would be helpful.

Before moving to restic relatively recently, I had been using duplicity for backup (to B2) for about 2 years. Back then, it quickly became apparent that a daily incremental backup of a single compressed file would not be much smaller than doing a full backup each day. Which is why I switched to leaving the .sql files uncompressed. And that worked well with Duplicity because of its relatively small delta comparison block size, and its automatic compression before storing on B2.

Based on my understanding of modern compression algorithms, a given line in an .sql file may not change from day to day, but, if some of the rest of the file changes, gzip will not create the same bit pattern for that line in the .sql file.

I am currently looking at tools to easily create, maintain and restore from a space-efficient daily delta files for a large .sql file. I’ll keep the full set of them on my backup server and back them up to B2 with restic.

At an even more basic level, Incremental can be a two level approach to Backup.

A simple one-level approach would be to only save the files that have changed since the last backup. Complete files would always be saved, never pieces of files.

A second level looks at those changed files that are being saved. The contents of the most recently backed up copy of the file and the current file are compared, and only the differences are saved.

Just for the record: If you compress the SQL dump with gzip, it will almost certainly be a completely different file: The entropy is reduced by compressing it, so restic will not find any old blobs and will rather re-archive the complete file with new blobs…

Just a thought: if you were to use the --rsyncable option in gzip, it might overcome this effect to some extent, as changes will be somewhat localised, so many blobs may remain unchanged.

Further up, fd0 suggested that, but restic did not appear to find any matches while doing the incremental though, of course, the size was a lot smaller. The percentage growth per backup remained the same.