I have a MySQL database that has 10 years of posts from phpBB and its .sql export is currently at 167MB. I back it up daily “all by itself” to a single B2 Bucket. Bottom line: large text file with very small percentage of changes each day.
Because the Bucket seemed to be growing very quickly since I switched from duplicity to restic, I just ran a side-by-side test with duplicity on that single file. Here are the Bucket sizes:
Day 0 (full backup) - restic 174.2MB; duplicity 47.1MB
Day 1 (incremental after file changed slightly) - restic 294MB; duplicity 48.9MB
I have read extensively about the efforts to add compression to restic, so I understand that limitation and can live with it. But I could not find much (though I had trouble searching) on the incremental approach used by restic, other than a few mentions of detecting identical chunks.
Any thoughts on reducing B2 storage required? I couldn’t find any options that might help. The only options that I am using are --cache-dir on backup and b2.connections=7 on init.
I’m not sure what’s going on exactly. The difference in repo size is most likely duplicity compressing the easily compressible ASCII SQL text before upload, which restic currently does not do.
I’m wondering, though, why the restic repo grows so much. Could you check the differences between two subsequent snapshots with restic diff? Please post the result here; maybe it will give us a hint about what’s going on.
In case it helps, here is the standard output during the Day 1 Incremental backup from Duplicity:
Local and Remote metadata are synchronized, no sync needed.
Last full backup date: Sat Feb 3 14:21:56 2018
--------------[ Backup Statistics ]--------------
StartTime 1517759537.07 (Sun Feb 4 08:52:17 2018)
EndTime 1517759538.67 (Sun Feb 4 08:52:18 2018)
ElapsedTime 1.60 (1.60 seconds)
SourceFiles 1
SourceFileSize 174168612 (166 MB)
NewFiles 0
NewFileSize 0 (0 bytes)
DeletedFiles 0
ChangedFiles 1
ChangedFileSize 174168612 (166 MB)
ChangedDeltaSize 0 (0 bytes)
DeltaEntries 1
RawDeltaSize 3048707 (2.91 MB)
TotalDestinationSizeChange 785150 (767 KB)
Errors 0
Are there deletes, changes, and additions in the database? Perhaps there’s simply so much churn that restic can’t deduplicate it very well (assuming it uses a larger block size than duplicity).
I don’t specify a block size, so the default listed in the man page should be in effect. Here is the complete description:
--max-blocksize number
determines the number of the blocks examined for changes during the diff
process. For files < 1MB the blocksize is a constant of 512. For files over
1MB the size is given by:
file_blocksize = int((file_len / (2000 * 512)) * 512)
return min(file_blocksize, globals.max_blocksize)
where globals.max_blocksize defaults to 2048. If you specify a larger
max_blocksize, your difftar files will be larger, but your sigtar files will
be smaller. If you specify a smaller max_blocksize, the reverse occurs. The
--max-blocksize option should be in multiples of 512.
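Plugging the numbers from the Day 1 output into that formula (a quick check, with max_blocksize left at its default) shows why duplicity's deltas stay so small:

```python
# Duplicity's delta block size for the dump above, per the man-page formula.
file_len = 174168612          # SourceFileSize from the Day 1 output
max_blocksize = 2048          # duplicity's documented default

file_blocksize = int((file_len / (2000 * 512)) * 512)
blocksize = min(file_blocksize, max_blocksize)
print(blocksize)  # 2048
```

So duplicity compares this file in 2 KiB blocks, which explains why a few megabytes of scattered row changes produce only a sub-megabyte delta.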
Hm, it looks like for that particular database, restic’s chunk size (~1MiB) is too large. The output of the diff command shows that the files in the two snapshots do not have a single blob in common. That’s unexpected, at least for me.
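For background, restic splits files with content-defined chunking: a rolling checksum over the data decides where chunk boundaries fall, so an insertion only disturbs the chunks around it. Here is a toy sketch of the idea — not restic's actual chunker, and with a much smaller target chunk size so the effect is visible on small data:

```python
import random

WINDOW = 48               # minimum chunk length before a cut is allowed
MASK = (1 << 13) - 1      # ~8 KiB average chunks for this demo (restic targets ~1 MiB)

def chunks(data: bytes):
    """Toy content-defined chunking: cut wherever a rolling checksum hits a pattern."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF   # low bits depend only on recent bytes
        if i - start >= WINDOW and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out

random.seed(42)
day0 = bytes(random.randrange(256) for _ in range(200_000))
day1 = day0[:1_000] + b"-- edited row --" + day0[1_000:]   # small insertion near the front

a, b = chunks(day0), chunks(day1)
shared = len(set(a) & set(b))
print(len(a), len(b), shared)   # most chunks survive the edit unchanged
```

With uncompressed input, the boundaries resynchronize right after the edit and almost every chunk is shared between the two days; with a conventionally gzipped input, the bytes themselves differ everywhere, so nothing is shared regardless of chunk size.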
Thanks for the info. After a little thought, I think I would rather stick with restic than run two different backup technologies. Since I already have my own backup server (an Ubuntu VPS), I’ll find a good “diff” tool, store the daily deltas on the backup server, one directory per database, and simply back up those directories to B2 daily.
I ran the three consecutive daily backups again, just as I did before, but on a gzip version of the database created with --rsyncable. The results, with snapshot IDs:
Day 0 86aeea0f - 46.2MB
Day 1 a73abe23 - 81.9MB
Day 2 b92c6a65 - 117.7MB
comparing snapshot 86aeea0f to a73abe23:
For dynamicnet and anyone else reading this thread, a little background would be helpful.
Before moving to restic relatively recently, I had been using duplicity for backups (to B2) for about two years. Back then, it quickly became apparent that a daily incremental backup of a single compressed file would not be much smaller than a full backup each day, which is why I switched to leaving the .sql files uncompressed. That worked well with duplicity because of its relatively small delta-comparison block size and its automatic compression before storing on B2.
Based on my understanding of modern compression algorithms, a given line in an .sql file may not change from day to day, but if anything else in the file changes, gzip will not reproduce the same bit pattern for that line in the compressed output.
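This effect is easy to reproduce with Python's gzip module (a small demo with made-up SQL-like rows; mtime is pinned so the two outputs are directly comparable):

```python
import gzip

lines = [f"INSERT INTO posts VALUES ({i}, 'post body number {i}');\n" for i in range(5000)]
day0 = "".join(lines).encode()

lines[10] = "INSERT INTO posts VALUES (10, 'edited post body');\n"   # one changed row
day1 = "".join(lines).encode()

gz0 = gzip.compress(day0, mtime=0)
gz1 = gzip.compress(day1, mtime=0)

# How far into the compressed streams do they stay identical?
common = 0
for x, y in zip(gz0, gz1):
    if x != y:
        break
    common += 1
print(f"{common} identical leading bytes out of {len(gz0)}")
```

One changed row near the top of the dump, and essentially the whole compressed stream from that point on differs — so a deduplicating backup tool finds almost nothing to match.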
I am currently looking at tools to easily create, maintain, and restore from space-efficient daily delta files for a large .sql file. I’ll keep the full set of them on my backup server and back them up to B2 with restic.
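For text dumps, even a plain line-based diff gets close to duplicity's delta sizes. A sketch using Python's difflib (the file names and rows are made up; a real setup would restore by replaying the deltas with patch(1) or similar):

```python
import difflib

day0 = [f"INSERT INTO posts VALUES ({i}, 'post body {i}');\n" for i in range(10_000)]
day1 = list(day0)
day1[123] = "INSERT INTO posts VALUES (123, 'edited body');\n"      # one changed row
day1.append("INSERT INTO posts VALUES (10000, 'new post');\n")      # one new row

delta = list(difflib.unified_diff(day0, day1, fromfile="day0.sql", tofile="day1.sql"))

full = sum(len(l) for l in day1)
saved = sum(len(l) for l in delta)
print(f"full dump: {full} bytes, daily delta: {saved} bytes")
```

Storing one such delta per day keeps the dump restorable while adding only kilobytes, and the small delta files then back up to B2 cheaply.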
At an even more basic level, incremental backup can be seen as a two-level approach.
A simple one-level approach would be to only save the files that have changed since the last backup. Complete files would always be saved, never pieces of files.
A second level looks at those changed files that are being saved. The contents of the most recently backed up copy of the file and the current file are compared, and only the differences are saved.
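The second level can be sketched as a fixed-offset block comparison (deliberately simplified — real tools like duplicity's librsync use rolling checksums so that an insertion doesn't shift every later block out of alignment):

```python
def changed_blocks(old: bytes, new: bytes, blocksize: int = 2048):
    """Level 2: keep only the blocks of the changed file that actually differ."""
    deltas = []
    for off in range(0, max(len(old), len(new)), blocksize):
        a, b = old[off:off + blocksize], new[off:off + blocksize]
        if a != b:
            deltas.append((off, b))   # offset + new contents is enough to patch
    return deltas

old = b"x" * 100_000
new = old[:50_000] + b"CHANGED" + old[50_007:]   # overwrite 7 bytes in place
print(len(changed_blocks(old, new)))  # 1 -- only one 2 KiB block differs
```

An in-place edit touches a single block here, so the stored delta is 2 KiB instead of the full 100 KB file; level 1 alone would have re-saved the whole file.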
Just for the record: if you compress the SQL dump with gzip, it will almost certainly be a completely different file each time. The compressed output is dense, high-entropy data with no stable structure, so restic will not find any old blobs and will re-archive the complete file as new blobs…
Just a thought: if you were to use the --rsyncable option in gzip, it might overcome this effect to some extent, as changes will be somewhat localised, so many blobs may remain unchanged.
Further up, fd0 suggested that, but restic did not appear to find any matches while doing the incremental — though, of course, the file itself was a lot smaller. The percentage growth per backup remained the same.