Hi manfredlotz,
Here is our situation: hundreds of S3 buckets where hundreds of restic clients save backup data. Some restic clients create one snapshot every 15 minutes, and our customers keep them for months.
These buckets hold up to 10 or 15 TB of data.
Some MinIO servers provide the S3 buckets, and their backends are always ZFS (via NFS) with compression enabled.
The average ZFS compression ratio we see on those zpools is 2%.
From my point of view, restic doesn't need compression.
Ah, that’s not quite right: what you’re seeing is that data in the restic repository (stored on zfs) does not compress very well. That’s not surprising: all data in the repo is encrypted and therefore has high entropy. Compression within restic would mean compressing files before encryption/deduplication.
Based on this output, I think that it saved you about a quarter of that; you need to consider the deduplicated size as well. Note that 2.7TB of compressed data was deduplicated. The benefits of compression go down as the benefits of deduplication go up.
For example, let’s say you have 500 copies of the same 1GB file, and that file compresses to 250MB.
The original data set, when compressed, goes from 500GB to 125GB. Once it is deduplicated, it drops further to 250MB.
It’s tempting to say that compression saved you 375GB, but it did not – deduplication alone would have reduced the 500GB data set to 1GB.
Adding compression to deduplication in this scenario only saves 750MB, nowhere near 375GB.
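The arithmetic above can be written out in a few lines of Go. The numbers (500 copies of a 1 GB file that compresses 4:1) come straight from the example; the `savings` helper is just for illustration:

```go
package main

import "fmt"

// savings computes, in GB, what remains of n identical copies of a file
// of size fileGB (which compresses by ratio) under three strategies:
// compression alone, deduplication alone, and dedup plus compression.
func savings(n int, fileGB, ratio float64) (compressedOnly, dedupOnly, both float64) {
	raw := float64(n) * fileGB
	return raw * ratio, // every copy compressed, none deduplicated
		fileGB, // one unique copy kept, stored uncompressed
		fileGB * ratio // one unique copy, compressed
}

func main() {
	const raw = 500.0 // GB
	compressedOnly, dedupOnly, both := savings(500, 1.0, 0.25)
	fmt.Printf("compression alone:   %.2f GB (saves %.0f GB)\n", compressedOnly, raw-compressedOnly)
	fmt.Printf("dedup alone:         %.2f GB (saves %.0f GB)\n", dedupOnly, raw-dedupOnly)
	fmt.Printf("dedup + compression: %.2f GB (extra saving over dedup: %.2f GB)\n", both, dedupOnly-both)
}
```

Deduplication does almost all of the work here; compression only shaves the last 750 MB off the single remaining copy.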
I recently moved from Borg to Restic. One major drawback with Borg is that even with a moderately sized repository (3 GB of files in /var/www), Borg doesn't do incremental backups well: it copies a LOT of data to a new file in the backup location. If you're storing your backup locally that's OK, but if you're backing up to the cloud that's going to use a lot of bandwidth. Restic only uploads what is required, so it is much more bandwidth-efficient.
I run Restic on an AWS t2.nano that right now has 40 MB of physical RAM available (but with a lot of RAM used as cache) and 400 MB of virtual memory free (a t2.nano has 512 MB of RAM). It backs up a few GB nightly, about 83,000 files. CloudWatch says my RAM usage goes from 37% to 53% during backups, for about a minute. Resource usage for this amount of data seems really reasonable.
Unfortunately those stats aren't worth much without knowledge of the dataset, number of snapshots, etc. Take a look at the stats of one of my repositories.
# restic stats --mode restore-size
repository 21006ba7 opened successfully, password is correct
scanning...
Stats for all snapshots in restore-size mode:
Total File Count: 43591952
Total Size: 4.492 TiB
# restic stats --mode raw-data
repository 21006ba7 opened successfully, password is correct
scanning...
Stats for all snapshots in raw-data mode:
Total Blob Count: 692615
Total Size: 76.261 GiB
Those stats are really impressive, but the repository contains lots of snapshots with mostly static data.
What would be really great is a comparison of restic and Borg on real-life data, to see whether compression is worth the effort. But since Borg is restricted to SSH backends, I haven't bothered to use it in a while.
That's awesome; restic is indeed far faster. I noticed this a while ago because I use both programs, and Borg takes longer to save changes (note that I use restic with the sftp backend while Borg saves data directly to an external HDD, and restic is still much faster). That, plus the fact that you don't need any server-side setup for restic, made me switch from Borg to restic completely.
The only downside of restic is the lack of compression, but I assume compression would slow restic down too, because it would have to decompress the stored data before doing anything else with it.
Actually, I'm currently evaluating which backup software might be best for me to use. I only looked at Borg and restic; others I didn't investigate further. For example, Duplicati is a Mono-based application, which is a no-go for me.
The lack of compression is a drawback for restic, but my tests showed that it is not as bad as I had previously assumed.
Restic's speed comes from goroutines. This is really nice, and if restic is too aggressive with resources, there are ways to tame it (nice, ionice, …).
I don't think compression would slow restic down much, because there are lightweight compression methods that are pretty fast.
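The speed/ratio trade-off is easy to see even with Go's standard-library DEFLATE, which exposes levels from "fastest" to "best compression" (lz4 and zstd make the fast end even cheaper; this sketch just illustrates the principle):

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"time"
)

// compress runs DEFLATE at the given level and reports output size and duration.
func compress(data []byte, level int) (int, time.Duration) {
	var buf bytes.Buffer
	start := time.Now()
	w, _ := flate.NewWriter(&buf, level)
	w.Write(data)
	w.Close()
	return buf.Len(), time.Since(start)
}

func main() {
	data := bytes.Repeat([]byte("a fairly typical log line with some variation 12345\n"), 200000)
	for _, lvl := range []int{flate.BestSpeed, flate.DefaultCompression, flate.BestCompression} {
		n, d := compress(data, lvl)
		fmt.Printf("level %2d: %8d bytes in %v\n", lvl, n, d)
	}
}
```

The fastest level typically costs a fraction of the time of the highest one while still removing most of the redundancy, which is why a "fast but not maximal" codec is a reasonable default for a backup tool.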
Although not ideal (but what is ideal in a world of duality?), for me restic is the winner of the game.
I don't think deciding by file type would be the right way. Restic saves data in chunks, and a given chunk can belong to different files, even files of different types.
Compression should happen at the chunk level, and (without knowing the code; my knowledge of Go is very limited) it should be easy to implement: when writing a chunk, restic tries to compress it, and if it doesn't shrink, it writes the chunk uncompressed. A single flag for the compression type (none, lz4, …) in each chunk header would be sufficient. Hash values would always be computed over the uncompressed chunk, so nothing else in the code path has to change; only reading and writing chunks are affected.
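That "compress, and fall back to storing raw if it doesn't shrink" idea can be sketched in a few lines. This is not restic's actual code; the flag constants and `encodeChunk` are hypothetical names, and DEFLATE stands in for whatever codec the flag would select:

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// Compression flags for a hypothetical one-byte chunk header.
const (
	compNone    byte = 0
	compDeflate byte = 1
)

// encodeChunk tries to compress a chunk; if the result isn't smaller,
// the chunk is stored uncompressed. The flag byte records the choice.
// The chunk's hash would be computed over the raw bytes, before this step.
func encodeChunk(chunk []byte) (flag byte, payload []byte) {
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.BestSpeed)
	w.Write(chunk)
	w.Close()
	if buf.Len() < len(chunk) {
		return compDeflate, buf.Bytes()
	}
	return compNone, chunk
}

func main() {
	text := bytes.Repeat([]byte("hello world "), 1000)
	flag, payload := encodeChunk(text)
	fmt.Printf("flag=%d, %d -> %d bytes\n", flag, len(text), len(payload))
}
```

Because already-compressed chunks (zip, jpeg, …) fail the "did it shrink?" test, they automatically get stored raw with no per-file-type logic needed.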
On the other hand, I am sure that every chunk (compressed or not) has a checksum, i.e. the checksum of the raw data. When restic encounters a zip file and decides not to compress it, it could check whether the checksums of the zip file's chunks already exist in the repository. If a chunk already exists, it doesn't matter whether that existing chunk was stored compressed or not; the checksum is the criterion.
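The key point is that deduplication keys on the hash of the *uncompressed* bytes, so it works regardless of how a chunk happens to be encoded on disk. A toy content-addressed store in Go (the `store`/`put` names are mine, purely for illustration):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// store is a toy content-addressed chunk store: chunks are keyed by the
// SHA-256 of their uncompressed bytes, so deduplication is independent
// of whether an existing chunk was written compressed or raw.
type store struct {
	chunks map[[32]byte][]byte
}

// put returns true if the chunk was new, false if it was deduplicated.
func (s *store) put(raw []byte) bool {
	id := sha256.Sum256(raw)
	if _, ok := s.chunks[id]; ok {
		return false // already stored; skip, whatever its on-disk encoding
	}
	s.chunks[id] = raw
	return true
}

func main() {
	s := &store{chunks: map[[32]byte][]byte{}}
	fmt.Println(s.put([]byte("chunk A"))) // true: new chunk
	fmt.Println(s.put([]byte("chunk A"))) // false: deduplicated
}
```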
I don't know, but we must also check how fast restore is, and how fast prune and check are, because they are part of every backup workflow. And restic is very slow there, at least on my machines…
For the record, most of the shortcomings mentioned here have since been addressed: the code for the archiver and for restic prune was rewritten from scratch, and we've just merged compression support to master; see