Compression support has landed in master!

First time using the new build on my iMac to back up a Windows user profile after running ddrescue on a dying disk, mounting via Dislocker, and running Restic to grab the good stuff!

Total data backed up: 272.32 GiB
Total data after dedup: 179.08 GiB - 66% of original size
Total data actually written to disk (compressed): 118.56 GiB - 44% of original size!

:smiley:


Cool! This is at max compression?

That was at auto. I intend to mostly back up using max compression, but in this case I was on a time crunch! I’m currently copying my old repository over at max compression, and it’s taken all week and is a little over halfway done haha

With that in mind, it copied everything in 1:15:57, so it read at roughly 59.72 MB/s! This was from an M.2 SATA SSD to a Fusion Drive repo (1 TB SSD cache + 5 TB HDD). Normally I get 300 MB/s easily out of it when the cache isn’t full, but I had been copying the old repo over, so it had probably slowed down to HDD speeds.

Awesome - this is fantastic!
Can someone share some information about the type of compression algorithm used and why that was chosen?
I tried to find it in the documentation, but didn’t have any luck.
The statistics already posted show that it seems to be efficient and performant, so that is great to hear.

Further to @yorkday’s question, could anyone also comment on the current state of (de-)compression parallelism?

On my system (rpi4 & NAS), auto compression gives only a minor speed hit (CPU and network load are quite balanced), while max is very CPU limited. It looks like only one core is being used, so there should be scope for speedup?

Either way, even in its current form the devs have done a great job :smile:

Restic uses Zstandard. It has a great compression ratio, adds almost no overhead if the data can’t be compressed further, and it is fast at both compressing and decompressing data.
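
For completeness: on current builds the compression mode can be chosen per invocation with the global --compression option (auto is the default for the new repository format). A rough example, with the repository and source paths as placeholders:

    # placeholder repo and source paths
    restic -r /srv/restic-repo backup --compression auto ~/data   # default, balanced
    restic -r /srv/restic-repo backup --compression max ~/data    # best ratio, more CPU
    restic -r /srv/restic-repo backup --compression off ~/data    # store uncompressed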


Hmm, restic will read at most two files in parallel from disk, then compress them with as many threads as there are CPU cores, and afterwards write the compressed parts to the repository. So if restic is using only a single CPU core, that sounds like it’s somehow I/O-bound. Or maybe you’re backing up lots of small files?

I’m copying an existing V1 repo, if that makes a difference?

Sort of. Currently the limited CPU usage is a result of Fix stuck repack step by MichaelEischer · Pull Request #3717 · restic/restic · GitHub and Stream packs in `check --read-data` and during repacking by MichaelEischer · Pull Request #3484 · restic/restic · GitHub. The performance issue will be fixed once Asynchronously upload pack files by MichaelEischer · Pull Request #3489 · restic/restic · GitHub and Adjust worker goroutines to number of backend connections by MichaelEischer · Pull Request #3611 · restic/restic · GitHub are merged.

For now you can use -o local.connections=5 to let restic use all four CPU cores.
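
For example, when copying to a repository on local disk (the paths below are just placeholders; newer builds take the source repository via --from-repo, older ones used --repo2):

    # raise the local backend's connection limit so more workers run in parallel
    restic -r /mnt/fusion/restic-v2 -o local.connections=5 \
        copy --from-repo /mnt/old-disk/restic-v1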


I see people say they are transitioning their v1 repos to v2 by using the ‘copy’ command.

Are you talking about this?
https://restic.readthedocs.io/en/stable/045_working_with_repos.html#copying-snapshots-between-repositories

If so, the general idea is

  • create a new v2 repo
  • use ‘copy’ command to copy all snapshots into the v2 repo

Are there any downsides to doing the transition this way instead of creating a new v2 repo?

It’s best to follow the instructions at the end of the compression changelog entry: restic/issue-21 at master · restic/restic · GitHub. As long as you copy the chunker parameters over, the repository should behave similarly to a new v2 repo.
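
With current flag names, those instructions boil down to roughly the following (repo paths are examples; older builds used --repo2 instead of --from-repo):

    # create the new v2 repo, reusing the old repo's chunker parameters so
    # deduplication against the existing data keeps working
    restic -r /mnt/backup/repo-v2 init --repository-version 2 \
        --copy-chunker-params --from-repo /mnt/backup/repo-v1

    # then copy all snapshots into the new repo
    restic -r /mnt/backup/repo-v2 copy --from-repo /mnt/backup/repo-v1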

@arkadi Also make note of this! :slight_smile:

I’ll be stopping my current copy “migration” and doing this immediately, too!


I’ll be interested to hear any feedback on performance with that option @akrabu . After a week, my current migration is now complete and rcloned off-site, so I’m not inclined to delete it and start again just to see if it’s quicker. But if you’re still working on it… :wink:

So I had let it run from 4 pm yesterday on a single snapshot, and it had reached 25% by 10 am this morning. I canceled and restarted when I saw this advice at 10 am, and at 3 pm it’s already at 16%. Seems like an improvement to me!


Tested the feature as well:

Most of the source data is pictures and other common media files. With compression = max, the CPU (i 9700, 8 cores) was fully loaded. In my view, a brilliant feature!

EDIT: Tested with “restic_v0.13.0-147-g88a8701f_windows_amd64.exe”

Best Regards


I’ve just merged the other PR which adds the migration for existing repos.
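
Assuming that’s the upgrade_repo_v2 migration, the in-place upgrade and the optional repacking of old data look roughly like this once it’s in your build (repo path is a placeholder):

    # upgrade the repository format in place from v1 to v2
    restic -r /mnt/backup/restic-repo migrate upgrade_repo_v2

    # optionally compress data that was uploaded before the upgrade
    restic -r /mnt/backup/restic-repo prune --repack-uncompressed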


This is great. I’ve been playing around with this PR for a while, and I like how flexible the migration process is.

I like that I can use compression for the new blobs and leave the old ones intact, if downloading all my remote backup is an issue.


That’s great news, thanks for all the work!

Upon testing, it seems that prune doesn’t save preliminary indexes (like backup does). With the new use case of repacking uncompressed data (prune --repack-uncompressed) and potentially very long runtimes of days or weeks, any interruption means having to start all over. Would it make sense to add this functionality to prune? Happy to open an issue!

It was a conscious decision to disable the creation of preliminary indexes while reworking how prune works to simplify the code. With the current handling of duplicate blobs in prune, saving preliminary indexes can only increase the amount of work for prune but never decrease it. Maybe prune: Handle duplicate blobs more efficiently by aawsome · Pull Request #3290 · restic/restic · GitHub is enough to alleviate most of the problems in that regard.

So unfortunately changing prune to work incrementally requires quite a bit more work than just saving preliminary indexes. But you still can open an issue :slight_smile: .


@dhopfm you could use --max-repack-size to limit the packs that are repacked, and therefore the runtime of a single prune run. This way you can repack your uncompressed data step by step.
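
For example (repo path and size limit are just illustrations):

    # repack at most ~50 GiB of uncompressed data in this run; rerun the
    # command later to work through the remainder incrementally
    restic -r /mnt/backup/restic-repo prune --repack-uncompressed --max-repack-size 50G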

About the PR @MichaelEischer mentioned:
This PR alone doesn’t help with the problem that aborted and restarted prune runs start all the repacking from the beginning: if only the “old” index files are present, all pack files created by the aborted prune run are considered unreferenced and not needed, so they are simply deleted at the beginning of the restarted prune run.

However, you can run rebuild-index after the aborted prune run, which will more or less simulate that prune had written preliminary indexes.
Without the mentioned PR this actually worsens the situation, as the original pack-to-repack and the pack(s) created by repacking it are both marked for repacking :unamused:
But with this PR, prune will treat the original pack-to-repack as completely unused and simply select it for deletion, whereas the pack(s) created by the repacking are usually selected for keeping. So the repacking work is “saved” even for aborted prune runs. :smile:
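
Concretely, that workaround looks like this (repo path is an example):

    # make the pack files written by the aborted prune run known to the index
    restic -r /mnt/backup/restic-repo rebuild-index

    # then restart the prune; with the mentioned PR, the already repacked
    # data is kept instead of being repacked again
    restic -r /mnt/backup/restic-repo prune --repack-uncompressed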

So one way would be to merge that PR and then add logic to save preliminary indexes. BTW, this is how I implemented the prune run in rustic: preliminary indexes are saved and the logic of the mentioned PR is also implemented.