Beyond release binaries: performance vs. data integrity?

Hi,
I recently compiled my own binary in order to add a couple of features that aren’t in 0.9.6. I’m very happy with the resulting performance improvement, but have no sense for whether such an approach could risk the soundness of my backups (there’s no point improving performance if you compromise your ability to restore!).

There’s probably no black-or-white answer, but I’d appreciate any opinions or advice from the many experienced people here: how safe in general do you feel it is for a relatively-competent-non-expert to be venturing beyond the release binaries? Is this something that you’d simply never advise with a restic-focussed backup solution? Or, in combination with things like restic check --read-data, do you think there is minimal increased risk?
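On the `restic check --read-data` point: the `--read-data-subset` flag (present in 0.9.6) lets a full read-back be split across several smaller runs. A hedged sketch that just prints the commands for each slice so they can be scheduled however suits you; the slice count of 5 is an arbitrary example:

```shell
# Print a schedule of check runs that together read back every pack.
# slices=5 is arbitrary; more slices means shorter individual runs.
slices=5
i=1
while [ "$i" -le "$slices" ]; do
    echo "restic check --read-data-subset=$i/$slices"
    i=$((i + 1))
done
```

After all five slices have been run, every byte of pack data in the repository has been read back and verified once.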

Details of the changes and benchmarks follow if you’re interested, although this isn’t the main point of the question.

Starting from 0.9.6, I added:

  1. Larger pack sizes (following @fd0’s instructions here). I understand the historic reasons for the 4 MB default, but suspected it was a bottleneck in my use case, particularly with SMR drives.
  2. Copy functionality as per PR#2606.
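A back-of-envelope sketch of why the pack size matters so much here: the number of files in the repository scales inversely with pack size, and SMR drives cope far better with a few large files than many small ones. The 817 figure is the repo size in GB from the table further down; the two pack sizes are the configurations being compared:

```shell
# Rough file-count estimate for each minimum pack size.
repo_gb=817
for pack_mb in 4 128; do
    echo "~${pack_mb} MB packs -> roughly $(( repo_gb * 1024 / pack_mb )) data files"
done
```

These estimates (about 209,152 and 6,536) are in the same ballpark as the observed counts in the table below (171,318 and 7,124); the differences just reflect that minPackSize is a minimum, not an exact pack size.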

For those, like me, with no previous git or go experience, here’s how I did it:


Get the master branch:
git clone https://github.com/restic/restic

From within the generated directory, get the desired pull request:
cd restic
git fetch https://github.com/restic/restic refs/pull/2606/head:pr2606
git checkout pr2606

Change minPackSize by altering line 39 in internal/repository/packer_manager.go to (in my case):
const minPackSize = 128 * 1024 * 1024

Then compile for my current system:
go run build.go
and for windows:
go run build.go --goos windows --goarch amd64
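The minPackSize edit above can also be scripted with sed instead of a manual edit. To stay runnable anywhere, this sketch patches a scratch stand-in file; in a real checkout you would point $f at internal/repository/packer_manager.go (the path and the 4 * 1024 * 1024 default match 0.9.6 and may move in later versions, so grep for the constant first):

```shell
# Patch the pack-size constant non-interactively on a scratch copy.
f=$(mktemp)
echo 'const minPackSize = 4 * 1024 * 1024' > "$f"   # stand-in for line 39

# Replace the 4 MB minimum with 128 MB (a .bak backup is kept):
sed -i.bak 's/minPackSize = 4 \* 1024 \* 1024/minPackSize = 128 * 1024 * 1024/' "$f"
grep 'minPackSize = 128' "$f"   # confirm the edit took effect
```

Scripting the change this way makes it easy to re-apply after pulling a fresh copy of the source.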

And here’s the impact on my NAS-based repo. The desired SMR-copying speedup is better than hoped: times are now comparable to the PMR reference drive. I suspect big prunes may turn out to be slower, but I’m happy with that trade-off.

| operation | 4 MB minPack | 128 MB minPack |
|---|---|---|
| backup (minimal changes) | 35 min | 35 min (0%) |
| empty prune (no changes) | 110 min | 27 min (-75%) |
| check 1% of repo | 90 min | 86 min (-4%) |
| rsync-to-USB (Seagate 1 TB SMR, from blank) | 54.2 hrs | 10.3 hrs (-81%) |
| rsync-to-USB (Seagate 1 TB SMR, no changes) | 88 sec | 7 sec (-92%) |
| rsync-to-USB (WD 8 TB, from blank) | 12 hrs | 10.4 hrs (-13%) |
| rsync-to-USB (WD 8 TB, no changes) | 84 sec | 4 sec (-95%) |
| directory | size before | size after | files before | files after |
|---|---|---|---|---|
| data | 816G | 823G* | 171,318 | 7,124 |
| index | 212M | 204M | 58 | 341 |
| keys | 4.0K | 4.0K | 1 | 1 |
| locks | 4.0K | 0 | 2 | 0 |
| snapshots | 1.4M | 1.4M | 341 | 341 |
| total | 817G | 824G* | 171,721 | 7,808 |

*The first prune on the newly-created repo (a restic copy of the original) found 31,220 duplicate blobs. That explains the increased repo size after copying (the sizes matched once the prune was complete). I can’t think why the duplicates should exist, so I wonder whether this is a bug in PR#2606 or a problem caused by the change in minPackSize.

If you have some free time, also try this improved prune implementation:


Using the current master branch should be rather safe, as commits that get merged have been thoroughly reviewed beforehand. For the various PRs it’s hard to make general statements. PRs that just make small output changes are rather safe, and usually so are PRs that have already received some positive feedback.

The copy functionality of PR #2606 is afaict safe to use. It’s just not optimized for performance yet. The prune changes in #2718 should be rather safe, but I haven’t had time to take a look at the latest revision of that PR. It should make the prune operation much faster, while also removing most of the performance difference of prune for 4 vs. 128 MB packs.

The duplicate blobs are in fact a bug in #2606, which sometimes fails to deduplicate new file chunks that are contained multiple times in a new snapshot (it won’t break your repository, just waste a bit of space). Thanks for noticing.

With your git commands you can only use a single PR, and you don’t get new features added to the master branch. To do that you could use the following commands:

# get the current master
git checkout master
git pull

# create integration branch
git checkout -b integration

# merge in other branches
git merge pr2606 ...

# build restic

If you encounter merge conflicts you can use git merge --abort to cancel the merge or try to fix them manually (which requires that you know what you’re doing). Resetting the branch to the current master is also possible using e.g. git reset --hard origin/master (Warning: This will also restore all modified files to their original content, without asking!).
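The workflow above can be rehearsed safely in a throwaway repository before touching the real restic checkout. A sketch that does exactly that in a scratch directory; the branch and file names (pr-demo, feature.txt) are invented for illustration:

```shell
# Scratch repo so nothing touches your real checkout.
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the unborn branch 'master'
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "base"

# Stand-in for a fetched PR branch such as pr2606:
git checkout -q -b pr-demo
echo "feature" > feature.txt
git add feature.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add feature"

# Integration branch on top of master, with the PR merged in:
git checkout -q master
git checkout -q -b integration
git -c user.name=demo -c user.email=demo@example.com merge -q --no-edit pr-demo
git log --oneline integration
```

Once this pattern feels comfortable, the same checkout/merge sequence applies to real PR branches fetched as described earlier in the thread.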


Many thanks @MichaelEischer - that was just the level of feedback I was hoping for. In general I wouldn’t trust myself to go into PR territory, but #2606 and a larger minPackSize have now saved the expense of dumping and replacing several painfully slow SMR drives.

@dionorgua: as a probable future beneficiary of #2718, many thanks for your work on it! However, for now I’m done tinkering with my backup system before I accidentally break anything :wink:

It’s not my work. I’m just testing it right now on my data…