Beyond release binaries: performance vs. data integrity?

Nev · May 20, 2020, 3:13am

Hi,
I recently compiled my own binary in order to add a couple of features that aren’t in 0.9.6. I’m very happy with the resulting performance improvement, but have no sense of whether such an approach could risk the soundness of my backups (there’s no point improving performance if you compromise your ability to restore!).

There’s probably no black-or-white answer, but I’d appreciate any opinions or advice from the many experienced people here: how safe in general do you feel it is for a relatively-competent-non-expert to be venturing beyond the release binaries? Is this something that you’d simply never advise with a restic-focussed backup solution? Or, in combination with things like restic check --read-data, do you think there is minimal increased risk?

Details of the changes and benchmarks follow if you’re interested, although this isn’t the main point of the question.

Starting from 0.9.6, I added:

Larger pack sizes (following @fd0’s instructions here). I understand the historic reasons for a 4 MB default, but suspected this was a bottleneck in my use case, particularly with SMR drives.
Copy functionality as per PR#2606.

For those, like me, with no previous Git or Go experience, here’s how I did it:

Get the master branch:
git clone https://github.com/restic/restic

From within the generated directory, get the desired pull request:
cd restic
git fetch https://github.com/restic/restic refs/pull/2606/head:pr2606
git checkout pr2606

Change the minPackSize, by altering line 39 in internal/repository/packer_manager.go to (in my case):
const minPackSize = 128 * 1024 * 1024

Then compile for my current system:
go run build.go
and for Windows:
go run build.go --goos windows --goarch amd64
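
Line numbers drift between revisions, so it may be safer to find the constant by name than to rely on line 39. The following is a minimal sketch, assuming the declaration keeps the `const minPackSize = ...` form shown above. `find_const` is a hypothetical helper written for this illustration, not part of restic or Go.

```shell
# Hypothetical helper: print "line:text" for the declaration of a Go constant.
# $1 = path to a Go source file, $2 = constant name
find_const() {
    grep -n "const $2 =" "$1"
}

# 128 * 1024 * 1024 expressed in bytes, i.e. the value the edited constant evaluates to.
pack_bytes=$((128 * 1024 * 1024))
echo "$pack_bytes"   # prints 134217728
```

From the root of the checkout this could be called as `find_const internal/repository/packer_manager.go minPackSize`, which shows the line to edit before rebuilding.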

And here’s the impact on my NAS-based repo. The desired SMR-copying speedup is better than hoped: times are now comparable to the PMR reference drive. I suspect big prunes may turn out to be slower, but I’m happy with that trade-off.

operation	4 MB minPack	128 MB minPack
backup (minimal changes)	35 min	35 min (0%)
empty prune (no changes)	110 min	27 min (-75%)
check 1% of repo	90 min	86 min (-4%)
rsync-to-USB (Seagate 1 TB SMR, from blank)	54.2 hrs	10.3 hrs (-81%)
rsync-to-USB (Seagate 1 TB SMR, no changes)	88 sec	7 sec (-92%)
rsync-to-USB (WD 8 TB, from blank)	12 hrs	10.4 hrs (-13%)
rsync-to-USB (WD 8 TB, no changes)	84 sec	4 sec (-95%)
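
The percentages in the timing table are the relative change from the 4 MB column to the 128 MB column. A small sketch for reproducing them, assuming `awk` is available; `pct_change` is a hypothetical helper name.

```shell
# Percentage change from an old to a new measurement, rounded to a whole percent.
# $1 = old value, $2 = new value (both in the same unit)
pct_change() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (after - before) / before * 100 }'
}

pct_change 110 27      # empty prune, minutes: prints -75
pct_change 54.2 10.3   # rsync to the SMR drive from blank, hours: prints -81
```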

directory	size before	size after	files before	files after
data	816G	823G*	171,318	7,124
index	212M	204M	58	341
keys	4.0K	4.0K	1	1
locks	4.0K	0	2	0
snapshots	1.4M	1.4M	341	341
total	817G	824G*	171,721	7,808

*The first prune on the newly created repo (a restic copy of the original) found 31,220 duplicate blobs. That explains the increased repo size after copying (the sizes matched once the prune was complete). I can’t think why the duplicates should exist, so I wonder whether this is a bug in PR#2606 or a problem caused by the change in minPackSize.
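
The effect of the larger minPackSize can also be read from the size table as a mean file size in the data directory. A rough sketch, assuming the "G" in the table means GiB (as `du -h` would report); the inputs are rounded, so the results are approximate. `avg_pack_mib` is a hypothetical helper name.

```shell
# Mean size in MiB of the files under data/, given the directory size in GiB
# and the number of files. Assumes the table's "G" means GiB.
avg_pack_mib() {
    awk -v gib="$1" -v files="$2" 'BEGIN { printf "%.1f\n", gib * 1024 / files }'
}

avg_pack_mib 816 171318   # before, 4 MB minPackSize: prints 4.9
avg_pack_mib 823 7124     # after, 128 MB minPackSize: prints 118.3
```

Note that the "after" figure uses the pre-prune size, which still includes the duplicate blobs mentioned above.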

dionorgua · May 20, 2020, 11:48am

If you’ve some free time, try also this improved prune implementation:

MichaelEischer · May 23, 2020, 2:31pm

Using the current master branch should be rather safe, as commits are thoroughly reviewed before they get merged. For all the various PRs it’s hard to make general statements. PRs that just make small output changes are rather safe, and so usually are PRs that have already received some positive feedback.

The copy functionality of PR #2606 is afaict safe to use. It’s just not optimized for performance yet. The prune changes in #2718 should be rather safe, but I haven’t had time to take a look at the latest revision of that PR. It should make the prune operation much faster, while also removing most of the performance difference of prune for 4 vs. 128 MB packs.

The duplicate blobs are in fact a bug in #2606, which sometimes fails to deduplicate new file chunks that are contained multiple times in a new snapshot (it won’t break your repository, just waste a bit of space). Thanks for noticing.

Using your git command you can only use a single PR and you don’t get new features added to the master branch. To do that you could use the following commands:

# get the current master
git checkout master
git pull

# create integration branch
git checkout -b integration

# merge in other branches
git merge pr2606 ...

# build restic
go run build.go

If you encounter merge conflicts you can use git merge --abort to cancel the merge or try to fix them manually (which requires that you know what you’re doing). Resetting the branch to the current master is also possible using e.g. git reset --hard origin/master (Warning: This will also restore all modified files to their original content, without asking!).
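
The fetch and merge steps above can be combined into one routine that handles several PRs. This is only a sketch under stated assumptions: it must run inside a clone of https://github.com/restic/restic, no branch named `integration` may exist yet, and `pr_refspec` and `build_integration` are hypothetical helper names, not git or restic commands.

```shell
# Refspec that fetches the head of pull request $1 into local branch pr$1.
pr_refspec() {
    echo "refs/pull/$1/head:pr$1"
}

# Fetch each listed PR and merge it into a fresh "integration" branch on top
# of the current master. Stops at the first failing step, e.g. a merge conflict.
build_integration() {
    git checkout master && git pull || return 1
    git checkout -b integration || return 1
    for pr in "$@"; do
        git fetch https://github.com/restic/restic "$(pr_refspec "$pr")" || return 1
        git merge "pr$pr" || return 1
    done
    go run build.go
}
```

A call such as `build_integration 2606 2718` would attempt both PRs mentioned in this thread. If a merge stops with conflicts, the branch is left in the conflicted state so that `git merge --abort` can still be used. A fetch may also be refused if a local `pr…` branch already exists and the PR has since been rebased.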

Nev · May 23, 2020, 8:02pm

Many thanks @MichaelEischer - that was just the level of feedback I was hoping for. In general I wouldn’t trust myself to go into PR territory, but #2606 and a larger minPackSize have now saved the expense of dumping and replacing several painfully slow SMR drives.

@dionorgua: as a probable future beneficiary of #2718, many thanks for your work on it! However, for now I’m done tinkering with my backup system, before I accidentally break anything.

dionorgua · May 24, 2020, 7:45am

It’s not my work. I’m just testing it right now on my data…

alexweiss · August 19, 2020, 7:36am

Just for information: IMO this bug is now fixed as

github.com/restic/restic

Fix non-intuitive repo behavior

restic:master ← aawsome:index-uploads+knownblobs

opened 07:21AM - 07 Jun 20 UTC

aawsome

+180 -253

What is the purpose of this change? What does it change?
---------------------------------------------------------

The current `restic.Repository` implementation has some non-intuitive behavior. While you can use the rather high-level method `SaveBlob` to save a blob, it does not check for already saved blobs by default. If you do the check yourself by using `repo.Index().Has()`, you encounter the problem that saved blobs in unfinished packs are not yet present in the index, and therefore `Has()` usually returns false for just-saved blobs. Moreover, full indexes are not saved by default but need to be manually saved by regularly calling `SaveFullIndex()`.

All these issues are at the moment solved separately in the `internal/archiver` codebase but will not work by default for new implementations. This PR moves the needed logic to `repository` (and the index implementation therein). Moreover, it implements a new method `SavePack` in `repository/master_index.go` which saves all index entries for a whole pack.

As a side effect it reduces the code base and simplifies the `archiver` logic. Note that most code changes in this PR are needed modifications in tests, e.g. to make tests generating duplicate blobs or counting the saved blobs still work.

Was the change discussed in an issue or in the forum before?
------------------------------------------------------------

See #2523 where the ideas are explained.
This PR is a prerequisite for #2606 (see discussion there).

Checklist
---------

- [x] I have read the [Contribution Guidelines](https://github.com/restic/restic/blob/master/CONTRIBUTING.md#providing-patches)
- [x] I have enabled [maintainer edits for this PR](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
- [x] I have changed all affected tests for all changes in this PR
- [ ] I have added documentation for the changes (in the manual)
- [x] There's a new file in `changelog/unreleased/` that describes the changes for our users (template [here](https://github.com/restic/restic/blob/master/changelog/TEMPLATE))
- [x] I have run `gofmt` on the code in all commits
- [x] All commit messages are formatted in the same style as [the other commits in the repo](https://github.com/restic/restic/blob/master/CONTRIBUTING.md#git-commits)
- [x] I'm done, this Pull Request is ready for review

has been merged.