Testing the new prune implementation on a 250 GB repo

I’ve been following the recent improvements to prune and wanted to test the new implementation in a real-world scenario. To do this I created two local (USB HDD) copies of a 250 GB repository used for daily backups, then ran the same forget, prune and check commands on them: one copy with restic 0.11.0, the other with restic built from a recent checkout of master that includes the new prune implementation. Here are the results.
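
For reference, the sequence I ran on each copy looked roughly like this (the repository path is a placeholder, and I’m showing the stock `restic` binary for both runs; the dev build was actually invoked as `./restic_darwin_amd64`, and the full commands and output are in the logs below):

```
export RESTIC_REPOSITORY=/Volumes/backup/repo-copy   # placeholder path to one of the USB HDD copies
restic check --read-data      # baseline: verify the copy is consistent
time restic prune             # "initial prune" timing in the table below
restic check --read-data
restic forget -l 10 -H 20 -d 30 -w 40 -m 50   # keep last 10, 20 hourly, 30 daily, 40 weekly, 50 monthly
time restic prune             # "prune after forget" timing
restic check --read-data
```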

Before getting to the results I just want to take a moment to thank all the contributors for the impressive efforts you (still) put into improving restic, making all our lives a bit safer every day. A huge THANK YOU to all of you. :love_letter:

Summary

Straight to the numbers:

| Action | restic 0.11.0 | restic 0.11.0-dev (new prune) |
| --- | --- | --- |
| Initial prune | 27:37 | 5:20 |
| Prune after forget | 1:01:05 | 12:17 |

In both prune runs the new implementation was faster by roughly a factor of five (27:37 vs. 5:20, and 1:01:05 vs. 12:17 wall-clock). This is definitely a game changer! :trophy:

After each prune run I did a check --read-data to validate the consistency of the repos. In all cases check returned “no errors were found”.

Suggestions

The only comment (besides praise) I have is that the UI is slightly inconsistent when running the new prune command. In particular, the trailing dots after loading all snapshots... and loading indexes... don’t seem to appear in any other command I’ve run. It’s even a bit inconsistent within the new prune output itself: deleting unreferenced packs has no dots. My suggestion would be to remove them. So yeah, I only have very minor complaints really. :slight_smile:

Logs

Below are the logs for each run, with some output truncated using (...). Note: some of the logs also get clipped by Discourse, so you have to scroll down within the code block to see all of the output. That threw me off for a moment.

restic 0.11.0

> restic version
restic 0.11.0 compiled with go1.15.3 on darwin/amd64
> restic check --read-data
(...)
147 additional files were found in the repo, which likely contain duplicate data.
You can run `restic prune` to correct this.
check snapshots, trees and blobs
read all data
[3:09:44] 100.00%  57043 / 57043 packs
no errors were found
> time restic prune
repository c69bac5b opened successfully, password is correct
counting files in repo
building new index for repo
[10:53] 100.00%  57190 / 57190 packs
incomplete pack file (will be removed): 437d57f4e293a5864f05d60517643724af99d112b1683ea9b7ecb8e68b933bdc
incomplete pack file (will be removed): 737074473f556faf1518a11176e14c223b646a0ebd1a40ddb29683e740afcd86
repository contains 57188 packs (2897659 blobs) with 254.976 GiB
processed 2897659 blobs: 1662 duplicate blobs, 463.567 MiB duplicate
load all snapshots
find data that is still in use for 256 snapshots
[5:02] 100.00%  256 / 256 snapshots
found 2894053 of 2897659 data blobs still in use, removing 3606 blobs
will remove 2 invalid files
will delete 10 packs and rewrite 257 packs, this frees 647.997 MiB
[0:49] 100.00%  257 / 257 packs rewritten
counting files in repo
[10:09] 100.00%  57041 / 57041 packs
finding old index files
saved new indexes as [41302cf5 02252aea 776080d2 5ca3accf 16449dda 1558c13a ab3de1bb 68ae8a36 9acfa47b b81895c4 3b575512 601e009a a7075361 8a0cfd8b 180b9c08 72d7de76 ebae24cd e7d17a1b c38d8319 9d5ee51a]
remove 52 old index files
[0:00] 100.00%  52 / 52 files deleted
remove 267 old packs
[0:11] 100.00%  267 / 267 files deleted
done
restic prune  282.14s user 151.69s system 26% cpu 27:37.72 total
> restic check --read-data
(...)
[3:08:52] 100.00%  57041 / 57041 packs
no errors were found
> restic forget -l 10 -H 20 -d 30 -w 40 -m 50
(...)
[0:00] 100.00%  152 / 152 files deleted
> time restic prune
repository c69bac5b opened successfully, password is correct
counting files in repo
building new index for repo
[9:33] 100.00%  57041 / 57041 packs
repository contains 57041 packs (2894053 blobs) with 254.349 GiB
processed 2894053 blobs: 0 duplicate blobs, 0 B duplicate
load all snapshots
find data that is still in use for 104 snapshots
[3:50] 100.00%  104 / 104 snapshots
found 2473103 of 2894053 data blobs still in use, removing 420950 blobs
will remove 0 invalid files
will delete 3867 packs and rewrite 10520 packs, this frees 35.304 GiB
[33:52] 100.00%  10520 / 10520 packs rewritten
counting files in repo
[10:38] 100.00%  48626 / 48626 packs
finding old index files
saved new indexes as [58d5d7d5 2bbf9473 bbea9893 3510c039 9a76ebdb db89c154 181ff4bd 858a3769 4721cb52 e2fc3160 32bb529b b1b215ff 3ecfbb30 a477f7de f1ff1287 8a69f6e9 eb3d8bb4]
remove 21 old index files
[0:00] 100.00%  21 / 21 files deleted
remove 14387 old packs
[2:33] 100.00%  14387 / 14387 files deleted
done
restic prune  727.51s user 291.32s system 27% cpu 1:01:05.76 total
> restic check --read-data
(...)
[2:46:14] 100.00%  48626 / 48626 packs
no errors were found

restic-0.11.0-dev (6822ce8)

> ./restic_darwin_amd64 version
restic 0.11.0-dev (compiled manually) compiled with go1.15.3 on darwin/amd64
> ./restic_darwin_amd64 check --read-data
(...)
147 additional files were found in the repo, which likely contain duplicate data.
You can run `restic prune` to correct this.
check snapshots, trees and blobs
read all data
[3:10:14] 100.00%  57043 / 57043 packs
no errors were found
> time ./restic_darwin_amd64 prune
repository c69bac5b opened successfully, password is correct
loading all snapshots...
loading indexes...
finding data that is still in use for 256 snapshots
[4:54] 100.00%  256 / 256 snapshots
searching used packs...
collecting packs for deletion and repacking
[0:02] 100.00%  57043 / 57043 packs processed

to repack:            0 blobs / 0 B
this removes          0 blobs / 0 B
to delete:            0 blobs / 635.047 MiB
total prune:          0 blobs / 635.047 MiB
remaining:      2894119 blobs / 254.355 GiB
unused size after prune: 8.009 MiB (0.00% of remaining size)

deleting unreferenced packs
[0:01] 100.00%  147 / 147 files deleted
done
./restic_darwin_amd64 prune  226.59s user 67.75s system 91% cpu 5:20.29 total
> ./restic_darwin_amd64 check --read-data
(...)
[3:11:25] 100.00%  57043 / 57043 packs
no errors were found
> ./restic_darwin_amd64 forget -l 10 -H 20 -d 30 -w 40 -m 50
(...)
[0:00] 100.00%  152 / 152 files deleted
> time ./restic_darwin_amd64 prune
repository c69bac5b opened successfully, password is correct
loading all snapshots...
loading indexes...
finding data that is still in use for 104 snapshots
[3:47] 100.00%  104 / 104 snapshots
searching used packs...
collecting packs for deletion and repacking
[0:02] 100.00%  57043 / 57043 packs processed

to repack:       102019 blobs / 8.160 GiB
this removes      87048 blobs / 7.492 GiB
to delete:       170583 blobs / 16.307 GiB
total prune:     257631 blobs / 23.799 GiB
remaining:      2636488 blobs / 230.556 GiB
unused size after prune: 11.527 GiB (5.00% of remaining size)

repacking packs
[6:43] 100.00%  1966 / 1966 packs repacked
rebuilding index
[0:07] 100.00%  51373 / 51373 packs processed
deleting obsolete index files
[0:00] 100.00%  51 / 51 files deleted
removing 5833 old packs
[1:17] 100.00%  5833 / 5833 files deleted
done
./restic_darwin_amd64 prune  248.72s user 87.25s system 45% cpu 12:17.28 total
> ./restic_darwin_amd64 check --read-data
(...)
[2:52:54] 100.00%  51373 / 51373 packs
no errors were found

Thank you very much for taking the time to test the new code! The difference is even bigger with remote repos: latency (even small latency) adds up to quite a lot! :slight_smile:


Hi, I gave the new prune implementation a try, and it does seem to be significantly faster. But it also seems to be less efficient.

I ran restic prune on a repo with ~280 GB of data:

$ restic version
restic 0.11.0 (v0.11.0-161-g04d856e6) compiled with go1.15.5 on linux/amd64
$ restic prune
[0:00] 100.00%  16 / 16 files deleted
16 snapshots have been removed, running prune
loading all snapshots...
loading indexes...
finding data that is still in use for 44 snapshots
[3:04] 100.00%  44 / 44 snapshots
searching used packs...
collecting packs for deletion and repacking
[0:08] 100.00%  61491 / 61491 packs processed

to repack:            0 blobs / 0 B
this removes          0 blobs / 0 B
to delete:        13848 blobs / 8.029 GiB
total prune:      13848 blobs / 8.029 GiB
remaining:      2536330 blobs / 279.746 GiB
unused size after prune: 4.960 GiB (1.77% of remaining size)

rebuilding index
[0:19] 100.00%  59773 / 59773 packs processed
deleting obsolete index files
[0:00] 100.00%  40 / 40 files deleted
removing 1718 old packs
[0:00] 100.00%  1718 / 1718 files deleted
done
$ du -hs home-and-data/
281G	home-and-data/

So the prune run took about 3 minutes. Really nice!

I then ran restic prune again, this time with v0.10.0:

$ /usr/bin/restic version
restic 0.10.0 compiled with go1.15.2 on linux/amd64

$ /usr/bin/restic prune
repository 14371fc4 opened successfully, password is correct
counting files in repo
building new index for repo
[15:26] 100.00%  59773 / 59773 packs
repository contains 59773 packs (2536330 blobs) with 279.836 GiB
processed 2536330 blobs: 0 duplicate blobs, 0 B duplicate
load all snapshots
find data that is still in use for 44 snapshots
[3:23] 100.00%  44 / 44 snapshots
found 2476481 of 2536330 data blobs still in use, removing 59849 blobs
will remove 0 invalid files
will delete 0 packs and rewrite 1753 packs, this frees 4.960 GiB
[6:16] 100.00%  1753 / 1753 packs rewritten
counting files in repo
[14:46] 100.00%  58683 / 58683 packs
finding old index files
saved new indexes as [c8d84d50 4f6857b5 ad294a75 672d2d9c 8858761b e6e6cafb aa4f363a bd57d736 5f2a3e4d 1cf3bf29 c70465e4 0cf73288 4a0d1262 4d8f26c5 c4ec4560 57b43fec 8c7c526f bd9ff162 0cf5341a 4dc7975e]
remove 52 old index files
[0:00] 100.00%  52 / 52 files deleted
remove 1753 old packs
[0:00] 100.00%  1753 / 1753 files deleted
done

$ du -hs home-and-data/
276G	home-and-data/

This took almost 40 minutes, but it managed to reduce the size of the repo by another 5 GB.
Is the new implementation not doing the repack step?

One part of the speed-up is that the new implementation can leave unused blobs in the repository instead of repacking them right away, which can greatly reduce the number of packs that need to be rewritten.
The default value of --max-unused is 5% of the repo size. In your example the unused space could be brought below that threshold just by deleting completely unused pack files (4.960 GiB unused is only 1.77% of the remaining 279.746 GiB), so no repacking was needed.

If you need to reduce the occupied space as much as possible, you can specify --max-unused=0. This will do the same repacking, but it should still be much faster than the old implementation thanks to other optimizations.
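
For example, sketched as shell invocations (the --max-unused flag is part of the new prune implementation; check `restic help prune` on a recent build for the exact accepted values):

```
# Default behaviour: tolerate up to 5% unused space in the repo,
# which keeps repacking (and therefore runtime) to a minimum.
restic prune

# Reclaim as much space as possible: repack until (almost) no
# unused space remains, at the cost of rewriting more pack files.
restic prune --max-unused=0
```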

Thanks for the explanation.