Is it okay to keep all snapshots forever?

Hi, I discovered restic just about a week ago and I really have to say it seems to be an awesome piece of software. After watching the video from the C4 talk, reading the user documentation and doing some first tests, I have created a B2 cloud storage account and set up scripts and cron jobs to perform fully automated backups every night. I have also tested restoring parts of the backup on another (virtual) machine that runs a different distro. In the course of this I ran into some minor issues, all of which I was able to solve.
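For context, the nightly job is roughly the following (bucket name, paths and the password file below are placeholders, not my real configuration):

    #!/bin/sh
    # /usr/local/sbin/restic-backup.sh - called from cron every night, e.g.:
    #   30 2 * * * root /usr/local/sbin/restic-backup.sh
    export B2_ACCOUNT_ID="<b2-account-id>"
    export B2_ACCOUNT_KEY="<b2-account-key>"
    export RESTIC_REPOSITORY="b2:my-backup-bucket:restic"
    export RESTIC_PASSWORD_FILE="/root/.restic-password"

    restic backup /home /etc /var/www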

But there still is one topic that I have completely ignored until now: I haven't taken the time to read about and understand the concept of removing snapshots with forget and prune. What I am wondering in this context is: as I currently have less than 200 GB of data that needs to be backed up and only small amounts of data are added or changed daily, do I really need to bother with deleting old snapshots? Or is it a completely reasonable and advisable approach to just keep all snapshots forever?

Should I expect to run into (performance?) issues, problems or restrictions if my restic repository contains, say, a four-digit number of snapshots in about three years? And would it be sufficient to start thinking about forget and prune only when

a) the expenses for B2 cloud storage exceed an acceptable level, or

b) the regular restic operations seem to be affected in some way by the large number of snapshots?

Many thanks in advance for any replies.

If you are satisfied with your storage costs and with the speed of restic operations, there is of course no need to run forget and prune :wink:

Lots of snapshots and lots of index files (you will most likely end up with both if you never run forget and prune) will, however, decrease your performance. So think about pruning once you are no longer satisfied with that performance!
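If you do reach that point, you don't have to delete everything at once; a retention policy keeps recent history dense and old history sparse. The numbers below are only an illustration, adjust them to your needs:

    # keep 7 daily, 8 weekly, 12 monthly and 3 yearly snapshots,
    # then remove the data that is no longer referenced
    restic forget --keep-daily 7 --keep-weekly 8 --keep-monthly 12 --keep-yearly 3 --prune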

A side note: There are some improvements for prune currently under development which will make prune work very well in your scenario. So it might be a good strategy to wait another 6-12 months until these have made it into a restic release.
There are also speed improvements in the unreleased master branch or still under development which may keep you satisfied without pruning for longer, so in any case it's worth taking a look at new releases :wink:


@alexweiss

Good to know!
Do you know whether, from the perspective of AWS S3 and its data transfer costs, those improvements will make “prune” use less data transfer?

As per post: Huge amount of data read from S3 backend

prune downloads every pack header to create a temporary index, crawls all snapshots (which means downloading every tree object that can be reached from any snapshot), downloads any blobs that are still used and exist in the same pack as an object to be deleted, re-uploads these blobs, deletes the old packs, then reindexes again (downloading every pack header a second time).

If you do this frequently, the traffic adds up pretty quickly.
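One way to keep that traffic down (a workaround, not a fix) is to decouple the two steps: forget on its own only deletes snapshot files and causes almost no traffic, while prune is the expensive part, so you can run it much less often:

    # nightly: drop old snapshots according to the policy, no repacking
    restic forget --keep-daily 7 --keep-weekly 8 --keep-monthly 12

    # e.g. once a month: actually rewrite packs and free the space
    restic prune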

Thank you very much!

The re-implementation I propose in

will drastically reduce the data transfer. Especially if you only have one system to back up from and choose prune options such that only a few packs need to be repacked, it will mainly use the cache and won't download or re-upload much.

In fact, I'm already using it (with a few other patches) with a repository on cold storage. The prune command then only saves a few new files (mainly index files) and removes files from the repository without accessing them.
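To give a rough idea of what I mean by prune options: my current branch has knobs along the lines of the following (flag names may still change before this lands in a release):

    # tolerate some unused space instead of repacking aggressively, and
    # only repack packs that end up in the local cache anyway
    restic prune --max-unused 10% --repack-cacheable-only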


Oh man… thanks for this!
This needs to be released as a Critical Emergency! :smiley:

Really appreciate that.
I subscribed to the discussion to stay updated.