Quicker interrupted backup resumption

onionjake · January 14, 2021, 6:06pm

Restic periodically saves indexes every 10 (15?) minutes during a backup so that you won’t have to reupload the data again if your backup gets interrupted. Unfortunately restic requires a snapshot in order to skip chunking the files again. On very large backups (1-8TB) this creates quite a problem in order to successfully finish a first backup, especially with lots of small files on slower drives, etc.

It has been proposed that restic save an incremental snapshot (Save an "interrupted" snapshot when backup is interrupted by SIGINT · Issue #2960 · restic/restic · GitHub) if interrupted, but ultimately what the user didn’t understand (see the last comment) was that all that is needed after a failed backup is to start the backup again. I am not necessarily proposing incremental snapshots as the solution to speed up interrupted backups.

Restic being able to skip over files (like it would if there had been a snapshot created) would help users get the first backup, especially if the cost of re-chunking all of the initial data is high enough and the internet connection is spotty enough that the user can’t re-chunk and get back to where the backup was before in time to make additional progress before the internet cuts out or gets interrupted for other reasons.

Something that maybe hasn’t been considered before is to store the necessary data for the ‘incomplete’ snapshot locally only (don’t upload it to the backend). That would keep the complexity out of the main repository, but still enable the next backup that is started to skip re-chunking files.

Hoping to have some discussion here before attempting a PR.

alexweiss · January 14, 2021, 7:41pm

This is already possible. Just create a local repository with identical chunking parameters, make your initial backup to this local repo and then run restic copy to copy this initial snapshot to your remote repo.

restic copy does not need to scan or (re-)chunk any local file, it simply copies each needed blob (i.e. chunks and the tree structure) that is not yet contained in the destination repo. If restic copy is cancelled at some point, all blobs that are saved (and contained in a saved index) will be skipped, if you restart it.

The only issue with restic copy is that the copying of blobs is not (yet) parallelized, see

EDIT: Of course it also works to use rclone or something similar which supports resuming uploads to directly upload the local repo to a remote destination.

odin · January 14, 2021, 8:25pm

This gets less and less practical as the data size increases. That said, I live with unscheduled power outages in winter (did you know high-voltage power transmission lines and massive blizzards don’t pair well?) and definitely agree that this approach is better.

I might think differently if Internet access cut off more often than the power, though.

onionjake · January 14, 2021, 8:52pm

This still requires chunking the files again if the backup is ever interrupted. This would reduce interruptions due to internet issues, but still doesn’t address the underlying problem.

What data does restic need to store about each file in order to confidently skip it for chunking? I assume the path, size, and mtime would be sufficient? This might even be able to be implemented by logging json output of files uploaded as the backup is happening (or perhaps at the index upload every 10 minutes?), redirecting that to a file, then passing that file as an argument to the next backup, something like --partial-backup-log-file.

alexweiss · January 14, 2021, 10:40pm

True. I thought this topic is about unreliable internet connections. The underlying problem is discussed in issue #2960…

This is basically the metadata restic saves in the tree blobs. It is path, size, various times, inode…
But to also create a new snapshot and trees which also contain these skipped files, it also needs the references to the file contents.

I was thinking about consistency problems in that case. restic does already check for the existence of the blobs needed for the file contents and re-creates them if they are missing (the backup “self-healing” feature). So in this case the only thing which might lead to a wrong result for the backup would be wrong file contents references in the log (but which exist in the repo). In that case, restic would just take the wrong contents.

Besides this - should be working. And of course has the advantage that it does also work when the internet connection suddenly stops.

If there is such a logging option, I would however make it optional which may lead to the problem that as you realize you need it to “resume” a backup, you most likely did not enable that needed option…

alexweiss · January 14, 2021, 10:57pm

I thought a bit about the comments of @odin and @onionjake. In fact a some kind of periodically written journal would be needed to properly resume a backup. But in fact, restic already has this kind of journal - it writes tree information to the repository and even updates the index regularly.

If a backup is aborted (by whatever reason), there are tree blobs which are saved and available. So why not simply use them? In fact, there are quite some problems:

These trees are never referenced and we need to first find them (this is similar to what the recover command is doing)
Even worse, there might be other unreferenced trees, e.g. produced by the new prune algorithm
As the trees are not referenced, there is no tree structure. So restic does not know which path to use. E.g. a tree containing files and (files in) subdirs a , dir1/b, dir2/b could be the files /home/user/a, /home/user/dir1/b, /home/user/dir2/b or the files /mnt/a, /mnt/dir1/b, /mnt/dir2/b. So there must be some trying…

So, my proposal would be:

Let the new prune remove all unneeded tree blobs
Add an option to backup which finds all unreferenced trees and then tries to “match” them. That would be an optional possibility which is to be used exclusively to speed up large resuming backups.

odin · January 14, 2021, 11:06pm

Just thinking into the blue here, but … wouldn’t it be possible to leave information about the trees restic was working with in the cache directory in the event of a crash, and check for (and validate) that data at the start of a backup?

(Edit: By which I obviously mean “do it always, so the files are there in the event of a crash”.)

onionjake · January 15, 2021, 12:19am

Kinda thinking along the same lines as @odin, you’d save a partial root tree periodically that knows about all the trees backed up so far? Then the next backup would load that tree and keep going from there. Then all the existing ‘skip’ logic works as is and if it is in a well known location you wouldn’t need to add any new flags.

alexweiss · January 15, 2021, 5:59pm

In principle I like your ideas.

The problem is that IMO saving root trees periodically is only possible with a re-engineering of the archiver code base which I doubt could be tackled in the near future.

The problem about using the cache is the data handling: How/when to remove this information in the cache? How to keep this data in sync with the repo - there might be many backup procceses writing to the repo. I would favor saving this information in the repo directly, but this would need an update of the repo format in some way…

alexweiss · January 15, 2021, 6:08pm

I added these PRs:

github.com/restic/restic

prune: Remove all unused trees

restic:master ← aawsome:prune-all-trees

opened 03:53PM - 15 Jan 21 UTC

aawsome

+4 -4

What does this PR change? What problem does it solve? -------------------------…---------------------------- With this PR `prune` removes all unused trees. It does so by marking all packs containing trees for repacking (except when the `--max-repack-size` option is given). Before, tree packs were priorized as data packs when it came to choosing packs for repacking. I however think that this wasn't a good decision because: - Tree packs are usually available in the cache, so repacking those is cheap. It is better to use the tolerated `--max-unused` for data blobs which are more expensive to access - Tree blobs are usually comparably small and this often leads to small pack files which should be preferably not kept but repacked. - It is very unlikely that a non-referenced tree can ever be re-used, so there is no sense to keep it. - Even in the case that a tree would be re-used by a new backup, it is much cheaper to resave it (as trees blobs are usually small) - Finally: Keeping unreferenced tree blobs may lead to unintended effect, e.g. in `recover`, see the discussion in #2880 Was the change discussed in an issue or in the forum before? ------------------------------------------------------------ No. Checklist --------- - [x] I have read the [Contribution Guidelines](https://github.com/restic/restic/blob/master/CONTRIBUTING.md#providing-patches) - [x] I have enabled [maintainer edits for this PR](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork) - I have not added tests for all changes in this PR - I have not added documentation for the changes (in the manual) - There's no new file in `changelog/unreleased/` that describes the changes for our users (template [here](https://github.com/restic/restic/blob/master/changelog/TEMPLATE)) - [x] I have run `gofmt` on the code in all commits - [x] All commit messages are formatted in the same style as [the other commits in the repo](https://github.com/restic/restic/blob/master/CONTRIBUTING.md#git-commits) - [x] I'm done, this Pull Request is ready for review

github.com/restic/restic

backup: Add --resume option

restic:master ← aawsome:backup-resume

opened 04:50PM - 15 Jan 21 UTC

aawsome

+321 -176

What does this PR change? What problem does it solve? -------------------------…---------------------------- Adds an option `--resume` to the backup command. When specified, this option does: - search for unreferences trees, like `recover`. These trees may exist due to an aborted backup - use these tree as additional "parent trees" for each dir to backup, i.e. do a brute-force check whether these unreferenced trees may contain a suitable parent tree for the dirs to backup => This should effectively speed up "resumed backups", i.e. repetitions of backups that aborted and therefore could not write snapshots. It should especially work well if already big parts were processed during that aborted backup run. There is however the trade-off that searching for unreferenced trees needs to read all trees and all snapshots in the repo. Note that this PR relies on #2880 and on the first commits of #3121 and is therefore marked as draft. Moreover, I strongly recommend to merge #3228 to not have too many unreferenced trees. The code should work already but I did not do any performance tests yet. Was the change discussed in an issue or in the forum before? ------------------------------------------------------------ closes #2280 alternative to #2960 https://forum.restic.net/t/quicker-interrupted-backup-resumption/3470/6 Checklist --------- - [x] I have read the [Contribution Guidelines](https://github.com/restic/restic/blob/master/CONTRIBUTING.md#providing-patches) - [x] I have enabled [maintainer edits for this PR](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork) - [ ] I have added tests for all changes in this PR - [ ] I have added documentation for the changes (in the manual) - [ ] There's a new file in `changelog/unreleased/` that describes the changes for our users (template [here](https://github.com/restic/restic/blob/master/changelog/TEMPLATE)) - [x] I have run `gofmt` on the code in all commits - [x] All commit messages are formatted in the same style as [the other commits in the repo](https://github.com/restic/restic/blob/master/CONTRIBUTING.md#git-commits) - [x] I'm done, this Pull Request is ready for review

Edit: #3229 is closed as it turned out that finding unreferenced trees is too expensive.

MichaelEischer · January 15, 2021, 7:18pm

Just to expand a bit on that: The archiver recursively descends into the directories to backup and on its way back assembles the trees. Saving a root tree would thus require the archiver to stop all processing so it can writea root tree. And then the archiver would have lost all intermediate state and has to start over again…

Once there’s data stored in a “cache” which is not available at the data source then it’s no longer a cache but rather an additional (temporary) data store which probably has all sorts of consequences.

odin · January 15, 2021, 7:44pm

Oh, but I wasn’t even thinking that far. All I was suggesting was that you could save yourself the ‘find unreferenced trees and match them’ step, by saving a log of trees uploaded to the cache directory that would a.) get left behind if the backup process is interrupted and b.) cleaned up when the command finishes.

I don’t know whether that does actually let you resume faster, but it gives an opportunity to solve all three of the issues you mentioned earlier, namely:

The trees aren’t referenced in the repository, but there would be local references. No searching needed.
There might be other unreferenced trees, but since you aren’t searching, you won’t find anything irrelevant.
You could store data relevant to the local status - such as the path being backed up - in the local cache, removing the guesswork.

If you’re not storing anything new to the repository, how could it be worse than currently? If the data that was uploaded has been pruned, well, you start by checking for it and if it’s not there, do what’s being done now. Am I missing something?

I was trying to suggest a log kept locally by the process writing to the repository that would be removed on success, which could enable a speed-up on resumption but in the absence of which everything should fall back to present behaviour. Wouldn’t that count as a cache?

I’m perfectly aware there may be reasons this is technically infeasible; I was just going off the assumption that the suggestion about using tree blobs already saved to the repository was a reasonable one.

alexweiss · January 16, 2021, 1:12pm

@odin I made a PoC using your ideas. See

punchcard · January 18, 2021, 4:30pm

The point of the “resuming” is to prevent the duplicate work. But how does the program know that the work is actually duplicate? Do edge cases make this a horrid monster? Should there be a maximum time period between the stop and restart? What happens if the user moves files or directories around between the stop and restart? Specifically what happens if a file or directory is moved from the non-backedup (directory, tree or what ever) group to the backed up group? Will a naive user believe they have a perfect backup but actually there are files or directories missing? A core principle of restic is easy to understand that there is a good backup and this option makes me shiver. I am a former database administrator so the complexity of trying to backup pieces and resuming becomes complex.

However I completely agree that the first backup of a repository takes a long time. A new annual, non C: drive, repo for me took 19 hours, version 0.11.0 windows. Incremental backups are 5 minutes. I would like to see resumable initial backup but would require that the backup be validated to ensure there is a full point in time backup.

alexweiss · January 18, 2021, 5:49pm

Note that there are no “inclemental” backups within restic. In your case you use restics “parent snapshot” functionality which speeds up the subsequent backup but does still save a snapshot referencing a full backup.

By using the parent snapshot, restic scans this already saved snapshot (which is a full backup) and compares file names, modification time, file size, change time and inode number. If all match (or you exclude some tests by extra flags), restic assumes the file is equal and will not read, chunk and hash it - which saves a lot of time.
If one of the tests fails, the file will be read and backuped again (with parts deduplicated if possible, of course). Also note that as additional check, restic checks that the needed file content really exist in the repository. If not, the file is treated as “modified”.

So if you trust this “change detection” algorithm, you can in principle use any parent snapshot you want and still get a “perfect” full backup. Choosing the “right” parent is only essential for performance, but not for the backup result.

Now, my proposed PR #3230 does exactly the same. But instead of using only one or more parent snapshots, it additional uses the information about saved trees from a previous backup if this was aborted - as for aborted backups there exists no snapshot.

onionjake · January 19, 2021, 4:01pm

Yes, loading any parent tree will necessarily not cause any issues with moved, changed directories/files. It also stands to reason that loading any arbitrary tree would not cause you to “miss” any files.

What isn’t clear to me is if the chunking mechanism is necessary to detect if the actual content is missing in the backend. Say for example your do half of your first backup, then run prune (which because there are no snapshots the data is all removed), then you load some cached tree/log from the first backup run. Leveraging the ‘cached’ tree will cause it to skip some files, but will restic notice they are missing in the backend? Would it require a ‘rebuild-index’ or a ‘check’ first? @odin touched on this point:

But what wasn’t clear to me is if the check @odin mentions would be a new one that would need to be added or if it is already one that restic does.

alexweiss · January 19, 2021, 5:46pm

It is not neccessary if you trust the “change detection” algorithm.

There are two cases here:

The tree blob is missing in the backend. In this case, this “parent” tree will simply be ignored, i.e. this directory is treated as “new” directory (all files are read and chunked)
Some data blobs are missing in the backend. If they are also missing in the index,
the according files are treated as “modified” and therefore will be read and chunked.

No extra rebuild-index, as prune does also rebuild the index. Of course, if your repo is corrupt, you may get “wrong” backups, e.g. if a needed chunk is contained in the index but in fact deleted. However, this is a general problem and not specific to using parent snapshots or trees from aborted former backup runs.

onionjake · January 19, 2021, 6:12pm

Thanks @alexweiss. Given what you have explained I’m pretty sure there isn’t any special edge cases dealing with cached data being ‘out-of-sync’ with the backend. Like you said, the “change detection” algorithm is already robust enough to handle any discrepancies (even if you have a stale cache or an outright incorrect cache).

There might still be some edge cases with the cache that need to be thought through, but when in doubt ignoring or removing the cached data is always a correct fallback.

I’ll take a closer look at your PR #3230 as it seems like the most promising approach.

alexweiss · January 19, 2021, 7:52pm

The algorithm fails if you have a modified file but file name, size, modification time, change time and inode number stays the same (however you may be able to achieve this…). Then restic will not save this modified file. This however also happens for backups that use parent snapshots.

Besides this I cannot think of an edge case that would go wrong. Either the the file to backup is found in the tree and matches or it is treated as new file…

upsidedown · January 20, 2022, 7:54am

I want to add that this (not being able to resume a backup) is specially troublesome when backing up files from cloud locations.
I have mounted a remote in rclone and was syncing between the mounted remote (which is a crypt Google Drive) to a new restic snapshot and my ISP had some problems and Internet went off for 30 or 40 minutes, because of this restic has to re-start from the beginning the backup and, if true that it wont re-upload the same files, it has to make checks and to do that it needs to download every single file from the mounted remote which takes a lot of time as the remote is big (12TB).