Quicker interrupted backup resumption

True. I thought this topic was about unreliable internet connections. The underlying problem is discussed in issue #2960

This is basically the metadata restic saves in the tree blobs: path, size, various timestamps, inode…
But to also create a new snapshot and trees which contain these skipped files, restic additionally needs the references to the file contents.

I was thinking about consistency problems in that case. restic already checks for the existence of the blobs needed for the file contents and re-creates them if they are missing (the backup “self-healing” feature). So the only thing which might lead to a wrong backup result would be wrong file content references in the log which nevertheless exist in the repo - in that case, restic would simply use the wrong contents.

Besides this, it should work. And of course it has the advantage that it also works when the internet connection suddenly drops.

If there is such a logging option, I would however make it optional - which leads to the problem that by the time you realize you need it to “resume” a backup, you most likely did not enable it…

I thought a bit about the comments of @odin and @onionjake. Some kind of periodically written journal would indeed be needed to properly resume a backup. But in fact, restic already has this kind of journal: it writes tree information to the repository and even updates the index regularly.

If a backup is aborted (for whatever reason), there are tree blobs which are already saved and available. So why not simply use them? In fact, there are quite a few problems:

  • These trees are never referenced and we need to first find them (this is similar to what the recover command is doing)
  • Even worse, there might be other unreferenced trees, e.g. produced by the new prune algorithm
  • As the trees are not referenced, there is no tree structure, so restic does not know which path they belong to. E.g. a tree containing the files and (files in) subdirs a, dir1/b, dir2/b could be the files /home/user/a, /home/user/dir1/b, /home/user/dir2/b or the files /mnt/a, /mnt/dir1/b, /mnt/dir2/b. So some trial and error is needed…

So, my proposal would be:

  1. Let the new prune remove all unneeded tree blobs
  2. Add an option to backup which finds all unreferenced trees and then tries to “match” them. That would be an optional feature, used exclusively to speed up resuming large backups (a rough sketch of the “find unreferenced trees” step follows below).
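
To make step 2 a bit more concrete, here is a rough sketch of what finding the unreferenced trees could look like. The types and methods (`Repo`, `AllTreeIDs`, …) are made-up stand-ins and not restic's actual internal API:

```go
package resume

// ID stands in for a blob ID; Repo is a hypothetical, heavily simplified
// view of the repository used only for this illustration.
type ID string

type Tree struct {
	Subtrees []ID // IDs of child tree blobs
}

type Repo interface {
	AllTreeIDs() []ID    // every tree blob listed in the index
	SnapshotTrees() []ID // root trees referenced by existing snapshots
	LoadTree(id ID) (*Tree, error)
}

// findUnreferencedTrees returns all tree IDs that are referenced neither by
// a snapshot nor by another tree - the candidate roots of aborted backups
// (similar to what the recover command walks through).
func findUnreferencedTrees(repo Repo) ([]ID, error) {
	referenced := make(map[ID]bool)
	for _, id := range repo.SnapshotTrees() {
		referenced[id] = true
	}
	// every tree has to be loaded once to collect its subtree references -
	// this is the expensive part of the whole idea
	for _, id := range repo.AllTreeIDs() {
		tree, err := repo.LoadTree(id)
		if err != nil {
			return nil, err
		}
		for _, sub := range tree.Subtrees {
			referenced[sub] = true
		}
	}
	var roots []ID
	for _, id := range repo.AllTreeIDs() {
		if !referenced[id] {
			roots = append(roots, id)
		}
	}
	return roots, nil
}
```

The candidate roots would then still have to be “matched” against the paths being backed up, as described in the list above.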

Just thinking out loud here, but … wouldn’t it be possible to leave information about the trees restic was working with in the cache directory in the event of a crash, and check for (and validate) that data at the start of a backup?

(Edit: By which I obviously mean “do it always, so the files are there in the event of a crash”.)

Kinda thinking along the same lines as @odin: you’d save a partial root tree periodically that knows about all the trees backed up so far? The next backup would load that tree and keep going from there. Then all the existing ‘skip’ logic works as-is, and if the tree is stored in a well-known location you wouldn’t need to add any new flags.

In principle I like your ideas.

The problem is that IMO saving root trees periodically is only possible with a re-engineering of the archiver code base, which I doubt could be tackled in the near future.

The problem with using the cache is the data handling: how and when to remove this information from the cache? How to keep it in sync with the repo - there might be many backup processes writing to the repo. I would favor saving this information in the repo directly, but that would need an update of the repo format in some way…

I added these PRs:

Edit: #3229 is closed as it turned out that finding unreferenced trees is too expensive.

Just to expand a bit on that: the archiver recursively descends into the directories to back up and assembles the trees on its way back. Saving a root tree would thus require the archiver to stop all processing so it can write a root tree, and then the archiver would have lost all intermediate state and would have to start over again…
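
As a toy illustration of that recursion (heavily simplified, with made-up types - this is not the real archiver code):

```go
package sketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// ID, Node and Tree are made-up stand-ins for this illustration.
type ID string

type Node struct {
	Name    string
	Subtree ID // set for directories, empty for files
}

type Tree struct{ Nodes []Node }

// saveTree stands in for writing a tree blob to the repository.
func saveTree(t Tree) ID {
	return ID(fmt.Sprintf("tree(%d nodes)", len(t.Nodes)))
}

// saveDir has to finish all children before it can assemble and save the
// tree for the directory itself. The partially built trees only exist on
// the call stack, so there is no point at which a consistent "root tree so
// far" could be written without stopping the whole recursion.
func saveDir(path string) (ID, error) {
	entries, err := os.ReadDir(path)
	if err != nil {
		return "", err
	}
	var tree Tree
	for _, e := range entries {
		if e.IsDir() {
			sub, err := saveDir(filepath.Join(path, e.Name()))
			if err != nil {
				return "", err
			}
			tree.Nodes = append(tree.Nodes, Node{Name: e.Name(), Subtree: sub})
		} else {
			// reading, chunking and saving the file contents would happen here
			tree.Nodes = append(tree.Nodes, Node{Name: e.Name()})
		}
	}
	// the tree for this directory is only known here, after all children are done
	return saveTree(tree), nil
}
```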

Once data is stored in a “cache” which is not available at the data source, it’s no longer a cache but rather an additional (temporary) data store, which probably has all sorts of consequences.

Oh, but I wasn’t even thinking that far. All I was suggesting was that you could save yourself the ‘find unreferenced trees and match them’ step by writing a log of uploaded trees to the cache directory, which would a) get left behind if the backup process is interrupted and b) be cleaned up when the command finishes.
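
Something like this minimal sketch is what I have in mind - the file name, location and format are all made up for illustration and not anything restic currently does:

```go
package journal

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Journal is a hypothetical local log: every saved tree is appended to a
// file in the cache directory. If the backup is interrupted, the file is
// left behind for the next run; on success it is removed.
type Journal struct {
	path string
	f    *os.File
}

func Open(cacheDir, repoID string) (*Journal, error) {
	p := filepath.Join(cacheDir, repoID, "backup-journal")
	if err := os.MkdirAll(filepath.Dir(p), 0o700); err != nil {
		return nil, err
	}
	f, err := os.OpenFile(p, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		return nil, err
	}
	return &Journal{path: p, f: f}, nil
}

// Add records that the tree for dir was saved under treeID.
func (j *Journal) Add(dir, treeID string) error {
	_, err := fmt.Fprintf(j.f, "%s %s\n", treeID, dir)
	return err
}

// Done removes the journal after a successful backup.
func (j *Journal) Done() error {
	j.f.Close()
	return os.Remove(j.path)
}

// Load returns the dir -> treeID entries left behind by an aborted run,
// or nil if there is nothing to resume.
func Load(cacheDir, repoID string) (map[string]string, error) {
	f, err := os.Open(filepath.Join(cacheDir, repoID, "backup-journal"))
	if os.IsNotExist(err) {
		return nil, nil // no journal: just do a normal backup
	}
	if err != nil {
		return nil, err
	}
	defer f.Close()

	entries := make(map[string]string)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if id, dir, ok := strings.Cut(sc.Text(), " "); ok {
			entries[dir] = id
		}
	}
	return entries, sc.Err()
}
```

At the start of the next backup, the journal (if present) could be validated against the repository and fed into the existing ‘skip’ logic; if in doubt, deleting it and doing a normal backup is always a safe fallback.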

I don’t know whether that actually lets you resume faster, but it gives an opportunity to solve all three of the issues you mentioned earlier, namely:

  • The trees aren’t referenced in the repository, but there would be local references. No searching needed.
  • There might be other unreferenced trees, but since you aren’t searching, you won’t find anything irrelevant.
  • You could store data relevant to the local status - such as the path being backed up - in the local cache, removing the guesswork.

If you’re not storing anything new in the repository, how could it be worse than the current situation? If the data that was uploaded has been pruned, well, you start by checking for it, and if it’s not there, do what’s being done now. Am I missing something?

I was trying to suggest a log kept locally by the process writing to the repository that would be removed on success, which could enable a speed-up on resumption but in the absence of which everything should fall back to present behaviour. Wouldn’t that count as a cache?

I’m perfectly aware there may be reasons this is technically infeasible; I was just going off the assumption that the suggestion about using tree blobs already saved to the repository was a reasonable one.

@odin I made a PoC using your ideas. See

The point of “resuming” is to prevent duplicate work. But how does the program know that the work is actually duplicate? Do edge cases make this a horrid monster? Should there be a maximum time period between the stop and the restart? What happens if the user moves files or directories around between the stop and the restart? Specifically, what happens if a file or directory is moved from the not-yet-backed-up group to the already-backed-up group? Will a naive user believe they have a perfect backup while files or directories are actually missing? A core principle of restic is that it is easy to be sure you have a good backup, and this option makes me shiver. I am a former database administrator, so I know how complex backing up in pieces and resuming can become.

However, I completely agree that the first backup of a repository takes a long time. A new annual repo (not the C: drive) took me 19 hours with version 0.11.0 on Windows; incremental backups take 5 minutes. I would like to see resumable initial backups, but would want the backup to be validated to ensure there is a full point-in-time backup.


Note that there are no “incremental” backups in restic. In your case you use restic’s “parent snapshot” functionality, which speeds up subsequent backups but still saves a snapshot referencing a full backup.

When using a parent snapshot, restic scans this already saved snapshot (which is a full backup) and compares file name, modification time, file size, change time and inode number. If all match (or you exclude some tests via extra flags), restic assumes the file is unchanged and will not read, chunk and hash it - which saves a lot of time.
If one of the tests fails, the file will be read and backed up again (with parts deduplicated where possible, of course). Also note that, as an additional check, restic verifies that the needed file contents really exist in the repository. If not, the file is treated as “modified”.
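
Conceptually, the per-file check is something like the following sketch (the field names are simplified and not restic’s actual node structure):

```go
package sketch

import "time"

// Node is a simplified stand-in for an entry in a parent tree.
type Node struct {
	Name       string
	Size       uint64
	ModTime    time.Time
	ChangeTime time.Time
	Inode      uint64
	Content    []string // IDs of the file's content blobs
}

// unchanged reports whether a file can be assumed identical to the version
// recorded in the parent tree: all metadata fields have to match. Some of
// these tests can be skipped via extra flags.
func unchanged(parent, current *Node) bool {
	if parent == nil {
		return false // not in the parent tree: new file, read and chunk it
	}
	return parent.Name == current.Name &&
		parent.Size == current.Size &&
		parent.ModTime.Equal(current.ModTime) &&
		parent.ChangeTime.Equal(current.ChangeTime) &&
		parent.Inode == current.Inode
}
```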

So if you trust this “change detection” algorithm, you can in principle use any parent snapshot you want and still get a “perfect” full backup. Choosing the “right” parent only matters for performance, not for the backup result.

Now, my proposed PR #3230 does exactly the same. But instead of using only one or more parent snapshots, it additionally uses the information about trees saved by a previous backup run that was aborted - since for aborted backups no snapshot exists.

Yes, loading any parent tree will not cause any issues with moved or changed directories/files. It also stands to reason that loading any arbitrary tree would not cause you to “miss” any files.

What isn’t clear to me is whether the chunking mechanism is necessary to detect that actual content is missing in the backend. Say, for example, you do half of your first backup, then run prune (which removes all the data because there are no snapshots), then you load some cached tree/log from the first backup run. Leveraging the ‘cached’ tree will cause restic to skip some files, but will it notice they are missing in the backend? Would it require a ‘rebuild-index’ or a ‘check’ first? @odin touched on this point:

But what wasn’t clear to me is whether the check @odin mentions would be a new one that needs to be added, or one that restic already does.

It is not necessary if you trust the “change detection” algorithm.

There are two cases here:

  1. The tree blob is missing in the backend. In this case, this “parent” tree will simply be ignored, i.e. the directory is treated as a “new” directory (all files are read and chunked).
  2. Some data blobs are missing in the backend. If they are also missing in the index, the corresponding files are treated as “modified” and will therefore be read and chunked.

No extra rebuild-index is needed, as prune also rebuilds the index. Of course, if your repo is corrupt, you may get “wrong” backups, e.g. if a needed chunk is contained in the index but has in fact been deleted. However, this is a general problem and not specific to using parent snapshots or trees from aborted backup runs.
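
As a rough sketch of these two fallback cases (made-up helper types, not restic’s internal API):

```go
package sketch

// ID, Node, Tree, Repo and Index are made-up stand-ins for illustration.
type ID string

type Node struct {
	Name    string
	Content []ID // content blobs of the file
}

type Tree struct{ Nodes []Node }

type Repo interface {
	LoadTree(id ID) (*Tree, error)
}

type Index interface {
	Has(id ID) bool
}

// parentTreeOrNil covers case 1: if the parent tree blob cannot be loaded,
// the parent is simply ignored and the whole directory is treated as new.
func parentTreeOrNil(repo Repo, id ID) *Tree {
	tree, err := repo.LoadTree(id)
	if err != nil {
		return nil
	}
	return tree
}

// contentAvailable covers case 2: if any content blob of a file is missing
// from the index, the file is treated as "modified" and will be read and
// chunked again.
func contentAvailable(idx Index, node Node) bool {
	for _, blob := range node.Content {
		if !idx.Has(blob) {
			return false
		}
	}
	return true
}
```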

Thanks @alexweiss. Given what you have explained, I’m pretty sure there aren’t any special edge cases when cached data is ‘out of sync’ with the backend. Like you said, the “change detection” algorithm is already robust enough to handle any discrepancies (even if you have a stale cache or an outright incorrect cache).

There might still be some edge cases with the cache that need to be thought through, but when in doubt ignoring or removing the cached data is always a correct fallback.

I’ll take a closer look at your PR #3230 as it seems like the most promising approach.

The algorithm fails if you have a modified file whose name, size, modification time, change time and inode number all stay the same (however you might manage to achieve that…). In that case restic will not save the modified file. This, however, also happens for backups that use parent snapshots.

Besides this I cannot think of an edge case that would go wrong. Either the file to back up is found in the tree and matches, or it is treated as a new file…

I want to add that this (not being able to resume a backup) is especially troublesome when backing up files from cloud locations.
I had mounted a remote in rclone (an encrypted Google Drive) and was backing it up to a new restic snapshot when my ISP had some problems and the Internet went down for 30 or 40 minutes. Because of this, restic had to restart the backup from the beginning, and while it’s true that it won’t re-upload the same files, it has to run its checks, and to do that it needs to download every single file from the mounted remote - which takes a lot of time as the remote is big (12 TB).

@upsidedown Feel free to try the PR backup: Add resuming from aborted backups by aawsome · Pull Request #3230 · restic/restic · GitHub and report if this helps you!


Could you provide a Windows build?

The best way is to get the source from GitHub - aawsome/restic at backup-resume (git clone or just download it), check the changes and then run
go run build.go to get the binary. This way you don’t have to trust someone (me, in this case) regarding the binary.

Nevertheless, I just sent you a link to the binary in a PM for convenience.


Thanks, I only found Docker instructions.