"Fast path" adding trees after interrupting backup

cdhowie · May 6, 2019, 8:28pm

To address the issue raised in this thread about a backup taking too long, it would be helpful if restic could reuse uploaded objects without having to re-hash everything – effectively using a parent snapshot where there isn’t one. This is actually potentially feasible.

We already have tree IDs as a hash of their contents, however this hash includes data hashes which are not known without hashing file contents.

What if trees had an “alternate ID,” which is a hash of a subset of this information: in lexicographical order, the following information about each sub-node is added to the hash:

Name
For files:
- Size
- Mtime
- Inode
- Owner/group
- Mode
For directories:
- The alternate ID for that directory.

For this to be fast, the alternate ID of each tree would have to be included in the index.

Then, when backing up, a new phase is added before the current archiver phase, possibly as part of scanning: we perform a depth-first crawl of the backup path(s) and compute the alternate ID of every directory on the way back up. If there is a match in the index, then that tree is already backed up. We can extract the tree’s actual ID using the alternate ID as a lookup and prune that entire tree from the backup.

This approach would have the potential to replace the current “parent” mechanism entirely by giving restic the ability to reuse trees from all other snapshots.

As with the current “parent” mechanism, a flag could be added to disable this enhancement.