What is the purpose of this feature request? What does it change?
The purpose of this feature request is to reduce the amount of data restic downloads from the repository during a restore. This would reduce the cost of restoring from repositories located on cloud storage that charges for downloads, and could reduce the restore time.
Problem description
As discussed here (What is restic restore even doing?), the amount of data downloaded from the repository by a restic restore is the size of the restored data, not the size of the deduplicated data. That is, to restore a 100 GiB data set that occupies 80 GiB in the repository (due to deduplication), a restic restore will download 100 GiB of data. I believe restic should be able to benefit from deduplication during the restore as well: restoring 100 GiB of data should require downloading only the 80 GiB of unique data.
Current state
I’m not a programmer, and in particular I’ve never even looked at Go before, but from what I could grasp from restorer.go, the way restic restore works is:
- Walk the snapshot.
- Find files that should be restored.
- For each file marked for restore, pull from the repository the data needed to restore that file.
Example
User requests restore of files a, b and c. Files a and c are identical.
Restic downloads repository file A1 and writes partial content to file a.
Restic downloads repository file A2 and writes partial content to file a.
Restic downloads repository file A3 and completes restore of file a.
Restic downloads repository file B1 and writes partial content to file b.
Restic downloads repository file B2 and writes partial content to file b.
Restic downloads repository file B3 and completes restore of file b.
Restic downloads repository file A1 and writes partial content to file c.
Restic downloads repository file A2 and writes partial content to file c.
Restic downloads repository file A3 and completes restore of file c.
Please note that the above could be patently wrong.
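To make the download count concrete, here is a minimal Go sketch of the per-file flow as I understand it from the example above. It is not restic's actual code: the `restoreFile` type and the `perFileDownloads` function are hypothetical names made up for illustration, and the repository files A1..B3 are just the labels from the example.

```go
package main

import "fmt"

// restoreFile is a hypothetical stand-in for a file selected for restore.
// packs lists, in order, the repository files holding its content.
type restoreFile struct {
	name  string
	packs []string
}

// perFileDownloads simulates restoring one file at a time: every repository
// file a restored file needs is downloaded again, even if an earlier file
// already required it. It returns the sequence of downloads.
func perFileDownloads(files []restoreFile) []string {
	var downloads []string
	for _, f := range files {
		for _, p := range f.packs {
			downloads = append(downloads, p)
		}
	}
	return downloads
}

func main() {
	// Files a and c are identical, so they reference the same repository files.
	files := []restoreFile{
		{name: "a", packs: []string{"A1", "A2", "A3"}},
		{name: "b", packs: []string{"B1", "B2", "B3"}},
		{name: "c", packs: []string{"A1", "A2", "A3"}},
	}
	d := perFileDownloads(files)
	fmt.Println(len(d), d) // 9 [A1 A2 A3 B1 B2 B3 A1 A2 A3]
}
```

Nine downloads are made although only six distinct repository files exist; the three repeated downloads are the overhead this request is about.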
Proposed approach
What I envision is a modified process. A new restore option would make restic restore:
- Walk the snapshot.
- Find files that should be restored.
- Identify the set of repository files it needs to restore all required files.
- Pull (preferably in parallel) the repository files needed for the restore, and write all data from each repository file to the proper locations in every restored file that uses it. This seems clear as mud, I’m afraid, but bear with me and see the example.
Example
User requests restore of files a, b and c. Files a and c are identical.
Restic restore finds that it requires files A1, A2, A3, B1, B2 and B3 to restore the data.
Restic downloads repository file A1 and writes partial content to files a and c.
Restic downloads repository file A2 and writes partial content to files a and c.
Restic downloads repository file A3 and completes restore of files a and c.
Restic downloads repository file B1 and writes partial content to file b.
Restic downloads repository file B2 and writes partial content to file b.
Restic downloads repository file B3 and completes restore of file b.
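The proposed flow could be sketched in Go roughly as follows. This is only an illustration under my own assumptions, not a design for restic: `restoreFile`, `target` and `planByPack` are hypothetical names, and a real implementation would work with byte offsets and lengths rather than a simple segment index.

```go
package main

import "fmt"

// restoreFile is a hypothetical stand-in for a file selected for restore.
// packs lists, in order, the repository files holding its content.
type restoreFile struct {
	name  string
	packs []string
}

// target says where one piece of a downloaded repository file must be written.
type target struct {
	file    string
	segment int // position of the piece within the restored file
}

// planByPack inverts the "file -> repository files" mapping. It returns the
// repository files in first-seen order and, for each of them, every place
// its content has to be written.
func planByPack(files []restoreFile) ([]string, map[string][]target) {
	var order []string
	targets := make(map[string][]target)
	for _, f := range files {
		for i, p := range f.packs {
			if _, seen := targets[p]; !seen {
				order = append(order, p)
			}
			targets[p] = append(targets[p], target{file: f.name, segment: i})
		}
	}
	return order, targets
}

func main() {
	files := []restoreFile{
		{name: "a", packs: []string{"A1", "A2", "A3"}},
		{name: "b", packs: []string{"B1", "B2", "B3"}},
		{name: "c", packs: []string{"A1", "A2", "A3"}},
	}
	order, targets := planByPack(files)
	for _, p := range order {
		// One download per repository file, then fan out to all targets.
		fmt.Printf("download %s ->", p)
		for _, t := range targets[p] {
			fmt.Printf(" %s[%d]", t.file, t.segment)
		}
		fmt.Println()
	}
	fmt.Println("downloads:", len(order)) // downloads: 6
}
```

Each repository file is fetched once and its content is written to every file that needs it, so the example needs six downloads. The `targets` map has to be built and kept in memory before any data is pulled, which is presumably where a larger memory footprint would come from.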
Benefit / Motivation for the change
Reduced amount of data downloaded from the repository to complete a restore. This would definitely reduce the restore cost for repositories that charge for downloads (e.g. cloud storage). It might also reduce the restore time, because each piece of data required for the restore would only need to be downloaded once.
Cost
Additional effort to implement the modified restore function.
Probably increased memory footprint.