Support parallelism when backing up very large files

leoluan · July 9, 2021, 1:20am

From a quick look at the code, I gather that Restic does not use multiple threads for a single file during backup. This is fine when backing up a large number of files, as long as the largest file(s) and the smaller files take a comparable amount of time to process. But if the tree being backed up contains just one very large file and only a few smaller ones, the arch.Options.FileReadConcurrency parameter is not enough to keep the backup process parallel: one file-read thread keeps processing the very large file long after the other threads have finished the smaller files.

Supporting this use case better would make Restic useful and performant for backing up large virtual disk files (terabytes in size) used by virtual machines or containers. Other Restic features such as deduplication, compression, encryption, snapshots, and cloud backend support are applicable and valuable for VMs and containers, so supporting large virtual disk files well would help Restic extend its reach in these environments.

@fd0 Would such an extension/contribution be welcome in Restic?

One way to do this may be to extend Restic so that a large file can be broken into composite files, with multiple backup threads processing the same file concurrently. Any insights on existing discussions or work, better ways to do it, or warnings about challenges and issues to watch for are welcome.