ETA is wrong when resuming an upload

Hi,

If I resume an upload (say 20GB into a 40GB transfer) then the ETA is way too optimistic. For example, it assumes I can upload the remaining 20GB in 5 minutes instead of the 10+ hours it’ll take. The ETA keeps on increasing for hours until eventually near the end it drops to zero.

Is this a known bug? I couldn’t find an open bug report on GitHub.

Version v0.11.0-207-gf7c7c2f7
b2 backend
Windows 10

Yes, this is known, and there is not much that can be done about it: an accurate estimate would require simulating the full backup run beforehand. However, that would require reading everything twice…

The case you’ve described is pretty much the most complex one to get right: restic has to read the whole 40GB again, but as the first 20GB already exists in the repository, it is able to deduplicate that data without uploading anything. Then the second half of the data suddenly takes a lot of time to upload.

There are ideas to let restic create “partial” snapshots for interrupted uploads: see issue #2960 (Save an “interrupted” snapshot when backup is interrupted by SIGINT) and issue #3118 (support for merged backup parents). These would allow restic to resume approximately where it was interrupted before, which could (with additional work) help to improve the ETA.

It sounds to me like you are trying to solve a much more complex problem than what I have in mind. Let’s focus exclusively on the ETA calculation for the moment (forget about how to save interrupted chunks). What prevents you from doing the following?

  1. Calculate a running average of the data processing rate, excluding any data that is not uploaded as part of the current session. Meaning, if I am 20GB into a 40GB upload but the first 20GB was skipped because it was already uploaded, then it would not be included in my running average.
  2. Divide the amount of data left to process by the processing rate from step 1 to derive an ETA.
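In Python-flavored pseudocode, the two steps might look roughly like this (all names are made up for illustration; this is not restic’s code):

```python
import time

class SessionEta:
    """ETA based only on bytes actually uploaded this session."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock          # injectable clock for testing
        self.start = clock()
        self.uploaded = 0           # bytes uploaded this session only

    def record_upload(self, nbytes):
        # Skipped/deduplicated bytes are deliberately never recorded.
        self.uploaded += nbytes

    def eta_seconds(self, bytes_left):
        elapsed = self.clock() - self.start
        if self.uploaded == 0 or elapsed <= 0:
            return None             # step 1: no session uploads yet, rate unknown
        rate = self.uploaded / elapsed      # step 1: session-only upload rate
        return bytes_left / rate            # step 2: remaining bytes / rate
```

For example, with 50 bytes uploaded in 10 seconds, `eta_seconds(100)` returns 20 seconds.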

Thoughts?

At the start of a backup run (or rather, after a short time), no matter whether it was interrupted previously or not, restic can only determine that it has to back up data amounting to 40GB. Before processing the data it doesn’t know whether the data can be deduplicated and thus doesn’t need to be uploaded. So in the beginning the data processing rate is as fast as your disk can read, and it drops dramatically once new file chunks have to be uploaded.

Currently the estimate is calculated based on the complete backup run. It would be possible to only consider the processing rate of the last few minutes. However, for large files with just small changes (e.g. VM images) that will still lead to completely off ETAs: the file seems to have changed and thus has to be included in the ETA calculation, but then the deduplication is able to avoid uploading nearly every part of the file, which leads to large jumps in the ETA.
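A minimal sketch of such a “last few minutes” window (hypothetical helper, not restic’s code) shows why the jumps happen: the rate is whatever bytes passed through the window, and that collapses the moment deduplication stops working.

```python
from collections import deque

class WindowedRate:
    """Bytes-per-second rate over a trailing time window."""

    def __init__(self, window=300.0):   # e.g. a 5-minute window
        self.window = window
        self.samples = deque()          # (timestamp, bytes) pairs

    def record(self, t, nbytes):
        self.samples.append((t, nbytes))
        # Evict samples older than the window.
        while self.samples and self.samples[0][0] < t - self.window:
            self.samples.popleft()

    def rate(self):
        if len(self.samples) < 2:
            return 0.0
        total = sum(b for _, b in self.samples)
        span = self.samples[-1][0] - self.samples[0][0]
        return total / span if span else 0.0
```

If the window is filled with deduplicated chunks processed at disk speed (~100MB/s) and then a stretch of chunks that must be uploaded at ~1MB/s, the windowed rate, and thus the ETA, swings by two orders of magnitude.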

It still sounds like you are including non-uploaded data in your “processing rate” definition. I am expecting the ETA calculation to only take uploaded data into consideration.

Case 1: You are processing a 40GB backup, of which the first 20GB was already uploaded… I am expecting the processing rate to be zero at the 20GB mark because we just began to upload.
Case 2: You are processing a large VM image but only a small portion needs to be uploaded due to deduplication… I am expecting only the small portion to be included in the processing rate calculation.

If for whatever reason you are unable to differentiate between uploaded vs skipped data, then you could limit the running average to only consider the past 5 minutes of data as you mentioned. Within 5 minutes we should start seeing an accurate ETA.

The problem is that restic can spend quite some time processing files without uploading anything. In the case of a 50GB VM image with essentially no changes and an HDD that can read 100MB/s, it’s rather simple to arrive at a scenario where no data is uploaded for 5 minutes and then the ETA is broken. Dividing the bytes still to be processed (for which restic cannot yet know how much will be deduplicated) by the upload rate will result in a permanent overestimation.
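The arithmetic for that scenario (using the figures from the discussion above):

```python
# 50 GB VM image, read from an HDD at 100 MB/s, essentially unchanged:
image_bytes = 50 * 1000 ** 3        # 50 GB to process
read_rate   = 100 * 1000 ** 2       # 100 MB/s disk read speed

scan_seconds = image_bytes / read_rate
# 500 seconds of pure processing with zero bytes uploaded --
# longer than a 5-minute (300 s) averaging window, so the window
# would contain no upload samples at all.
```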

restic only knows how much data it would have to upload right at the end of a backup run and not a moment before. It is just not possible to accurately estimate beforehand how much data still has to be uploaded, as the deduplication only happens while processing the backup data.

Take for example the 50GB VM image: the first 20GB could deduplicate perfectly, then there’s 1GB with changes, which is uploaded at 1MB/s. This gives us 30GB remaining at 1MB/s ≈ 8 hours ETA. However, the next 29GB can also be deduplicated, which results in 1000 seconds of upload time and 500 seconds of processing ≈ 0.42 hours. In that case the 8-hour ETA was a rather bad guess.
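Working through those numbers:

```python
GB, MB = 1000 ** 3, 1000 ** 2

# Naive estimate at the 20 GB mark: assume all 30 GB remaining
# must be uploaded at the observed 1 MB/s upload rate.
naive_eta_hours = (30 * GB / (1 * MB)) / 3600   # ~8.3 hours

# What actually happens: only the 1 GB of changes is uploaded
# (1000 s at 1 MB/s); the other 29 GB deduplicates and only
# needs ~500 s of local processing.
actual_hours = (1000 + 500) / 3600              # ~0.42 hours
```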

Makes sense. So from the sound of it, you don’t process pending data for deduplication far ahead of the network upload. In the case where hard-drive speed vastly outstrips network speed, is there any harm in continuing to process all remaining data, so we know what will end up getting deduplicated and can provide a more meaningful ETA? I assume this will take more memory, but I don’t know whether it’s a meaningful amount.

The problem with deduplicating data way ahead of the upload is that restic would have to store temporary files with the deduplicated data somewhere. And in the worst case these temporary files could be as large as the data to back up. Thus that isn’t really an option.

I was envisioning a two-pass system where the first pass calculates the hash of each file or block that gets deduplicated and the second pass actually sends it. You’d be able to detect duplicates and keep a running total of upcoming bytes to send, without actually storing the deduplicated content.
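A toy sketch of that first pass (hypothetical; it ignores restic’s real chunker, hash, and repository index): read every chunk once, hash it, and tally only the bytes whose hashes are not yet in the repository. No chunk content is stored, so the result is just an exact “bytes left to upload” figure for the ETA.

```python
import hashlib

def bytes_to_upload(chunks, repo_hashes):
    """Pass 1: hash each chunk; count bytes missing from the repo."""
    pending = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in repo_hashes:       # not deduplicable
            pending += len(chunk)           # will need uploading in pass 2
    return pending
```

For instance, if the repository already contains the hash of `b"aaaa"`, then `bytes_to_upload([b"aaaa", b"bbbb"], repo)` returns 4: only the second chunk would be uploaded. The catch, as noted below, is the second full read of all the data.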

I’m probably unaware of implementation details that prevent this from working. I just wanted to run this possibility by you in case you hadn’t considered it already. Thank you for taking the time to discuss this.

Kind regards,
Gili

At the cost of reading the data to be backed up twice.

That seems like an expensive thing to do just to make an estimate more accurate.
