Uploading details

Hi, can someone please explain how data is uploaded, especially to a cloud backend?

I read the Docs/References section and understand only some of it.

Suppose I have a million files, each 2KB in size. Does restic do a million transfers? With borg, for example, I think the files are grouped somehow and uploaded 5MB at a time for network efficiency.

And if I have 3 files, each 200MB, what happens then?

thanks

The design section in the docs should hopefully clear things up for you: References — restic 0.12.0 documentation (direct link)

Backups and Deduplication

For creating a backup, restic scans the source directory for all files, sub-directories and other entries. The data from each file is split into variable length Blobs cut at offsets defined by a sliding window of 64 bytes. The implementation uses Rabin Fingerprints for implementing this Content Defined Chunking (CDC). An irreducible polynomial is selected at random and saved in the file config when a repository is initialized, so that watermark attacks are much harder.

Files smaller than 512 KiB are not split; Blobs are of 512 KiB to 8 MiB in size. The implementation aims for 1 MiB Blob size on average.

For modified files, only modified Blobs have to be saved in a subsequent backup. This even works if bytes are inserted or removed at arbitrary positions within the file.
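If it helps to see the mechanism, here's a toy Go sketch of content-defined chunking. To be clear, this is my own illustration: it uses a simple Buzhash-style rolling hash with made-up constants, not restic's actual Rabin fingerprint with the per-repository random polynomial, so only the overall shape matches the design doc:

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

const (
	windowSize = 64            // sliding window size from the design doc
	minBlob    = 512 << 10     // 512 KiB: files below this stay a single blob
	maxBlob    = 8 << 20       // 8 MiB hard upper bound per blob
	splitMask  = (1 << 19) - 1 // cut where hash&mask == 0; roughly 1 MiB average
)

// Per-byte random values for the rolling hash. Restic instead derives its
// cut points from a Rabin fingerprint over a random irreducible polynomial
// stored in the repository config.
var table [256]uint64

func init() {
	rng := rand.New(rand.NewSource(1))
	for i := range table {
		table[i] = rng.Uint64()
	}
}

// chunk splits data into content-defined blobs between minBlob and maxBlob.
func chunk(data []byte) [][]byte {
	if len(data) < minBlob {
		return [][]byte{data} // small files are not split
	}
	var blobs [][]byte
	start := 0
	var h uint64
	for i, b := range data {
		h = bits.RotateLeft64(h, 1) ^ table[b]
		if i-start >= windowSize {
			// Remove the byte leaving the 64-byte window. Rotating a
			// 64-bit word by 64 is the identity, so XOR it out directly.
			h ^= table[data[i-windowSize]]
		}
		size := i - start + 1
		if (size >= minBlob && h&splitMask == 0) || size >= maxBlob {
			blobs = append(blobs, data[start:i+1])
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		blobs = append(blobs, data[start:])
	}
	return blobs
}

func main() {
	data := make([]byte, 10<<20) // 10 MiB of pseudo-random input
	rand.New(rand.NewSource(2)).Read(data)
	for i, blob := range chunk(data) {
		fmt.Printf("blob %d: %d bytes\n", i, len(blob))
	}
}
```

The key property is that cut points depend only on the bytes inside the 64-byte window, so inserting or removing data only changes the blobs around the edit; everything after the next unchanged cut point produces the same blobs as before.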

Or have you read that already and still have questions?

Big files are split into parts (blobs) of 512 KiB to 8 MiB. Small files are not split; each one becomes its own blob.
The blobs are combined into so-called pack files with a maximum size of about 8 MiB, which are stored in the repository.
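Here's a hedged sketch of that packing step (my simplification; restic's real packer also adds per-blob headers, encryption and index files, so the numbers are approximate):

```go
package main

import "fmt"

const maxPackSize = 8 << 20 // ~8 MiB per pack file, as described above

// packCount simulates filling pack files with blobs of the given sizes
// and returns how many packs (i.e. uploads) result.
func packCount(blobSizes []int) int {
	packs, current := 0, 0
	for _, size := range blobSizes {
		if current+size > maxPackSize && current > 0 {
			packs++ // pack is full: it gets uploaded as one file
			current = 0
		}
		current += size
	}
	if current > 0 {
		packs++ // the final, possibly partial pack is uploaded too
	}
	return packs
}

func main() {
	// one million 2 KiB files, each small enough to be its own blob
	blobs := make([]int, 1_000_000)
	for i := range blobs {
		blobs[i] = 2 << 10
	}
	fmt.Printf("%d blobs -> %d pack uploads\n", len(blobs), packCount(blobs))
	// prints: 1000000 blobs -> 245 pack uploads
}
```

So the million tiny files don't travel one by one; they're batched into a few hundred pack files.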


Basically: the design doc spells it out completely and should be a good start :slight_smile:

I did read that whole thing, as I mentioned :slight_smile: But I am not so educated in the jargon. So it doesn’t clearly (to me) answer my questions:

  1. Suppose I have a million files, each one 2KB in size. Does restic do a million transfers or ops to the cloud backend? Or are they combined as busybee says into max 8MiB pack files before upload?

  2. And if I have 3 files, each 200MB, how many transfers/ops over the network? Would that be about 200/8 ≈ 25 upload ops per file?

I ask because each op takes time, and with some backends there is a price for upload operations.

  1. One million transfers, if the files are unique.
  2. Yes.

This is only correct if each file is backed up by its own restic backup run. (In that case, the “remaining” partially filled packs are also saved even though they are not yet full.)

@busybee is right here: in this case, each file equals a blob. Within a backup run, each unique blob is added to a pack until the pack is full at around 8MB, i.e. in your case a pack contains around 4000 blobs/files. Then the pack file is uploaded to the cloud storage.
With your 1 million files of 2KB each, you'll get around 250 pack files, i.e. around 250 transfers - or fewer if many of the files are identical.
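For a back-of-the-envelope cost estimate, ignoring the handful of extra uploads for tree blobs, index files and the snapshot file, the arithmetic for both scenarios looks like this:

```go
package main

import "fmt"

func main() {
	const packSize = 8 << 20 // ≈8 MiB pack files

	// scenario 1: one million unique 2 KiB files
	total1 := 1_000_000 * (2 << 10)
	fmt.Println((total1+packSize-1)/packSize, "uploads") // 245

	// scenario 2: three unique 200 MB files, chunked into blobs and packed
	total2 := 3 * 200_000_000
	fmt.Println((total2+packSize-1)/packSize, "uploads") // 72
}
```

Scenario 2 comes out at roughly 24 uploads per file, close to the 200/8 ≈ 25 estimate above.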

Note that there are also experimental PRs which allow increasing the pack size; these would further reduce the number of uploads.
