Huge amount of data read from S3 backend

Hi.

I am using restic 0.9.6 with S3 as a backend. I have around 69GB of data, which seldom changes, so it takes around 70GB in the repository. I back up every 3 hours (8 backups/day).

I noticed S3 is charging me a LOT of money for “data transfer out”. Based on my calculations, that’s more than 600GB/day.

It seems it may be forget --prune that is using so much bandwidth.

Is that normal behaviour?

PS: This feature would be awesome to have: https://github.com/restic/restic/issues/2239
PPS: Apart from that, restic is awesome, thank you!

Hey @vfauth and welcome to the restic community forum! :slight_smile:

The restic prune and restic forget --prune commands are expensive in terms of how they handle and transfer data, even on local storage - but they shouldn’t be that expensive.

It would be interesting to see the restic output from both a backup run and the forget command you apply afterwards. If you could post the exact commands you’re using, that would be great - it could give us some indication of what’s going wrong.
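
For reference, a typical pair of commands looks something like this (the bucket name, paths, and retention policy here are placeholders, not your actual setup):

    # Backup run, e.g. every 3 hours via cron:
    restic -r s3:s3.amazonaws.com/my-bucket backup /data

    # Retention policy plus prune afterwards:
    restic -r s3:s3.amazonaws.com/my-bucket forget --keep-hourly 24 --keep-daily 7 --prune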

Yes. prune does the following:

  1. Downloads every pack header to create a temporary index.
  2. Crawls all snapshots, which means downloading every tree object that can be reached from any snapshot.
  3. Downloads any blobs that are still used and exist in the same pack as an object to be deleted.
  4. Re-uploads these blobs.
  5. Deletes the old packs.
  6. Reindexes again, downloading every pack header a second time.

If you do this frequently, the traffic adds up pretty quickly.
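
As a rough back-of-envelope (assuming a forget --prune runs after each of your 8 daily backups, and that each run ends up re-reading on the order of the full ~70GB repository - you didn’t say how often you prune): 8 runs × ~70GB ≈ 560GB/day, which is right in the ballpark of the ~600GB/day you’re seeing.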

I often recommend that restic users consider backing up to a local storage system, then synchronizing that repository to S3 as a second step (using aws s3 sync or rclone) - see the sketch after the list below.

This has several benefits:

  • It optimizes the data transfer to/from AWS
  • Many operations (e.g. restore, prune, forget) are much faster and less expensive
  • It provides you with BOTH a local and remote copy of your data (supporting the 3-2-1 backup strategy)
  • It allows you to perform prunes/forgets without requiring any reading of the cloud repository. This, in turn, means that the remote repository can live in low-cost/high-latency storage classes like Glacier or Glacier Deep Archive.
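
A minimal sketch of that workflow (the local repository path, bucket name, and retention policy are placeholders):

    # 1. Back up to a local repository:
    restic -r /srv/restic-repo backup /data

    # 2. Apply retention and prune locally, with no cloud traffic involved:
    restic -r /srv/restic-repo forget --keep-daily 7 --keep-weekly 4 --prune

    # 3. Mirror the local repository to S3; --delete also removes objects
    #    that prune deleted locally, so the bucket doesn't grow forever:
    aws s3 sync /srv/restic-repo s3://my-backup-bucket/restic-repo --delete

rclone sync works just as well for the last step.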

@cdhowie @moritzdietz

It seems I’m affected by this: after running some prunes, my data transfer costs increased significantly (~700GB total).

I have a few questions:

  1. If I run a “restic forget --prune” that deletes 5 snapshots, does it use the same traffic as running 5 individual forget-and-prune operations, each specifying an exact snapshot ID?
    Or does a single run use less traffic because it handles all the removals in one prune pass, reducing the number of downloads?

  2. Does running “restic stats” (with or without “--mode raw-data”) download significant amounts of data too?

  3. Does running “restic check” download significant amounts of data too?

Thank you very much!

Note that forget only accesses/removes comparatively small files under /snapshots, and those are usually contained in your local cache. If you specify snapshots to remove directly, it will just remove those snapshot files. Otherwise it reads all snapshot files to determine which ones to delete.
As a rule of thumb: forget does not need much traffic.

Another thing is passing --prune to forget: this is just a shortcut for running forget and then prune.

prune, in contrast, always causes a lot of traffic for remote repositories.
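
One practical consequence (my suggestion, not something restic mandates): since forget is cheap and prune is expensive, you can run them on separate schedules instead of always passing --prune to forget. Assuming the repository is configured via RESTIC_REPOSITORY:

    # Frequently (e.g. daily): apply the retention policy; cheap,
    # mostly served from the local cache:
    restic forget --keep-daily 7 --keep-weekly 4

    # Rarely (e.g. monthly): actually remove unreferenced data;
    # expensive on remote repositories:
    restic prune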

As for restic stats: no, it uses only metadata, which is usually also contained in your local cache.

For restic check: if you don’t specify --read-data, it also only reads metadata. By default, however, check does not use the local cache, which means this metadata will be downloaded from a remote repository; you can specify --with-cache to change that.
If you use --read-data (or --read-data-subset), it will have to download a lot of data from your remote repository.
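
To illustrate the flags mentioned above (a sketch, again assuming the repository is configured via RESTIC_REPOSITORY):

    # Repository statistics; metadata only, usually served from the cache:
    restic stats --mode raw-data

    # Structural check using the local cache instead of re-downloading metadata:
    restic check --with-cache

    # Verify a slice of the pack files, here one tenth of them:
    restic check --read-data-subset=1/10

    # Full verification; downloads every pack file in the repository:
    restic check --read-data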

Just to mention:
Since we are talking about S3 (like many other object storages I know of), there is no way to “download the pack header”, which would mean downloading only part of a file. As this backend only allows downloading a file in its entirety, that is what restic does - and as @cdhowie described, with the current prune implementation it does so twice for every data file :frowning:

This is the big advantage of using restic’s REST server - it transfers just what is needed.

BTW: the same happens when restoring from an S3 backend. If you need only a single blob out of a pack file, it will download the whole pack file. This, however, is not that severe, as 1. you usually do not restore that often, and 2. needing many blobs or large blobs from the same pack is not uncommon.

EDIT: This is not true, see below.
If your storage supports downloading ranges and the backend implementation used within restic supports this, only the needed parts of the pack are downloaded. This is the case e.g. for the S3 backend and the REST backend.

Depending on your needs and budget you can use an intermediate storage server:

  • Back up onto a regular server
  • Perform maintenance on that server (forget, prune, rebuild-index, etc.)
  • Push from that server to S3 or another remote.

It may cost you less and also provide a second level of backup.

I don’t believe this is accurate. S3 GetObject supports the Range header, and restic’s (re)indexer takes advantage of that - see pack readHeader. Likewise, restore will only download the smallest pack file range that includes all needed blobs.
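
For anyone curious, you can observe the same mechanism outside restic (the bucket and key below are placeholders). restic pack files keep their header at the end, so a suffix range request is enough to read it:

    # Fetch only the last kilobyte of an object via an HTTP Range request:
    aws s3api get-object --bucket my-bucket \
        --key restic-repo/data/00/00deadbeef --range bytes=-1024 /tmp/header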

Ah, thanks for clarifying - I didn’t know S3 supported that.
I checked the implementation of the S3/Minio backend and you are right, this is indeed implemented.

Approximately what percentage of the data is downloaded during prune and check operations (without --read-data, which downloads everything)?

For example, if I have 1 TB on S3 Standard or Infrequent Access, roughly how much is downloaded during a prune? Is it feasible to prune such a repository daily or weekly without noticeable bandwidth charges?

Users who have used S3 or similar: how large are the bandwidth surcharges arising from daily prune operations in such a case?

Is there a caching solution?

If you want to minimize the amount of downloaded data, then use prune --repack-cacheable-only. However, that option will probably cause quite a bit of storage overhead. The assumption there is also that the host running backup and prune is the same, allowing restic to use its cache. Otherwise it becomes necessary to download the whole repository index again after a prune run.

Prune will have to download all snapshot, index, and tree data packs unless they are already cached. Essentially, with prune you have to trade off wasted space against download size: the more unused space is allowed to exist in a repository, the more often restic will be able to simply delete whole pack files and thereby avoid downloading them.
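
A quick sketch of the two knobs just described. Note these flags come from newer restic releases than the 0.9.6 this thread started with, and --max-unused is my addition here as the flag that controls the space/download trade-off:

    # Repack only packs with cacheable (tree/metadata) contents; data packs
    # are kept or deleted whole, so they never need to be downloaded:
    restic prune --repack-cacheable-only

    # Tolerate more unused space (here 10% of the repository) so fewer packs
    # need to be repacked, and therefore downloaded:
    restic prune --max-unused 10%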
