Do incremental off-site backups require downloading previous backup(s)?

axeleroy · May 19, 2020, 9:11pm

Hello,

I have been looking to use Restic for off-site backups and I was wondering about its behavior with incremental backups.

For example, Duplicity has to download every previous backup in order to calculate the delta before creating a new backup.

Does restic behave similarly, or does it rely on a local cache, as is implied in this answer?

Thank you.

doscott · May 19, 2020, 11:48pm

No, and restic doesn’t do incremental backups in the sense that that term historically applies (master backup plus deltas based on either the master or the last delta).

To simplify, when you perform a backup you create a file list (snapshot) of all files that exist at the time of the backup. Each file in that list is checked to see if it has already been backed up before; if not, it is uploaded. When you do a forget you are getting rid of a list or lists (snapshots). When you do a prune, restic checks which files are no longer referenced in any remaining lists (snapshots) and deletes them. As I said, this is a simplified description of what happens.
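The simplified description above can be sketched as a toy model. This is only an illustration of the idea, not restic's code: the `ToyRepo` class and its method names are made up for this example, and it works on whole files, whereas restic actually works on chunks of files.

```python
import hashlib


class ToyRepo:
    """Hypothetical file-level model of backup / forget / prune."""

    def __init__(self):
        self.objects = {}    # content hash -> content stored in the repo
        self.snapshots = {}  # snapshot name -> {path: content hash}
        self.uploads = 0     # how many objects were actually uploaded

    def backup(self, name, files):
        listing = {}
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            if digest not in self.objects:  # only new content is uploaded
                self.objects[digest] = content
                self.uploads += 1
            listing[path] = digest
        self.snapshots[name] = listing      # every snapshot is a full list

    def forget(self, name):
        del self.snapshots[name]            # drops the list, not the data

    def prune(self):
        used = {d for listing in self.snapshots.values() for d in listing.values()}
        for digest in list(self.objects):
            if digest not in used:          # unreferenced by any snapshot
                del self.objects[digest]


repo = ToyRepo()
repo.backup("monday", {"a.txt": b"alpha", "b.txt": b"beta"})
repo.backup("tuesday", {"a.txt": b"alpha", "c.txt": b"gamma"})
repo.forget("monday")
objects_before_prune = len(repo.objects)
repo.prune()
objects_after_prune = len(repo.objects)
```

In this sketch the second backup uploads only `c.txt`, and the content of `b.txt` stays in the repo after the forget until the prune removes it.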

The operation that can impact ingress/egress from the cloud the most (after the first backup) is the prune, because files are not stored individually, especially if a number of files have been deleted.

If you go through the last month or two of posts you will come across a number of good discussions that explain things in more detail, and you will see that it is easy to get a listing of the “incremental” change between two snapshots (I do this on every backup, more to see what has disappeared than what has been added).

Re the cache: it speeds up determining which files already exist in the repo, and it can be bypassed with a force option on a backup. Bypassing it slows things down, but the data pulled from the repo is not the backed-up files, so data usage is minimal.

Hope this helps.

alexweiss · May 28, 2020, 5:32am

To be more precise, restic stores a repository as four kinds of files:

one snapshot file for each snapshot (one is created with each backup run)
a bundle of index files
data files containing metadata (the so-called “tree blobs”). These are file names, file times, etc.
data files containing file contents (the so-called “data blobs”)

When doing a backup, restic always needs to read the index files. From this information it knows which blobs are already present, and it doesn’t save those again (this is the so-called “deduplication”).

If you do a follow-up backup, restic automatically tries to base the new snapshot on a previous (so-called “parent”) snapshot. To do so, it reads all snapshot files to find that parent snapshot. Then it reads the metadata of all the directories/files within this snapshot. From this information it can identify which directories/files have really been modified and only needs to process these (note that the generated snapshot is still a full backup!). This gives backup speeds comparable to those of an incremental backup.
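A minimal sketch of that comparison against the parent snapshot, assuming that size and modification time are the attributes being compared (the exact set restic checks is not spelled out here). The `Meta` type and the `files_to_process` function are hypothetical names for this illustration.

```python
from typing import Dict, List, NamedTuple


class Meta(NamedTuple):
    size: int
    mtime: float


def files_to_process(parent: Dict[str, Meta], current: Dict[str, Meta]) -> List[str]:
    """Return paths that are new or whose metadata differs from the parent snapshot."""
    return sorted(path for path, meta in current.items() if parent.get(path) != meta)


# Metadata recorded in the parent snapshot (read from the cache or the repo).
parent = {"a.txt": Meta(5, 100.0), "b.txt": Meta(4, 100.0)}
# Metadata of the files on disk right now.
current = {"a.txt": Meta(5, 100.0), "b.txt": Meta(9, 250.0), "c.txt": Meta(1, 300.0)}

changed = files_to_process(parent, current)
```

Only `b.txt` (modified) and `c.txt` (new) would need to be read; `a.txt` is simply referenced again in the new snapshot, which is why the result is still a full snapshot.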

In summary, for a follow-up backup to an off-site repository, restic needs to access all index files, all snapshot files and some data files.

These kinds of files are, however, cached by default. So if you run the follow-up backup from the same machine and have not cleaned up the cache, they are read from the cache and not downloaded from the repository.

Note that during backup restic never needs to read the data files containing the actual file contents, which usually make up more than 99% of the repository size.

A few side remarks:

restic does need to know which files are present within the repository. So it basically performs a list or ls on the repository data structure (depending on the storage type you are using).
I omitted some kinds of files, such as the config file, key files and lock files. Those are read from the repository, as they are not cached. This is also the reason why restic does not work well with “write-only” storage or “cold storage” (with long access times). I made some PRs that add options to fix this. With those options, restic backup really only needs listings of the repository and reads all data from the cache (if present there).

axeleroy · May 28, 2020, 7:19am

Thanks @alexweiss for your answer, it’s much clearer now.

restic does need to know which files are present within the repository. So it basically performs a list or ls on the repository data structure (depending on the storage type you are using).

So I guess that if I use S3 as the repository with a policy that automatically moves files to Glacier after a few days, restic won’t be able to make follow-up backups?

alexweiss · May 28, 2020, 9:29am

I have not used S3 Glacier so far, but I would guess that an aws s3 ls also works for objects in S3 Glacier. So applying this policy only to files under /data/ should work perfectly well (if you only make backups with a “parent” snapshot whose metadata is in your local cache). If you intend to apply a policy to the whole repository, you’ll run into trouble with the key, config and lock files. In that case you could consider patching your restic with some of the PRs I proposed; see “restic with "cold" storage (here: OVH Cloud Archive)”, Issue #2504 in restic/restic on GitHub.

cdhowie · May 29, 2020, 12:48pm

Using the Glacier or GDA tiers with restic is fraught with caveats and expense. I would strongly suggest not using them.

If S3 storage is too expensive, consider using Backblaze B2.

axeleroy · May 29, 2020, 1:44pm

Well, I was considering using Scaleway’s Object Storage and C14 Cold Storage offerings in order to get even cheaper storage than B2, but you definitely convinced me to stick with B2 (which is cheaper than Scaleway’s Object Storage alone).

David · June 1, 2020, 9:05pm

So I guess that if I use S3 as the repository with a policy that automatically moves files to Glacier after a few days, restic won’t be able to make follow-up backups?

This is pretty similar to my approach. I do the following:

I run restic backup against a local repository (on another RAID volume)
After the backup is complete, I perform aws s3 sync [repo path] [s3 bucket] --delete to synchronize any changes to S3
I have an S3 lifecycle rule that moves the data/ files in my repo to Glacier Deep Archive after 7 days. It applies only to the data directory (which is about 98% of the total repo).

This ticks off so many of my goals:

Implements US-CERT’s 3-2-1 backup recommendation (2 local copies and 1 remote copy of data)
Very fast performance for use cases that use the local copy
I can directly access the cloud copy using the restic client for many tasks (lists of snapshots, diff snapshots, etc)
Pruning is easy. I perform prunes against my local copy weekly.
Pruning is low-cost (for my data dynamics). My GDA early deletion fees are negligible (< $0.02 per month). For people with more dynamic data, this could be higher, but could perhaps be mitigated with a longer delay before moving to GDA.
Very low cloud-storage costs. (My Amazon costs are about 30% of what I used to pay at Wasabi.com)

alexweiss · June 9, 2020, 5:23am

Just a side remark about the pruning issue:

Once PR #2718 is merged, the picture IMO changes completely.

With this PR you can enforce that the data files containing only file contents (“data blobs”) are never read during prune, but only deleted if no longer needed. (Note that “tree blobs” will still be needed, so they are downloaded if they are not present in the local cache, e.g. if you back up from many machines into one repository.)

IMO this cures the main issues @cdhowie mentioned when using S3 Glacier. There will then no longer be a technical need to prune locally and sync to Glacier.

tomwaldnz · December 11, 2020, 7:57pm

I’m thinking some more about using AWS Glacier Deep Archive for a “last restore” Restic repository. I’m considering an approach of doing a backup of my data to a local disk, then using the AWS command line program (aws s3 sync) to copy the restic repo up to AWS S3. I’d have a lifecycle rule that moves files in the “data” folder to deep archive class. I’d also copy the repo to an offsite disk. The repo would never be downloaded unless my local backups failed. Can anyone see any problems with this approach?

The other approach mentioned above is for Restic to back up to S3 directly, and use a lifecycle rule to move the /data folder to Glacier DA class. I wouldn’t bother pruning the repo, my data is rarely deleted, but regularly added to (e.g. family photos are added but not deleted). My understanding is /data is never read / deleted if a prune / validate / restore isn’t done, and S3 provides a list of files in each folder regardless of the storage class of the object. With that in mind, is there any reason Restic would need to read the contents of a file in the data directory?

I understand B2 is an easier option, which is a bit cheaper than S3 IA but more expensive than S3 DA. I’m an AWS specialist and prefer to keep my data there where practical and economic, but B2 is a quality storage provider I’d consider using if I couldn’t use S3 DA.

alexweiss · December 12, 2020, 6:43am

Besides considering the potential costs for retrieving your data in a worst case scenario (which I guess you already did) I don’t see any problems.

There are two kinds of files under /data: files containing data blobs and files containing tree blobs.

The data blobs are only accessed when file content is needed or when repacking. So, if you don’t restore, cat, mount etc. and don’t repack during pruning, these files are never accessed (only listed).

The tree blobs are used more widely, basically every time you need to access the tree structure of your backup, e.g. during a diff or ls. But the trees are also used to speed up a subsequent backup (by comparing times and file sizes in the trees with your data) and during pruning to decide which data blobs are still needed. So those are accessed frequently, and restic basically doesn’t work without accessing them.
If your files containing tree blobs are migrated to S3 DA, you probably won’t notice, as these files are also cached locally. However, if you lose your local cache or use a repo from many PCs, you need to access them.

A side remark: the prune from the latest beta builds should work in your setting, try --max-unused=unlimited or --repack-cacheable-only.

cdhowie · December 12, 2020, 7:15am

@alexweiss Note that he said:

So the repository would not be directly accessed to take backups, which mitigates all of your concerns except cost.

We can calculate which will be cheaper depending on how often you need to restore. Note that to keep this calculation simple, I am going to ignore the following:

Per-request upload and download fees.
Per-object GDA migration fees.
S3 housekeeping object fees.
S3 Standard storage fees both before transitioning to GDA and for restored objects held during retrieval.

B2 costs $0.005 per GB/month for storage and $0.01 per GB for egress.

S3 GDA costs $0.00099 per GB/month for storage and $0.09 per GB for egress, plus retrieval fees depending on how quickly you need your data:

For access in 3-5 hours (standard), $0.02 per GB.
For access in 5-12 hours (bulk), $0.0025 per GB.

(If you can’t wait 3 hours for a restore then GDA is disqualified on the spot.)

The final variable we have is r which is the number of times you plan to restore per year.

For each provider the yearly cost is:

12*storagerate*gb + retrievalrate*r*gb

By setting two scenarios equal and solving for r we can find the point at which both scenarios cost the same in terms of restores per year. B2 vs S3 GDA with the standard retrieval tier is thus:

12*0.005*gb + 0.01*r*gb = 12*0.00099*gb + (0.02 + 0.09)*r*gb

r = 0.4812

0.4812 restores per year works out to one restore roughly every two years. So, if you are going to use GDA, this would be my plan:

Know how much retrieval is going to cost (approx. $0.11 per GB in the repository).
Every month make sure that you have this amount in your savings to cover a retrieval. This will “cost” more than B2 in the short term as you have to reserve more funds than you would using B2 ($0.10 more per GB).

After 2 years without a retrieval, GDA starts to save you money.

For the sake of a complete answer, here is the calculation for the bulk retrieval tier:

12*0.005*gb + 0.01*r*gb = 12*0.00099*gb + (0.0025 + 0.09)*r*gb

r = 0.58

This is every 1.7 years instead of every 2. The standard AWS egress fees are way more of an issue than the GDA retrieval fees.
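The two break-even points above can be reproduced with a few lines of Python. This is just the yearly-cost equation solved for r, using the per-GB prices quoted in this post; the function and variable names are made up for the sketch, and gb cancels out on both sides.

```python
def break_even_restores_per_year(storage_a, restore_a, storage_b, restore_b):
    """Solve 12*storage_a + restore_a*r == 12*storage_b + restore_b*r for r.

    All rates are in dollars per GB (storage rates per GB per month).
    """
    return 12 * (storage_a - storage_b) / (restore_b - restore_a)


# Prices as quoted in this post (December 2020).
B2_STORAGE, B2_EGRESS = 0.005, 0.01
GDA_STORAGE, AWS_EGRESS = 0.00099, 0.09
GDA_STANDARD, GDA_BULK = 0.02, 0.0025

r_standard = break_even_restores_per_year(
    B2_STORAGE, B2_EGRESS, GDA_STORAGE, GDA_STANDARD + AWS_EGRESS)
r_bulk = break_even_restores_per_year(
    B2_STORAGE, B2_EGRESS, GDA_STORAGE, GDA_BULK + AWS_EGRESS)

print(f"standard: {r_standard:.4f} restores/year, one every {1 / r_standard:.1f} years")
print(f"bulk:     {r_bulk:.4f} restores/year, one every {1 / r_bulk:.1f} years")
```

Changing the restore rate passed for GDA (for example, leaving out the egress fee) lets you explore other scenarios with the same function.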

alexweiss · December 12, 2020, 7:23am

Thanks for this nice calculation!

This makes using GDA within AWS (e.g. restoring data to an EC2 instance) much more attractive (but this, of course, is the business model of AWS).

cdhowie · December 12, 2020, 7:32am

Indeed. If we remove the $0.09/GB egress fee and run the Standard retrieval scenario again, restoring 4.8 times per year would cause the cost to be equal with B2, so you’re saving money if you restore less often than once every 2.5 months… and if you are restoring every 2.5 months then there’s a pretty severe reliability issue elsewhere.

tomwaldnz · December 12, 2020, 8:09am

Thanks all, especially @cdhowie for the interesting calculations!

In my first scenario there’s a local / online copy of the repo, so restore from the cloud is for disasters only. Cloud is my tertiary backup, and I back up using two technologies, so a cloud restore is really unlikely.

In my second scenario, backing up directly, you end up paying upload costs and then the cost of transitioning the data to a new storage class, at $0.05 per 1000 objects. That doesn’t sound expensive, but a small 400GB repo I have with 83,000 files would cost $4 to transition. That is not much, but given that 400GB would only cost 40c per month to store, that’s 10 months of storage just to do the transition. So direct upload probably loses much of the advantage of DA, and B2 would probably be a better option. Increasing the data file size would make cloud storage cheaper, given the per-file fees for uploads and transitions.
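For reference, the transition arithmetic can be checked like this. The storage rate is the $0.00099 per GB/month figure quoted earlier in the thread; the variable names are only for this sketch, and upload request fees are left out.

```python
objects = 83_000                    # pack files in the 400GB repo
repo_gb = 400
transition_fee_per_1000 = 0.05      # dollars per 1000 objects moved to DA
gda_storage_per_gb_month = 0.00099  # dollars, as quoted earlier in the thread

transition_cost = objects / 1000 * transition_fee_per_1000
monthly_storage = repo_gb * gda_storage_per_gb_month
months_of_storage = transition_cost / monthly_storage
avg_object_mb = repo_gb * 1000 / objects

print(f"transition: ${transition_cost:.2f}")
print(f"storage:    ${monthly_storage:.2f} per month")
print(f"transition equals {months_of_storage:.1f} months of storage")
print(f"average object size: {avg_object_mb:.1f} MB")
```

Because the fee is charged per object, the transition cost scales with the number of pack files rather than with the repository size, which is why larger data files would help.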

Couple of questions from what people have said:

Where is the local restic cache?
What is a “repack”? I assumed the data files were immutable / never changed - does “repack” imply data files pulled out of the packed files, files no longer needed are removed, and new data files created? Is that part of a prune? I haven’t read the documentation about Restic in a while, I read it when I started using Restic.

cdhowie · December 12, 2020, 8:20am

By default the cache is stored in $HOME/.cache/restic. This can be changed with the --cache-dir flag, or the cache can be totally disabled with --no-cache, which is probably preferable when the repository is on a locally-attached disk; in that scenario, the cache is just a waste of space (and expensive write IOPs).

Yes, with one correction – files are split into chunks and each chunk is stored as a separate “data blob” in the repository. If you replace “files” with “blobs” then you are spot on.

Yes. The basic prune operation is just mark-and-sweep garbage collection, but since packs are immutable, any packs containing garbage objects have to be rewritten as new packs without the garbage. Prune is basically:

Crawl all snapshots, marking used objects.
Create a set of objects to repack, which are all used objects that share a pack with an unused object.
Create new packs out of this set of objects and upload them to the repository.
Rebuild the repository index.
Delete the packs that were rewritten.
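As a rough illustration of these steps, here is a toy version that works on in-memory sets. It is a sketch of the idea only, not restic's implementation: `toy_prune`, the pack names and the single "repacked-0" pack are invented for the example.

```python
def toy_prune(snapshots, packs):
    """snapshots: {name: set of blob ids}; packs: {pack id: set of blob ids}.

    Returns the new pack layout and a rebuilt blob -> pack index.
    """
    # 1. Crawl all snapshots, marking used blobs.
    used = set()
    for blobs in snapshots.values():
        used |= blobs

    # 2. Collect used blobs that share a pack with an unused blob.
    to_repack = set()
    obsolete = []
    for pack_id, blobs in packs.items():
        if blobs - used:                 # pack holds at least one unused blob
            to_repack |= blobs & used
            obsolete.append(pack_id)     # also covers packs with no used blobs

    # 3. Write new packs for the blobs that must survive.
    new_packs = dict(packs)
    if to_repack:
        new_packs["repacked-0"] = set(to_repack)

    # 4. Rebuild the index so it points only at packs that will remain.
    index = {blob: pack_id
             for pack_id, blobs in new_packs.items() if pack_id not in obsolete
             for blob in blobs}

    # 5. Only now delete the packs that were rewritten.
    for pack_id in obsolete:
        del new_packs[pack_id]
    return new_packs, index


snapshots = {"s1": {"a", "b"}, "s2": {"b", "c"}}
packs = {"p1": {"a", "x"}, "p2": {"b", "c"}, "p3": {"y"}}
new_packs, index = toy_prune(snapshots, packs)
```

Here `p2` is left untouched, blob `a` is moved out of `p1` into a new pack, and the old packs are removed last, mirroring the ordering of the steps.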

Note in particular that nothing is deleted until after everything else is done, to ensure that the repository is always in a consistent state if the operation is interrupted for whatever reason (SIGINT, restic crash, power cut, etc.). Restic allows duplicate objects; any duplicates left over from an interrupted prune will just be considered garbage by a future prune invocation and will be removed then. (Indeed, restic must allow duplicate objects since concurrent backups are permitted, and there is no coordination between backup processes. It’s possible and likely that running multiple backup operations in parallel will introduce duplicate objects.)

tomwaldnz · December 12, 2020, 8:45am

Great, thanks again @cdhowie. The cache sounds really helpful for my Linux servers as they store directly to S3, but locally I’ll disable the cache. I’ll report back anything interesting I find with using local storage pushed to AWS S3 Glacier Deep Archive.

alexweiss · December 12, 2020, 9:35am

To add one point: there might be pack files which contain only blobs that are no longer needed. In that case, prune simply deletes these pack files.

And while prune up to 0.11.0 repacks all pack files that contain both used and unused blobs, the latest beta allows you to keep some of these partly used pack files (how many is customizable by the user). This speeds up prune a lot, in exchange for reclaiming less space. If accessing your pack files is expensive or time-consuming, I recommend using a high value for --max-unused.