Do incremental off-site backups require downloading previous backup(s)?

Hello,

I have been looking to use Restic for off-site backups and I was wondering about its behavior with incremental backups.

For example, Duplicity has to download all previous backups in order to calculate the delta before creating a new backup.

Does restic have a similar behavior, or does it rely on a local cache, as implied in this answer?

Thank you.

No, and restic doesn’t do incremental backups in the sense in which that term historically applies (a master backup plus deltas based on either the master or the last delta).

To simplify: when you perform a backup, you create a file list (snapshot) of all files that exist at the time of the backup. Each file in that list is checked to see whether it has already been backed up before; if not, it is uploaded. When you do a forget, you are getting rid of one or more of those lists (snapshots). When you do a prune, you are checking which files are no longer referenced by any remaining list (snapshot), and those are deleted. As I said, this is a simplified description of what happens.
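
For illustration, here is a minimal sketch of that cycle on the command line. The repository URL, paths and retention values are placeholders, not a recommendation:

    # initialize an off-site repository once (S3 shown; bucket and path are made up)
    # (the password is supplied via RESTIC_PASSWORD or --password-file, omitted here)
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo init

    # every backup run creates a new snapshot (file list); content that is
    # already in the repository is not uploaded again
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo backup /home/me/documents

    # forget drops snapshots (the "lists") according to a retention policy
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo forget --keep-daily 7 --keep-weekly 4

    # prune deletes data that is no longer referenced by any remaining snapshot
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo prune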

The operation that can impact ingress/egress from the cloud the most (after the first backup) is the prune, since files are not stored individually in the repository, especially if a number of files have been deleted.

If you go through the last month or two of posts you will come across a number of good discussions that explain things in more detail, and you will see that it is easy to get a listing of the “incremental” changes between two snapshots (I do this on every backup, more to see what has disappeared than what has been added).
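
For example, restic diff prints the changes between two snapshots (the snapshot IDs below are made up; take real ones from restic snapshots):

    # list snapshots and their IDs
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo snapshots

    # show which files were added, removed or modified between two snapshots
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo diff 1a2b3c4d 5e6f7a8b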

Re the cache: it speeds up determining which files already exist in the repo, and it can be bypassed with the force option on a backup. Bypassing it slows things down, but the data pulled from the repo is not the backed-up file content, so data usage is minimal.
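
As I understand it, the relevant flags look like this (repository and path are placeholders): --force re-reads all target files instead of trusting the parent snapshot’s metadata, and --no-cache skips the local cache entirely:

    # re-read every target file instead of relying on the parent snapshot's metadata
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo backup --force /home/me/documents

    # bypass the local cache; the needed metadata is fetched from the repository instead
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo backup --no-cache /home/me/documents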

Hope this helps.

To be more precise, restic stores a repository in four kinds of files (an example layout follows the list):

  • one snapshot file for each snapshot (one is created with each backup run)
  • a bundle of index files
  • data files containing metadata (the so-called “tree blobs”): filenames, file times, etc.
  • data files containing file contents (the so-called “data blobs”)

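As an illustration, the top level of a restic repository looks like this (local backend shown for clarity; “my-repo” is a placeholder):

    $ ls my-repo/
    config  data  index  keys  locks  snapshots

    # data/      pack files holding tree blobs (metadata) and data blobs (file contents)
    # index/     index files mapping blobs to the pack files they live in
    # snapshots/ one small file per snapshot
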
When doing a backup, restic always needs to read the index files. From this information it knows which blobs are already present, and if they are, it doesn’t save them again (this is the so-called “deduplication”).

If you do a follow-up backup, restic automatically tries to base the new snapshot on a previous (so-called “parent”) snapshot. To do so, it reads all snapshot files to find that parent snapshot, then reads the metadata of all directories/files within it. From this information it can identify which directories/files have actually been modified and only needs to process those (note that the generated snapshot still represents a full backup!). This gives backup speeds comparable to a purely incremental backup.
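
On the command line this is automatic; restic picks a parent snapshot matching the host and paths, and you can also name one explicitly (the snapshot ID and paths below are placeholders):

    # explicitly choose the parent snapshot used for change detection
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo backup --parent 1a2b3c4d /home/me/documents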

Summarizing: for a follow-up backup to an off-site repository, restic needs to access all index files, all snapshot files, and some data files.

These kinds of files are, however, cached by default. So if you run the follow-up backup from the same machine and did not clean up the cache, they are read from the cache and not downloaded from the repository.
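
For reference, the cache can be inspected and cleaned like this (the default location is ~/.cache/restic on Linux; the custom directory below is just an example):

    # show the local cache directories restic knows about
    restic cache

    # remove old, no longer used cache directories
    restic cache --cleanup

    # use a custom cache location
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo --cache-dir /var/cache/restic backup /home/me/documents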

Note that during backup restic never needs to read the data files containing the actual file contents, which usually make up more than 99% of the repository size.

A few side remarks:

  • restic does need the information from the repository about which files are present within the repository. So it basically does a list or ls in the repository data structure, depending on the storage type you are using (a sketch follows this list).
  • I omitted some kinds of files, such as the config file, key files, and lock files. Those are read from the repository, as they are not cached. This is also the reason why restic does not work well with “write-only” storage or “cold storage” (with large access times). I made some PRs which add options so that this can be fixed. With those options, restic backup really only needs listings of the repository and reads all other data from the cache (if present there).
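
To make the first remark concrete: for the S3 backend this is roughly an object listing per repository directory, and restic exposes the same information itself (bucket and repository path are placeholders):

    # roughly what the S3 backend asks for
    aws s3 ls s3://my-bucket/restic-repo/snapshots/
    aws s3 ls s3://my-bucket/restic-repo/index/

    # the equivalent through restic
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo list snapshots
    restic -r s3:s3.amazonaws.com/my-bucket/restic-repo list index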

Thanks @alexweiss for your answer, it’s much clearer now.

restic does need the information from the repository about which files are present within the repository. So it basically does a list or ls in the repository data structure, depending on the storage type you are using

So I guess that if I use S3 as the repository with a policy that automatically moves files to Glacier after a few days, restic won’t be able to make follow-up backups?

I have not used S3 Glacier so far, but I would guess that an aws s3 ls would also work within S3 Glacier. So applying this policy only to files under /data/ should work perfectly (as long as you make backups with a “parent” snapshot whose metadata is in your local cache). If you intend to make a policy for the whole repository, you’ll get into trouble with the key, config, and lock files. Then you can think about patching your restic with some of the PRs I proposed; see https://github.com/restic/restic/issues/2504
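
A sketch of such a lifecycle rule restricted to the data/ prefix (bucket name, repository path and the 7-day delay are placeholders; DEEP_ARCHIVE would work the same way):

    # transition only objects under the repository's data/ prefix to Glacier
    aws s3api put-bucket-lifecycle-configuration \
      --bucket my-bucket \
      --lifecycle-configuration '{
        "Rules": [
          {
            "ID": "restic-data-to-glacier",
            "Status": "Enabled",
            "Filter": { "Prefix": "restic-repo/data/" },
            "Transitions": [ { "Days": 7, "StorageClass": "GLACIER" } ]
          }
        ]
      }'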

Using the Glacier or GDA tiers with restic is fraught with caveats and expense. I would strongly suggest not using them.

If S3 storage is too expensive, consider using Backblaze B2.

Well, I was considering using Scaleway’s Object Storage and C14 Cold Storage offerings in order to get even cheaper storage than B2, but you definitely convinced me to stick with B2 (which is cheaper than Scaleway’s Object Storage alone).

So I guess that if I use S3 as the repository with a policy that automatically moves files to Glacier after a few days, restic won’t be able to make follow-up backups?

This is pretty similar to my approach. I do the following (a script sketch follows the list):

  1. I run restic backup against a local repository (on another RAID volume)
  2. After the backup is complete, I perform aws s3 sync [repo path] [s3 bucket] --delete to synchronize any changes to S3
  3. I have an S3 lifecycle rule that moves data/ files in my repo to Glacier Deep Archive after 7 days. Only the data directory is transitioned (which is about 98% of the total repo).
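
A rough sketch of steps 1 and 2 as a script (paths, bucket name and password file are placeholders; the lifecycle rule from step 3 is configured once on the bucket, similar to the sketch earlier in the thread):

    #!/bin/sh
    # 1. back up into the local repository on the other RAID volume
    restic -r /mnt/raid/restic-repo --password-file /root/.restic-pass backup /home /etc

    # 2. mirror the whole local repository to S3; --delete also removes objects
    #    that disappeared locally (e.g. after a local prune)
    aws s3 sync /mnt/raid/restic-repo s3://my-bucket/restic-repo --delete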

This ticks off so many of my goals:

  • Implements US-CERT’s 3-2-1 backup recommendation (2 local copies and 1 remote copy of data)
  • Very fast performance for use cases that use the local copy
  • I can directly access the cloud copy using the restic client for many tasks (listing snapshots, diffing snapshots, etc.)
  • Pruning is easy. I perform prunes against my local copy weekly.
  • Pruning is low-cost (for my data dynamics). My GDA early deletion fees are negligible (< $0.02 per month). For people with more dynamic data, this could be higher, but it could perhaps be mitigated with a longer delay before moving to GDA.
  • Very low cloud-storage costs. (My Amazon costs are about 30% of what I used to pay at Wasabi.com)

Just a side remark about the pruning issue:

Once PR #2718 is merged, the picture IMO changes completely.

With this PR you can enforce that data files containing only file contents (“data blobs”) are never read during prune, but only deleted if no longer needed. (Note that “tree blobs” will still be needed, so they are downloaded if they are needed and not present in the local cache, e.g. if you back up from many machines into one repository.)

IMO this cures the main issues @cdhowie mentioned when using S3 Glacier. Then there will no longer be a technical need to prune locally and sync to Glacier.