Restic flow chart

Is there a flow chart of how restic backup works (i.e., how the repo is built, at what time data is encrypted, etc.)?

Thanks!

ChatGPT gave me this:

  1. File Selection:
  • The user specifies the files and directories they want to back up when invoking the “restic backup” command. Restic recursively scans the specified paths to identify the files and directories to include in the backup.
  2. Data Chunking:
  • Restic breaks the data into variable-sized chunks using a rolling hash function. This process identifies patterns in the data and creates variable-sized chunks, allowing Restic to efficiently handle changes within files.
  3. Content-Defined Chunking:
  • Restic uses a content-defined chunking algorithm, meaning that identical chunks of data will result in the same unique identifier (chunk ID). This is important for deduplication, as identical chunks are stored only once in the repository.
  4. Chunk Hashing:
  • Each data chunk is hashed using a strong cryptographic hash function, such as SHA-256. The resulting hash serves as the unique identifier for that specific chunk.
  5. Deduplication:
  • Restic checks the repository to see if each chunk already exists. If a chunk with the same hash is found in the repository, it is not re-uploaded. This deduplication mechanism helps save storage space by avoiding redundant storage of identical chunks.
  6. Compression:
  • Before storing chunks in the repository, Restic may apply optional compression to reduce the size of the data. This helps optimize storage space and can improve transfer speed, especially when using remote storage backends.
  7. Encryption:
  • Each chunk, along with its metadata, is encrypted using strong encryption algorithms (AES-256 in GCM mode for data encryption and Poly1305 for integrity checks). Restic uses a master key derived from a user-supplied password or key file to perform the encryption.
  8. Metadata Generation:
  • Restic generates metadata for each file and directory being backed up. This metadata includes information such as file permissions, ownership, modification times, and other relevant attributes.
  9. Repository Storage:
  • The encrypted and compressed data chunks, along with metadata, are stored in the repository. The repository is organized into a tree structure, with each node representing a snapshot and referencing the unique chunk IDs required for that snapshot.
  10. Snapshot Creation:
  • Once all the data is stored in the repository, Restic creates a snapshot. The snapshot includes metadata for the entire backup, providing a point-in-time view of the data. Snapshots facilitate easy and efficient restoration of data to a specific state.

Is this description accurate?

Check this out to verify those results. There are plenty more details on that site!


Steps 2 and 3 are the same, and Step 7 is wrong about the encryption details: restic uses AES-256 in counter mode with Poly1305-AES for authentication, not GCM. Step 9: the tree structure stored in a repository (the repository itself only contains two levels of folders!) does not store snapshots.
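
To illustrate why chunking with a rolling hash and content-defined chunking are one and the same operation, here is a minimal Python sketch. It is a toy, not restic's chunker: the windowed byte sum, `WINDOW`, `MASK`, and `MIN_SIZE` are assumptions chosen for readability, whereas restic uses a Rabin fingerprint and much larger chunks. The one part that mirrors restic is that a chunk's ID is the SHA-256 hash of its content.

```python
import hashlib

WINDOW = 16          # bytes in the rolling window (toy value)
MASK = (1 << 6) - 1  # cut when the low 6 bits of the window sum are zero
MIN_SIZE = 32        # minimum chunk length in bytes (toy value)


def chunk(data: bytes) -> list[bytes]:
    """Split data at positions chosen by its content, not by fixed offsets."""
    chunks, start, window_sum = [], 0, 0
    for i, byte in enumerate(data):
        window_sum += byte
        if i >= WINDOW:
            window_sum -= data[i - WINDOW]  # slide the window forward
        # A boundary is placed where the last WINDOW bytes match the pattern.
        if i + 1 - start >= MIN_SIZE and (window_sum & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def chunk_ids(data: bytes) -> list[str]:
    """The ID of a chunk is the SHA-256 hash of its content."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]
```

Because each boundary depends only on nearby bytes, an edit in one place typically changes just the chunks around it, and the unchanged chunks keep their IDs and are deduplicated.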

The high-level ideas roughly match restic’s behavior, but are slightly inaccurate. For example, the backup steps overlap in reality: the file selection happens during the backup, not before it, and so do the metadata collection and data upload.

So a more accurate description would be that the backup recursively scans the specified path. When it encounters a new or changed file, the file is chunked and deduplicated. The new chunks are compressed and encrypted, then combined into a pack file that is uploaded once it reaches a certain size. Once a folder has been fully processed, its metadata gets written to the repository (if it already exists, it also gets deduplicated). Once all folders have been fully processed, the snapshot is created. The snapshot file only contains a pointer to the root folder and not the whole metadata of the backup.
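
The flow described above can be sketched roughly as follows. This is an in-memory model under stated assumptions, not restic's code or on-disk format: `Repo`, `split`, `backup_dir`, `backup`, and `PACK_TARGET` are hypothetical names, the splitter is a fixed-size stand-in for the real chunker, a nested dict stands in for the file system, and encryption is left out entirely.

```python
import hashlib
import json
import zlib

PACK_TARGET = 64  # toy pack size in bytes; real pack files are far larger


class Repo:
    """Hypothetical in-memory repository, not restic's on-disk format."""

    def __init__(self):
        self.index = set()    # IDs of blobs already stored
        self.packs = []       # "uploaded" pack files
        self.pending = []     # compressed blobs waiting for the next pack
        self.snapshots = []

    def save_blob(self, plaintext: bytes) -> str:
        blob_id = hashlib.sha256(plaintext).hexdigest()
        if blob_id not in self.index:          # deduplicate before any work
            sealed = zlib.compress(plaintext)  # encryption omitted in this sketch
            self.index.add(blob_id)
            self.pending.append(sealed)
            if sum(map(len, self.pending)) >= PACK_TARGET:
                self.flush()                   # pack is full: upload it
        return blob_id

    def flush(self):
        if self.pending:
            self.packs.append(b"".join(self.pending))
            self.pending = []


def split(data: bytes, size: int = 8) -> list[bytes]:
    # Fixed-size stand-in for the content-defined chunker.
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup_dir(repo: Repo, tree: dict) -> str:
    """Process one folder and return the ID of its metadata blob."""
    nodes = []
    for name, entry in sorted(tree.items()):
        if isinstance(entry, dict):            # subfolder: process it first
            nodes.append({"name": name, "type": "dir",
                          "subtree": backup_dir(repo, entry)})
        else:                                  # file: chunk and store
            ids = [repo.save_blob(c) for c in split(entry)]
            nodes.append({"name": name, "type": "file", "content": ids})
    # The folder is done, so its metadata is written (and deduplicated too).
    return repo.save_blob(json.dumps(nodes, sort_keys=True).encode())


def backup(repo: Repo, root: dict) -> dict:
    snapshot = {"tree": backup_dir(repo, root)}  # only a pointer to the root
    repo.flush()
    repo.snapshots.append(snapshot)
    return snapshot
```

Running `backup` twice on the same input produces two snapshots that point to the same root ID, and the second run stores no new blobs.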

All in all, the description is like 80% correct.


Does this mean that if I have a huge folder with lots of large files, and the backup keeps getting interrupted, restic will always start processing and uploading that folder from the beginning, irrespective of how much has been uploaded previously in an interrupted session?

Restic will have to process those files again, but won’t reupload data chunks already written by a previous session (it may forget about a few minutes of progress).
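
A toy Python model of that behavior, assuming a hypothetical `run_backup` function and a dict named `stored` that stands in for whatever the repository kept from the interrupted session:

```python
import hashlib


def run_backup(chunks, stored, fail_after=None):
    """Hash every chunk again, but upload only those missing from `stored`.

    Returns (hashed, uploaded) counts; `fail_after` simulates an interruption
    after that many chunks have been processed.
    """
    hashed = uploaded = 0
    for chunk in chunks:
        if fail_after is not None and hashed == fail_after:
            break  # simulated interruption
        chunk_id = hashlib.sha256(chunk).hexdigest()
        hashed += 1
        if chunk_id not in stored:  # already in the repository: skip the upload
            stored[chunk_id] = chunk
            uploaded += 1
    return hashed, uploaded
```

With ten distinct chunks and an interruption after six, the second run reads and hashes all ten again but uploads only the remaining four. The model is optimistic in one respect: it assumes every chunk written before the interruption is remembered, while restic may lose the last few minutes of progress.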
