Restic does block-level deduplication, which is generally superior to file-level deduplication.
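To make the difference concrete, here is a toy sketch of block-level deduplication. It is not restic's actual algorithm: restic uses content-defined chunking with variable-size chunks, while this sketch uses naive fixed-size blocks, and the `BLOCK_SIZE`, `blocks`, and `store` names are invented for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # toy block size; restic uses content-defined chunks around 1 MiB

def blocks(data: bytes):
    """Split data into fixed-size blocks (a simplification of content-defined chunking)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def store(data: bytes, repo: dict):
    """Store each block under the hash of its contents; identical blocks are stored once."""
    refs = []
    for b in blocks(data):
        h = hashlib.sha256(b).hexdigest()
        repo.setdefault(h, b)   # no-op if the block is already present
        refs.append(h)
    return refs

repo = {}
store(b"AAAABBBBCCCC", repo)    # first file: 3 blocks
store(b"AAAABBBBDDDD", repo)    # second file shares two of its three blocks
print(len(repo))                # 4 unique blocks stored, not 6
```

A file-level scheme would see two different files and store both in full; the block-level scheme stores only the one block that actually changed. The same mechanism dedupes repeated blocks inside a single file.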
Hashes are stored in the data pack files as well as in indexes within the repository. The indexes can be cached locally for faster lookups, but the cache is not required; deduplication still works without it.
You couldn’t do this even with file-level dedupe because the repository is encrypted and content-addressable; the name of each object is the hash of its contents.
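Content-addressable here means the name of an object is derived purely from its contents. A minimal sketch (the `object_name` helper is invented for illustration, and real restic hashes happen alongside encryption):

```python
import hashlib

def object_name(contents: bytes) -> str:
    # The name depends only on the contents, never on a filename or path,
    # so identical contents always land under the same name.
    return hashlib.sha256(contents).hexdigest()

a = object_name(b"hello world")   # stored from /photos
b = object_name(b"hello world")   # stored again from /backup
print(a == b)                     # True: the second copy is free
```

The flip side is that there is no human-readable filename anywhere in the object namespace, which is why you cannot browse the repository directly.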
Restic is.
This doesn’t make sense; you really want block-level dedupe. It is substantially more efficient, and deduplication is even possible intra-file, not just inter-file.
All relevant data, including hashes, are stored in the repository.
Restic supports free-form snapshot tags, which might work for you.
I’m not entirely sure what you mean here. Restic doesn’t do incremental or differential backup, because deduplication is repository-wide. Deduplication can share chunks with files from a totally different machine in snapshots taken years apart.
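A toy sketch of what "repository-wide" means, with invented names (`backup`, `repo`): every snapshot, from any machine at any time, is just a list of chunk hashes referencing one shared pool.

```python
import hashlib

repo = {}   # chunk hash -> chunk bytes, shared by every snapshot and every machine

def backup(chunks, repo):
    """Record a snapshot as a list of chunk hashes; chunks already present are reused."""
    snapshot = []
    for c in chunks:
        h = hashlib.sha256(c).hexdigest()
        repo.setdefault(h, c)
        snapshot.append(h)
    return snapshot

laptop_2022 = backup([b"report-v1", b"holiday-photos"], repo)
server_2024 = backup([b"holiday-photos", b"db-dump"], repo)  # other machine, years later
print(len(repo))   # 3 unique chunks: "holiday-photos" is stored once, referenced twice
```

There is no notion of a "full" versus "incremental" snapshot; every snapshot is complete, and storage is shared wherever chunks happen to match.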
Object storage systems are well supported. A few are built-in, others can be used via the rclone layer.
Glacier and S3 GDA tiers are not terribly practical for use with restic. There are a ton of caveats, which I tried to document in a forum post.
Restic’s repository format is not designed to be used directly, without the aid of a tool that understands the format. At a minimum, you would need to decrypt the data. Then you have to unravel the tree structure to figure out which blobs can be concatenated to rebuild a given file, and then go find those blobs.
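A toy sketch of that restore path, skipping the encryption layer entirely (the `put`, `tree`, and `restore` names are invented for illustration; real restic also packs blobs into pack files):

```python
import hashlib

# Toy repository: blob ID -> blob bytes. Real restic encrypts every blob
# and groups them into pack files, so this omits the decryption step.
blobs = {}

def put(data: bytes) -> str:
    h = hashlib.sha256(data).hexdigest()
    blobs[h] = data
    return h

# A "tree" entry for one file is just the ordered list of its blob IDs.
tree = {"bigfile.bin": [put(b"first-chunk-"), put(b"second-chunk")]}

def restore(name: str) -> bytes:
    # Rebuilding a file means walking the tree, then fetching and
    # concatenating every referenced blob in order.
    return b"".join(blobs[h] for h in tree[name])

print(restore("bigfile.bin"))   # b'first-chunk-second-chunk'
```

Nothing in the object namespace resembles `bigfile.bin`; without the tree metadata (and the decryption key), the blobs are just opaque hashed objects.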
However, note that you can use the `restic mount` command to mount a locally-browsable filesystem representation of the repository. This can be used to selectively `cp` files or whole directories out of snapshots.
We use restic in exactly this scenario. It works quite well.
Restic has you covered here.
This is not directly possible, but `restic mount` makes it straightforward.
Restic can do this.
In summary, the only things you’ve presented that restic doesn’t do are:
- File-level deduplication, which is generally inferior to block-level deduplication, so this is a bit of a weird requirement.
- Allow accessing backed-up files directly in the repository. Restic works a lot like Git in that content is stored in a content-addressable structure, so the hash of each object is its name. `restic mount` is a workaround that may work in your case.
Note that if you want to use object storage like S3, you cannot have both deduplication and direct access to backed-up files. S3/B2/GCS do not support symlinks or hard links; you would necessarily have to duplicate a file to make it available under multiple keys. Your list of requirements is therefore self-contradictory and can’t all be met. There is no tool that could implement both deduplication and direct access on object storage services. This is a limitation of object storage, not the tool.
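To see why, here is a toy model of a flat object store, with invented names (`ObjectStore`, `put`, `total_bytes`); it is not S3's actual API, just the essential shape of a key/value namespace with no link support:

```python
# Toy flat object store: keys mapped to full object bytes, nothing else.
class ObjectStore:
    def __init__(self):
        self.objects = {}

    def put(self, key: str, data: bytes):
        self.objects[key] = data

    def total_bytes(self) -> int:
        return sum(len(v) for v in self.objects.values())

payload = b"x" * 1000
s = ObjectStore()
# Direct access: every path a user should see must hold a full copy,
# because a flat namespace has no symlinks or hard links to share bytes.
s.put("backups/host-a/movie.mkv", payload)
s.put("backups/host-b/movie.mkv", payload)
print(s.total_bytes())   # 2000: the duplicate is physically stored twice
```

A deduplicating tool avoids the second copy precisely by storing data under content hashes rather than browsable paths, which is the trade-off described above.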
The closest thing you could probably get to the kind of deduplication you are looking for, with direct access to files, would be to run everything on btrfs and use send/receive to send snapshots to a central server. You could periodically offline-deduplicate the backup server. Then you’d have deduplication under the hood, but direct access to files. Deleting old snapshots would be as simple as running `btrfs subvolume delete` on them. However, this can’t run on object storage services.
I would posit that restic checks enough boxes that you should at least consider it. It at least checks all the boxes that don’t contradict each other.