Pre-usage questions

I’m in search of the elusive perfect backup solution. Did I find it? Can you all tell me if this is possible?

  1. Can restic do file level dedup or is it block level only?
  2. Does restic store the file/block hash locally?
  3. If the hash is local, how do backups from other machines benefit by deduplication? Does it have to read every file/block in the backup to determine hashes and then dedup?
  4. If the dedup is file level, can I go to the backup destination and see the actual files and restore a file without using the restic client?

For me, a “perfect backup solution” would be cross-platform (Win/Linux), with file-level deduplication and the hash stored on the destination (or maybe in a Google/Amazon database store of some sort). The client would support storing backups in different sub-folders and support date/time-based differentials, such that yesterday’s backup would be a sub-folder named with yesterday’s datetime and today’s would have today’s. The destination could be Google Cloud Storage, Amazon S3, or Glacier. And you could do a restore by going to the backup and grabbing the files you need. What this also implies is that there are pointers for deduplicated files, so that today’s backup folder would appear to contain the full backup but, from a storage perspective, would only take up space for new, non-duplicated files and have pointers for the rest.

And to make it even MORE challenging, I’d like a process that could run every month or so that would first prune old daily backups, leaving monthly backups intact. And I’d like a way to move old monthly backups (> 1 year) to a cheaper storage tier (like Google Coldline). The latter would actually be possible via script, so long as all the other requirements are met. Pruning would need to be in the app.

What scenario would drive me to want these features? I host websites for the clients I do dev work for. I have multiple web and database servers, and I know there is a ton of duplication (especially with all the WordPress sites). I want to maintain point-in-time backups using the least amount of space (without huge overhead) and be able to easily restore or copy a file or directory without having to go through a client. And due to cost and reliability, I want to use Google/AWS storage services. I know, this is probably a pipe dream.

So if restic can’t do this, does anyone have any solutions that may? Or is there a way to use something else with restic to do what I’m looking for?


Restic does block-level deduplication, which is generally superior to file-level deduplication.

Hashes are stored in the data pack files as well as indexes within the repository. The indexes can be cached locally for faster lookup, but a cache is not required; deduplication will still happen without it.

You couldn’t restore files directly from the backup destination even with file-level dedupe, because the repository is encrypted and content-addressable; the name of each object is the hash of its contents.

Restic is cross-platform (Windows, Linux, macOS, and more).

File-level dedupe with the hash stored on the destination doesn’t make sense; you really want block-level dedupe. It is substantially more efficient, and deduplication even works intra-file, not just inter-file.

All relevant data, including hashes, is stored in the repository.

Restic doesn’t organize backups into sub-folders, but it supports free-form snapshot tags, which might work for you.
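For example, each backup can be tagged with the site or client it belongs to, and snapshots can then be filtered by tag (paths and tag names below are placeholders):

```shell
# Tag each backup with the site/client it belongs to.
restic backup --tag web01 --tag wordpress /var/www/client-a

# List only the snapshots carrying a given tag.
restic snapshots --tag web01

# Restore the latest snapshot that has that tag.
restic restore latest --tag web01 --target /tmp/restore
```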

I’m not entirely sure what you mean here. Restic doesn’t do incremental or differential backups; every snapshot is logically a full backup, and deduplication is repository-wide. Deduplication can share chunks with files from a totally different machine in snapshots taken years apart.
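A rough sketch of what that looks like in practice (hostnames, bucket, and paths are made up): two machines back up to the same repository, and the second upload only adds chunks the repository hasn’t seen before.

```shell
# On web01: the first backup uploads all chunks.
restic -r s3:s3.amazonaws.com/my-backups backup /var/www

# On web02: largely identical WordPress files hash to chunks
# already in the repository, so they are not uploaded again.
restic -r s3:s3.amazonaws.com/my-backups backup /var/www

# Show the total size of the unique data actually stored.
restic -r s3:s3.amazonaws.com/my-backups stats --mode raw-data
```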

Object storage systems are well supported. A few backends are built in; others can be used via the rclone layer.
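For instance, initializing a repository on S3 or Google Cloud Storage looks like this (bucket and remote names are placeholders; credentials come from the usual environment variables):

```shell
# Amazon S3 (needs AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
restic -r s3:s3.amazonaws.com/my-backup-bucket init

# Google Cloud Storage (needs GOOGLE_PROJECT_ID and
# GOOGLE_APPLICATION_CREDENTIALS pointing at a service account key).
restic -r gs:my-backup-bucket:/ init

# Anything else rclone can talk to works through the rclone backend.
restic -r rclone:myremote:backups init
```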

Glacier and the S3 Glacier Deep Archive tiers are not terribly practical for use with restic. There are a ton of caveats, which I tried to document in a forum post.

Restic’s repository format is not designed to be directly used without the aid of some tool that understands the format. At a minimum, you would need to be able to decrypt the data. Then you have to unravel the tree structure to figure out what blobs can be concatenated to rebuild the file, and then you have to go find those blobs.

However, note that you can use the restic mount command to mount a locally-browsable filesystem representation of the repository. This can be used to selectively cp files or whole directories out of snapshots.
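Something like this (the mountpoint is arbitrary):

```shell
mkdir -p /mnt/restic
restic mount /mnt/restic   # runs in the foreground

# In another terminal: snapshots appear as ordinary directories.
cp -r /mnt/restic/snapshots/latest/var/www/client-a /tmp/restore/

# Ctrl-C the mount (or `umount /mnt/restic`) when done.
```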

We use restic in exactly this scenario. :slight_smile: It works quite well.

Restic has you covered here: every snapshot is a point-in-time view, and repository-wide deduplication keeps the storage overhead small.
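In particular, the monthly “prune the dailies, keep the monthlies” process from your wish list maps directly onto restic’s retention flags (the counts below are just examples):

```shell
# Keep the last 7 daily and 12 monthly snapshots, forget the rest,
# and delete the data that only the forgotten snapshots reference.
restic forget --keep-daily 7 --keep-monthly 12 --prune
```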

Grabbing files without going through a client is not feasible, but restic mount makes it about as painless as possible.

Restic can do this; both Google and AWS storage services are supported.

In summary, the only things you’ve presented that restic doesn’t do are:

  • File-level deduplication, which is generally inferior to block-level deduplication, so this is a bit of a weird requirement.
  • Allow accessing backed-up files directly in the repository. Restic works a lot like Git in that content is stored in a content-addressable structure, so the hash of each object is its name. restic mount is a workaround that may work in your case.

Note that if you want to use object storage like S3, you cannot have both deduplication and direct access to backed-up files. S3/B2/GCS do not support symlinks or hard links; you would necessarily have to duplicate a file to make it available under multiple keys. Your list of requirements is therefore self-contradictory and can’t all be met. There is no tool that could implement both deduplication and direct access on object storage services. This is a limitation of object storage, not the tool.

The closest thing you could probably get to the kind of deduplication you are looking for with direct access to files would be to run everything with btrfs and use send/receive to send snapshots to a central server. You could periodically offline-deduplicate the backup server. Then you’d have deduplication under the hood, but direct access to files. Deleting old snapshots would be as simple as btrfs subvolume delete on them. However, this can’t run on object storage services.
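A sketch of that btrfs setup, with made-up host, subvolume, and snapshot names (duperemove is one example of an offline btrfs deduplicator):

```shell
# On each web server: take a read-only snapshot and ship it
# incrementally relative to the previous one.
btrfs subvolume snapshot -r /srv/www /srv/.snap/www-2019-05-01
btrfs send -p /srv/.snap/www-2019-04-30 /srv/.snap/www-2019-05-01 \
    | ssh backup-host btrfs receive /backups/web01

# On the backup server, periodically deduplicate offline.
duperemove -rdh /backups

# Dropping an old snapshot is just:
btrfs subvolume delete /backups/web01/www-2019-04-30
```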

I would posit that restic checks enough boxes that you should at least consider it. It at least checks all the boxes that don’t contradict each other.