Restic copy and deduplication

I’m trying to figure out the perf implications of this statement in the “restic copy” help:

Note that this will have to read (download) and write (upload) the
entire snapshot(s) due to the different encryption keys on the source and
destination, and that transferred files are not re-chunked, which may break
their deduplication.

So if my source snapshot is 1TB worth of the same 1KB file repeated over and over, a copy cmd would:

  1. expand/decrypt/download entire 1TB source snapshot (hopefully incrementally) to localhost
  2. reapply dedup algorithm to src snapshot
  3. upload deduplicated data to dst repo

Is this about right?

So the main inefficiency is that, even though the entire src snapshot is deduplicated to <<1TB, the operation will have to download the expanded version and re-dedup it. The bandwidth usage between src and localhost is 1TB, and between localhost and dst is only the dedup’ed size.

And because the two repos don’t share any dedup information, this entire process will happen each time a copy like this takes place, right?

I’m considering using copy to replicate a local repo to the cloud, but because of these (possible) issues, it doesn’t seem like a good fit.

What if both repos share the same encryption key? Does something a bit more sane happen? If the keys are the same, is it possible to avoid the expand/dedup?

What’s been proposed before is to rclone/rsync the local repository to your cloud provider. That’ll be much more efficient if you’re OK with having two identical copies of your repository. Bear in mind that also has drawbacks, like syncing/replicating corrupted packfiles, broken source repos, etc, if something bad happens to your local repo eventually.

and that transferred files are not re-chunked, which may break
their deduplication.

That means if you have big-video.mpeg that you backed up to your local repo, and you then back it up directly to your cloud repository after the restic copy, it won’t benefit from restic’s deduplication. Restic will most likely have to upload all the chunks again, so big-video.mpeg ends up essentially stored twice in your cloud repo: once via restic copy and once via the direct backup.

There are a few more details in https://github.com/restic/restic/issues/323

We have this setup and it’s possible to copy files between repos. You can check my question and conclusions on this thread. But in that case (since the chunk sizes will be slightly different) big files (+512kb) will only dedup on the creation repo*. My main question was about consistency of the data.

* This is my assumption, considering this info. I’d be glad if someone corrects it.

Will this hold true if --copy-chunker-params is used to init the second repo? If I understand it correctly, that would mean that the same data gets chunked the same way, so there’s at least a possibility that copy would only upload the deduplicated data.

You mean filesystem copy operations with cp/rsync/rclone/whatever, right? I would be hesitant to do that, as it seems fragile compared to first-class restic commands. Is there buy-in from the developers that these types of operations will work in the future?

Replying to myself, this seems to be correct. From docs:

With the same parameters restic will for both repositories split identical files into identical chunks and therefore deduplication also works for snapshots copied between these repositories.

And then it explains how to use --copy-chunker-params to accomplish this on init.

So I’m going to re-init the local repo with the remote params, because re-uploading over USB is going to be much faster than using my internet connection.

Copying or syncing a repository, which is made up of files in a filesystem or objects in an object storage, should be perfectly fine, as long as it’s done in a complete way and without something else interfering. E.g. if no one is using it and you sync it so you have an identical copy at the destination, you should be able to use that copy just like the first one. From restic’s perspective, it’s just the URL to the repository that changed, since everything else in it is the same.

Not sure what buy-in you’re looking for - as long as the repository is just files or similar, the above should be fine. There’s nothing for the developers to make promises about :)

I think I conflated two things I’ve been reading about recently:

  1. rsync/rclone/cp duplication of every bit in a repo from one place to another. Only worry here is preventing modification of the repo while the copy is in progress, and that’s not really restic’s job.

  2. poking into the internal repo structure, and copying a subset to accomplish something (repo merge, snapshot copy, etc)

I agree with you about #1, it seems pretty low risk. #2 is what scares me.

That’s correct, AFAIK. If both repos use the same chunker_polynomial (set via --copy-chunker-params at init; you can check it with restic cat config), deduplication should work the same way in both repos. Though I haven’t used that feature myself.

wrt rsync/rclone/cp, there’s nothing to be afraid of IMO, as @rawtaz said. I don’t have a lot of experience with restic but what I’ve observed and learned so far is that restic has a very simple repository layout and most of the data is either immutable (packfiles, snapshot files, etc) or can be rebuilt (indices), so chances of breaking it syncing data from one place to another are very small, and it can “easily” be repaired if you have packfiles in multiple places or the source files still around. It’s far different from, let’s say, an active database with an efficient but opaque data structure where records are being added/removed/updated frequently.
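One way to act on the "immutable packfiles" point after an rsync/rclone run is to check the copy yourself. The sketch below assumes a repository on a local filesystem with a data/ directory, and relies on restic naming each pack file after the SHA-256 hash of its content. The function name find_damaged_packs is made up for this example; restic check --read-data is the first-class way to do a full verification.

```python
import hashlib
from pathlib import Path


def find_damaged_packs(repo_dir):
    """Return files under repo_dir/data whose name is not the SHA-256 of their content.

    Assumption: repo_dir is a restic repository (or a synced copy of one)
    stored as plain files, with pack files below a data/ directory.
    """
    damaged = []
    for path in sorted(Path(repo_dir, "data").rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as f:
            # Read in 1 MiB blocks so large pack files do not need to fit in memory.
            for block in iter(lambda: f.read(1 << 20), b""):
                digest.update(block)
        if digest.hexdigest() != path.name:
            damaged.append(path)
    return damaged
```

Calling find_damaged_packs on the synced copy and getting an empty list back means every pack file still matches its name. It says nothing about missing files, so it is a complement to restic's own checks, not a replacement.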

That said, trusting restic to do restic things ain’t a bad strategy either, albeit somewhat inefficient for replicating an entire repository to a different place.

Just to clarify, as it seems to me, you understood something wrong here:

The copy command does not re-chunk the data! In your example it would only transfer the 1KB once + the metadata.
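To make that concrete, here is a toy model, not restic's actual code. It assumes each 1KB file is stored as a single blob and that blobs are identified by the SHA-256 of their plaintext (which is how restic names blobs, so the ID survives re-encryption with a different key). ToyRepo and copy_snapshot are hypothetical names, and the example uses 1000 copies of the file instead of 1TB worth.

```python
import hashlib


def blob_id(data: bytes) -> str:
    # Blob ID = SHA-256 of the plaintext, independent of the repo's encryption key.
    return hashlib.sha256(data).hexdigest()


class ToyRepo:
    """Hypothetical stand-in for a repository: a content-addressed blob store."""

    def __init__(self):
        self.blobs = {}  # blob ID -> blob bytes

    def backup(self, files):
        """Store each file as one blob (no chunking here); return the snapshot as a list of blob IDs."""
        snapshot = []
        for data in files:
            bid = blob_id(data)
            self.blobs.setdefault(bid, data)  # identical content is stored once
            snapshot.append(bid)
        return snapshot


def copy_snapshot(src, dst, snapshot):
    """Copy the blobs a snapshot references from src to dst; return bytes transferred."""
    transferred = 0
    for bid in set(snapshot):
        if bid not in dst.blobs:  # blobs already present in dst are skipped
            dst.blobs[bid] = src.blobs[bid]
            transferred += len(src.blobs[bid])
    return transferred


one_kb_file = b"x" * 1024
src, dst = ToyRepo(), ToyRepo()
snap = src.backup([one_kb_file] * 1000)

logical_size = 1000 * len(one_kb_file)
first_copy = copy_snapshot(src, dst, snap)
second_copy = copy_snapshot(src, dst, snap)
print(logical_size, first_copy, second_copy)
```

In this model the first copy moves a single 1KB blob and a repeated copy moves nothing, because the destination already holds that blob ID. Metadata (trees, the snapshot itself) is left out of the toy.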

What is meant by “breaking” deduplication is this: if you add this 1TB again directly to your destination repo (or if it has already been added there), it will be chunked using the chunking parameters of your destination repo (probably again amounting to only a couple of KB), and those chunks will be saved in addition to the chunks copied from the source repo. So if the chunking parameters do not match and you have the same data both copied from a source repository and added directly, you may end up with a repository that occupies about double the size.

So, using identical chunking parameters is always a good idea, but if you don’t, you won’t get extremely bad results…
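A rough simulation of the "about double the size" effect, under loud assumptions: the chunker below is a toy content-defined chunker salted with an integer param, standing in for restic's real chunker and its polynomial, and the repo is just a dict keyed by SHA-256. All names (chunk, store, copy_then_backup) are made up for this sketch.

```python
import hashlib
import random


def chunk(data, param, window=16, mask=0x3FF, min_size=64):
    """Toy content-defined chunker (NOT restic's algorithm).

    Cuts after position i when a hash of the preceding `window` bytes,
    salted with `param`, has its low bits zero. Different params give
    different cut points for the same data.
    """
    chunks, start = [], 0
    salt = param.to_bytes(8, "big")
    for i in range(window, len(data) + 1):
        if i - start < min_size:
            continue
        digest = hashlib.sha256(salt + data[i - window:i]).digest()
        if int.from_bytes(digest[:4], "big") & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def store(repo, chunks):
    # Content-addressed storage: an identical chunk is kept only once.
    for c in chunks:
        repo.setdefault(hashlib.sha256(c).hexdigest(), c)


def copy_then_backup(data, src_param, dst_param):
    """Return bytes stored in dst after a copy followed by a direct backup of the same data."""
    dst = {}
    store(dst, chunk(data, src_param))  # chunks brought over by copy, not re-chunked
    store(dst, chunk(data, dst_param))  # later direct backup, chunked with dst's params
    return sum(len(c) for c in dst.values())


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(50_000))

same_params = copy_then_backup(data, src_param=1, dst_param=1)
different_params = copy_then_backup(data, src_param=1, dst_param=2)
print(len(data), same_params, different_params)
```

With matching params the direct backup produces exactly the chunks that copy already brought over, so the stored size equals the data size. With different params almost no chunk lines up, and the stored size approaches twice the data size, which matches the "not extremely bad, but about double" description above.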

Just to clarify, as it seems to me, you understood something wrong here:

The copy command does not re-chunk the data! In your example it would only transfer the 1KB once + the metadata.

Oh, this is even worse than I thought! So the params just determine how a file is split into chunks, but are not required to reassemble it? That way the destination can expand the foreign chunked data, but can’t dedup it against its existing chunks.

This side effect doesn’t seem very obvious from the copy help message:

Note […] that transferred files are not re-chunked, which may break
their deduplication.

I guess reading it again, the storage penalty could be implied from this, but it seems significant enough that it should be explicitly stated. Copying the same dataset from multiple repos (with different params) could chew up master repo storage quickly.

Would it make sense for copy to throw an error by default if the chunker params don’t match between repos? And then permit it with an --i-know-what-im-doing flag.

We intended to merge some repositories with copy but it was really slow (10-30min per snapshot for ~2000 snapshots). Therefore I suggest you try copy on 2 local repositories before you make up a backup plan that involves using copy on a regular basis.
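For a sense of scale, the figures above can be multiplied out. This assumes snapshots are copied one after another at the quoted per-snapshot time, which is only a rough reading of the numbers in this post; total_days is a helper name invented for the example.

```python
def total_days(snapshots, minutes_per_snapshot):
    """Total wall-clock days if snapshots are copied sequentially."""
    return snapshots * minutes_per_snapshot / 60 / 24


for minutes in (10, 30):
    print(f"{minutes} min/snapshot x 2000 snapshots = {total_days(2000, minutes):.1f} days")
```

That works out to roughly two to six weeks of continuous copying for the ~2000 snapshots, which is why a small trial on two local repos is worth doing first.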