Directly copying restic data between repositories

gurkan · July 1, 2020, 9:44am

Hi again,

I am currently using this PR to get the copy feature. It works OK, but I wasn’t able to move ~238Gb of data within 6 hours locally, which I assume is caused by the decrypt/re-encrypt step mentioned in the initial PR message.

Since I need this kind of feature to provide HA backup endpoints (so the backup service can continue while pruning), I now have another approach, based on the idea in this message from @cdhowie.

Instead of creating 2 separate repositories, I can create one repository and clone it while it is still empty (it only has the config file and one key under the keys folder) to serve as the backup repository.

Are there any risks if I just rsync all the files from one repository to the other after a while? I will have files under the index, snapshot and data folders, which I assume won’t conflict (or, where names do collide, they are the same file anyway).

My quick test shows no issues decrypting the moved data; prune reports a little duplicate data (which is expected) and repacks it. The check command shows no issues.

But other than bloating the repository, is there any real risk (e.g. bricking something)?

Thanks!

cdhowie · July 1, 2020, 1:36pm

None that I can think of… just be careful not to use --delete.

alexweiss · July 1, 2020, 7:50pm

As @cdhowie already said: There is absolutely no problem with “merging” (multiple) repositories by just copying the files from snapshot/, index/ and data/ together, as long as all those repositories are sane and the same encryption key is used (i.e. as long as you use identical key files or key files that were all generated from within one repository).

If your keys do not fit, restic will not be able to decrypt part of the files and will abort.

It is good advice to also use an identical config file; deduplication between the repos to be merged then works much better, because identical big files will be split into identical chunks.
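The two preconditions above (same key, ideally the same config) can be checked before merging. This is a minimal sketch, assuming both repositories are plain local directories with a config file and a keys folder; the function names are hypothetical and not part of restic. The key check is deliberately conservative: it only accepts repositories that share at least one byte-identical key file, so it may reject repositories whose different key files were nevertheless generated from within one repository.

```python
from pathlib import Path


def same_config(repo_a, repo_b) -> bool:
    """Return True if both repositories have byte-identical config files."""
    return (Path(repo_a) / "config").read_bytes() == (Path(repo_b) / "config").read_bytes()


def shared_key_files(repo_a, repo_b) -> set:
    """Return the names of key files that exist in both repos with identical content."""
    keys_a = Path(repo_a) / "keys"
    keys_b = Path(repo_b) / "keys"
    shared = set()
    for key in keys_a.iterdir():
        other = keys_b / key.name
        if key.is_file() and other.is_file() and other.read_bytes() == key.read_bytes():
            shared.add(key.name)
    return shared


def safe_to_merge(repo_a, repo_b) -> bool:
    """Conservative check: identical config and at least one identical key file."""
    return same_config(repo_a, repo_b) and bool(shared_key_files(repo_a, repo_b))
```

If safe_to_merge returns False because only the config differs, the merge can still work as described above, but deduplication between the repositories will be less effective.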

However, note that you can run into caching issues (i.e. files in the cache being loaded and deleted again) if you access many of those repos from one computer. The cache is bound to the repository ID, which is also saved in the config file. If you only access your “merged” repository, files will be added to the cache as necessary and that is it.

Again, as @cdhowie already mentioned, this deduplication only applies when pruning the “merged” repository, of course.

Just note that once you have run a prune on the “merged” repository, you should not do another backup into one of the original repositories and then copy the newly generated files into the “merged” repository. The prune might have removed data that is still referenced in one of the original repositories and that might be used by the newly generated snapshot!

It is of course no problem to merge a pruned repository containing many snapshots with another repository that just contains 1 snapshot.

gurkan · July 2, 2020, 6:26am

Thank you very much for the answers, this method is going to be way more efficient.

Now my only sensitive case is this: by design, the second repository might be receiving a new backup while I want to merge it into the main one, so I am planning to:

Check if there is any lock on 2nd repo (could only mean there is a backup going on)
If yes: copy all snapshot/index/data to the 1st repo
If no: move all snapshot/index/data to the 1st repo
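The decision above could be sketched as follows. This assumes a local repository directory where lock files live in the locks folder; has_locks and choose_action are hypothetical helper names, not restic commands. Note that the check is only a point-in-time observation: nothing here prevents a backup from starting right after the check returns.

```python
from pathlib import Path


def has_locks(repo) -> bool:
    """Return True if the repository's locks folder contains any entry."""
    locks = Path(repo) / "locks"
    return locks.is_dir() and any(locks.iterdir())


def choose_action(repo) -> str:
    """Copy when a lock is present (a backup may be running), otherwise move."""
    return "copy" if has_locks(repo) else "move"
```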

Afaik* no re-packing/overwriting happens while a new backup is being made.
So (theoretically) worst case: I’ll catch the 2nd repo while a new backup is running and some “not yet used” data blobs are going to be copied to the 1st repo.

The whole loop is run frequently (at 1-2 day intervals), so the next run will iron things out anyway.

*(Please tell me if I am wrong.)

MichaelEischer · July 11, 2020, 11:10am

A new backup does not overwrite or delete any files; however, the upload of new pack files during a backup run does not happen atomically. That is, if you copy a pack file at the wrong point in time, you end up with a half-written pack file in the 1st repo. That shouldn’t cause any problems as long as no snapshot depends on that file, and a later copy/move should fix the problem.
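One way to spot such a half-copied file is sketched below. It relies on restic naming the files in data, index and snapshots after the SHA-256 hash of their contents, so a truncated copy no longer matches its own file name. The helper names is_complete and find_incomplete are hypothetical, and the sketch assumes a local directory.

```python
import hashlib
from pathlib import Path


def is_complete(path) -> bool:
    """Return True if the file's SHA-256 digest equals its file name."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large pack files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == Path(path).name


def find_incomplete(directory) -> list:
    """Return all files below directory whose content does not match their name."""
    return sorted(
        p for p in Path(directory).rglob("*") if p.is_file() and not is_complete(p)
    )
```

Running find_incomplete on the destination's data folder after a copy lists the files that would need to be copied again.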

As I’m not completely sure whether the non-alphabetical order of snapshot/index/data is on purpose or just a coincidence: This is the order in which you must move/copy these directories, to ensure that all copied/moved snapshots are complete. edit: see cdhowie’s reply

There’s also a race condition here: There’s nothing stopping restic from starting a new backup run after you’ve checked for a lock. This could cause problems if the new backup run starts before moving the files has completed. As there’s currently no command to create fake lock files, you could try to deny the backup run access to the lock directory as a workaround.

cdhowie · July 12, 2020, 5:43am

tl;dr:

Copy data then index then snapshot.
DO NOT copy from a source that has ANY concurrent activity
- (unless you are copying from an LVM/btrfs/ZFS snapshot of that source)
DO NOT copy to a destination that has ANY EXCLUSIVE concurrent activity (non-exclusive is okay).

This actually is not the correct order. The order should be:

data
index
snapshot

If index is copied before data and the copy is interrupted, future backups will deduplicate against data that isn’t actually there, generating corrupt backups.

If snapshot is copied before data/index and the backup is interrupted, restores will fail as dependent blobs/indexes won’t be found.

This also ensures that any concurrent non-exclusive activity against the destination repository is safe. Concurrent processes might see (and ignore) a partial file but they shouldn’t fail.
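A minimal sketch of this ordering, assuming both repositories are local directories and that the source has no concurrent activity. The function name merge_into is hypothetical. Files that already exist in the destination are skipped and nothing is ever deleted there, which mirrors the earlier advice not to use --delete; the sketch does not repair partial files already present in the destination.

```python
import shutil
from pathlib import Path

# Data packs first, then the indexes that reference them, then the snapshots.
COPY_ORDER = ("data", "index", "snapshots")


def merge_into(src, dst) -> list:
    """Copy missing files from src to dst in a safe order; return what was copied."""
    src, dst = Path(src), Path(dst)
    copied = []
    for top in COPY_ORDER:
        src_dir = src / top
        if not src_dir.is_dir():
            continue
        for source_file in sorted(p for p in src_dir.rglob("*") if p.is_file()):
            rel = source_file.relative_to(src)
            target = dst / rel
            if target.exists():
                continue  # never overwrite or delete anything in the destination
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source_file, target)
            copied.append(rel.as_posix())
    return copied
```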

Exclusive activity on the destination (such as check or prune) will almost certainly react poorly to the presence of incomplete files. Prune in particular could delete data that is in use if it saw the data pack containing blobs but not the snapshot referencing those blobs (because it was not yet copied when prune looked at snapshots).

This order becomes problematic when a concurrent process is running on the source. Consider a backup:

The backup might generate an index file containing a data file that the copy missed because it copied the directory containing the referenced pack before the pack was created, and now future backups will deduplicate data that isn’t there.

Likewise, if the backup finishes during the copy, the snapshot file will be copied over even though the referenced data may not be there for the same reason.

This would hint to copy files in the opposite order, but then the destination repository is effectively broken until the copy completes.

There is no bulletproof way to copy from a repository that is being written to, except to take an atomic snapshot of the whole thing (LVM, btrfs, or ZFS snapshots would work) and copy from the snapshot.

gurkan · July 12, 2020, 8:04am

I have an external lock which guarantees this won’t happen. Immediately after the prune on the 1st repo is done, no new backup is routed into the 2nd repository. The copy/move actions come a bit after that.

Hmm, my order was a bit naive: data, snapshots, then index. I thought indexes were a bit more important, but you’re right.

Thank you both, now I understand a lot better.

Since even if I copy in the right order, unreferenced data files might be a problem, I’ll just skip the whole copy logic and only move files if I don’t find any lock on the source. This could be only problematic on very active repositories where it’s hard to find no locks, but I think it’ll be OK to skip once in a while, considering this whole lock/prune/unlock/move loop runs every 1-2 days.

As an extra question to @cdhowie: wouldn’t even a block-level/FS snapshot also be risky if there is a backup going on into the source repository? I mean, theoretically we could also take the snapshot just while a data file is being written. Or is that handled at a lower level so that it’s somewhat guaranteed to be atomic? (But I’m not sure how this would work.)

MichaelEischer · July 12, 2020, 3:27pm

Oh, you’re right, the order of operations I suggested would cause the destination repository to be (temporarily) in an inconsistent state. I was just thinking about selecting a consistent source data set but forgot to look at the effect on the destination repository.

By combining both orders it should actually be possible to copy from a source repository with concurrent backup runs while also ensuring that the target repository is never in an inconsistent state: List all files in the snapshots, index and data directories (in exactly that order), then copy these files in the inverse order (i.e. data, index and snapshots). That way the copy operation will include at least all data packs that existed when a certain index was written. The same also applies to the index and snapshots. If the source repository contains a file that is larger than its counterpart in the destination directory, then the file in the destination directory is incomplete and should be copied again from the source repo.

That order ensures that all blobs referenced by a snapshot already exist when a snapshot is copied. If the index files end up missing, then rebuild-index could regenerate those. So this won’t break the repository but could end up being a nuisance.
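The combined procedure could look roughly like this. It is only a sketch under the assumption that both repositories are local directories; list_files, needs_copy and sync are hypothetical names. The listing is taken once, in the order snapshots, index, data, and the copy then walks that fixed listing in the inverse order. A destination file that is smaller than its source counterpart is treated as incomplete and copied again.

```python
import shutil
from pathlib import Path

# Order in which the source is listed; the copy runs in the reverse order.
LIST_ORDER = ("snapshots", "index", "data")


def list_files(src) -> dict:
    """List the files of each directory, visiting the directories in LIST_ORDER."""
    src = Path(src)
    listing = {}
    for top in LIST_ORDER:
        folder = src / top
        if folder.is_dir():
            listing[top] = sorted(
                p.relative_to(src) for p in folder.rglob("*") if p.is_file()
            )
        else:
            listing[top] = []
    return listing


def needs_copy(src_file: Path, dst_file: Path) -> bool:
    """Copy if the destination is missing or smaller (incomplete) than the source."""
    return not dst_file.exists() or src_file.stat().st_size > dst_file.stat().st_size


def sync(src, dst) -> list:
    """List first, then copy data, index and snapshots; return the copied paths."""
    src, dst = Path(src), Path(dst)
    listing = list_files(src)  # taken before any file is copied
    copied = []
    for top in reversed(LIST_ORDER):
        for rel in listing[top]:
            source_file, target = src / rel, dst / rel
            if needs_copy(source_file, target):
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(source_file, target)
                copied.append(rel.as_posix())
    return copied
```

Because only files from the initial listing are copied, a snapshot written after the listing was taken is left for the next run rather than being copied without its data.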

The current implementation is not (yet) optimized for performance. A single core of a modern CPU should be able to handle the hash/crypto operations fast enough to copy at least 100MB/s between repositories. I think the main bottleneck currently is loading the source blobs one after the other.
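To put the two numbers from this thread side by side, here is a small back-of-the-envelope calculation. It assumes that “~238Gb” means roughly 238 gigabytes (10^9 bytes each) and that “100MB/s” means 100 * 10^6 bytes per second; both readings are assumptions, not statements from the posts.

```python
SIZE_BYTES = 238 * 10**9       # assumed reading of "~238Gb"
SIX_HOURS_S = 6 * 3600         # the copy did not finish within this time
TARGET_RATE = 100 * 10**6      # assumed reading of "100MB/s"

# Since the copy did not finish in 6 hours, the observed rate was below this value.
observed_rate_ceiling = SIZE_BYTES / SIX_HOURS_S

# Time the same copy would take at the target rate.
minutes_at_target = SIZE_BYTES / TARGET_RATE / 60

print(f"observed rate was below {observed_rate_ceiling / 1e6:.1f} MB/s")
print(f"at the target rate the copy would take about {minutes_at_target:.0f} minutes")
```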

cdhowie · July 12, 2020, 5:59pm

No, because the backup operation only writes an index file after it’s written the packs referenced by the index, and the snapshot is written at the very end. A point-in-time snapshot of the source repository could, at worst, contain a half-written file. This is automatically fixed by prune and most commands (except maybe check) should ignore any incomplete files.

LVM, btrfs, and ZFS snapshots are all atomic. (If they weren’t, there wouldn’t be a point to them.)

In other words, a filesystem-level snapshot taken during a backup looks exactly like the restic backup process was killed before the backup completed, at the exact moment the snapshot was created.

gurkan · July 12, 2020, 7:17pm

Ah cool, then I’ll keep watching that feature. There might be another magical patch landing.

gurkan · July 12, 2020, 7:19pm

That was very clear, thanks. Then I can either go the snapshot route or look into hardlinks if the copy feature takes longer.