How are backups from different paths handled?

wojtek242 · October 15, 2022, 5:49pm

I have two questions regarding differing paths, but first let me explain the context.

I have a data directory, let’s call it /path/to/data. Within it I have ZFS datasets at /path/to/data/dataset1, /path/to/data/dataset2, and so on. For those not familiar with ZFS, a dataset is basically a directory and a snapshot is a filesystem snapshot.

I would like to back up their contents to an S3 bucket from their ZFS snapshots, as the actual datasets are used by live services which I don’t want to stop. The ZFS snapshots, on the other hand, are mounted read-only in some other directory. However, that path is different from the path of the actual dataset, and because I include the timestamp in the snapshot name, the path will be different for every single backup.

Question 1: If I back up different snapshots that originate from the same dataset, would that be as if I were backing up the original directory multiple times, with deduplication handling it smoothly? That is, I won’t have a separate complete backup for each snapshot path? Are there any other problems that may arise from using different paths for logically the same directory?

Question 2: What will happen if I try to back up snapshots from other datasets? Now not only will the path differ, but the contents will also be very different. Will I be able to restore a backup to each dataset independently of the others? Or should I just use different buckets?

MichaelEischer · October 16, 2022, 10:28am

Restic will deduplicate the backed up data. However, the metadata will likely be duplicated (see Restic 0.10.0 always reports all directories "changed", adds duplicate metadata, when run on ZFS snapshots · Issue #3041 · restic/restic · GitHub ). The forget command by default won’t be able to tell which snapshots should be grouped together, so you’ll probably have to tag the snapshots and use those tags to select which snapshots belong together. With different paths, the backup command won’t be able to use its parent snapshot mechanism, which means that restic has to fully read each file in the snapshot, even if it hasn’t changed. The data will still be deduplicated.
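
To illustrate the tagging approach, here is a minimal sketch in Python that only builds the restic command lines and does not run them. The repository URL, the snapshot mount path, and the helper names `backup_command` and `forget_command` are made up for the example; the retention policy (`--keep-last`) is likewise just a placeholder. Grouping by `host,tags` instead of the default `host,paths` is one way to keep the ever-changing snapshot paths from splitting the snapshots into separate groups.

```python
from typing import List


def backup_command(repo: str, dataset: str, snapshot_path: str) -> List[str]:
    """Build a `restic backup` call that tags the snapshot with its dataset name.

    `snapshot_path` is the read-only mount of the ZFS snapshot; it changes on
    every run, so the tag is what ties the restic snapshots together.
    """
    return ["restic", "-r", repo, "backup", "--tag", dataset, snapshot_path]


def forget_command(repo: str, dataset: str, keep_last: int) -> List[str]:
    """Build a `restic forget` call restricted to one dataset's tag.

    Snapshots are grouped by host and tags rather than by path, because the
    path differs for every backup.
    """
    return [
        "restic", "-r", repo, "forget",
        "--tag", dataset,
        "--group-by", "host,tags",
        "--keep-last", str(keep_last),
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    repo = "s3:s3.amazonaws.com/example-bucket"
    print(" ".join(backup_command(repo, "dataset1", "/mnt/snapshots/dataset1-2022-10-15")))
    print(" ".join(forget_command(repo, "dataset1", 7)))
```

The resulting lists can be passed to `subprocess.run` once the repository password and S3 credentials are supplied through the environment.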

Snapshots are independent of each other. If the datasets contain largely different data, then it’s probably best to just use separate repositories, as you won’t benefit from deduplication anyway. Separate repositories have the benefit of each containing a smaller index, which reduces the memory requirements of restic. It also isolates the repositories from one another: if one repository is damaged, the others are not affected.
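
A rough sketch of the one-repository-per-dataset layout, again in Python and again only building command lines. The bucket name, the use of a per-dataset prefix inside a single bucket, and the helper names `repo_for` and `init_command` are assumptions made for the example; separate buckets would work the same way with a different URL per dataset.

```python
from typing import Dict, List


def repo_for(bucket_url: str, dataset: str) -> str:
    """Derive a per-dataset repository location under a common bucket URL."""
    return f"{bucket_url.rstrip('/')}/{dataset}"


def init_command(repo: str) -> List[str]:
    """Build the `restic init` call that creates a new, empty repository."""
    return ["restic", "-r", repo, "init"]


if __name__ == "__main__":
    # Hypothetical bucket and dataset names, for illustration only.
    bucket_url = "s3:s3.amazonaws.com/example-bucket"
    repos: Dict[str, str] = {
        name: repo_for(bucket_url, name) for name in ("dataset1", "dataset2")
    }
    for name, repo in repos.items():
        print(name, "->", " ".join(init_command(repo)))
```

Each repository then has its own index and its own snapshots, so a restore or a forget run for one dataset never touches the others.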