Backfilling snapshots from non-restic archives


#1

I have various backup routines in place, some with nightly tarballs, some with rsync+hardlink snapshots…

I’ve been using restic instead for about a month now and am seeing great results from the deduplication, so I’d now like to merge all my older snapshots into my restic repository.

So in my restic repo right now I have:

$ restic snapshots
repository 4ae829a8 opened successfully, password is correct
ID        Date                 Host           Tags        Directory
----------------------------------------------------------------------
d7b4e7dc  2018-08-31 19:32:24  chris-desktop              /home/chris
196b9069  2018-11-05 21:21:16  chris-desktop              /home/chris
3503188d  2018-11-06 07:35:29  chris-desktop              /home/chris
b8a0ed0a  2018-11-07 07:35:18  chris-desktop              /home/chris
d36b7634  2018-11-08 07:35:05  chris-desktop              /home/chris
# ...
6f2f2d62  2018-12-05 07:35:09  chris-desktop              /home/chris
ab67016c  2018-12-06 07:35:06  chris-desktop              /home/chris
be494dfc  2018-12-07 07:35:11  chris-desktop              /home/chris
d9a75cf3  2018-12-07 18:47:15  chris-desktop              /home/chris
----------------------------------------------------------------------
35 snapshots

and:

$ ls -lh
total 2.3T
-rw-r--r--  1 root  root  2.1G Dec 30  2015 home.2015-12-30.tbz
-rw-r--r--  1 root  root  2.2G Jan  1  2016 home.2016-01-01.tbz
-rw-r--r--  1 root  root  3.0G Jan  8  2016 home.2016-01-08.tbz
-rw-r--r--  1 root  root  3.0G Jan 22  2016 home.2016-01-22.tbz
-rw-r--r--  1 root  root  3.8G Feb 27  2016 home.2016-02-27.tbz
-rw-r--r--  1 root  root  4.1G Mar  4  2016 home.2016-03-04.tbz
-rw-r--r--  1 root  root  6.2G Mar 11  2016 home.2016-03-11.tbz
-rw-r--r--  1 root  root  8.1G Mar 18  2016 home.2016-03-18.tbz
-rw-r--r--  1 root  root   27G Mar 25  2016 home.2016-03-25.tbz
# ...
-rw-r--r--  1 root  root   45G Jul 24 09:36 home.2018-07-24.tbz
-rw-r--r--  1 root  root   45G Jul 31 09:45 home.2018-07-31.tbz
-rw-r--r--  1 root  root   46G Aug  7 09:44 home.2018-08-07.tbz
-rw-r--r--  1 root  root   46G Aug 14 09:46 home.2018-08-14.tbz
-rw-r--r--  1 root  root   46G Aug 21 09:55 home.2018-08-21.tbz
-rw-r--r--  1 root  root   47G Aug 28 10:02 home.2018-08-28.tbz

It seems like I want to script extracting these .tbz files one at a time, snapshotting them, and then deleting them. I can override the date with --time, but it doesn’t look like there’s any way to override the directory recorded in the snapshot metadata. I’d like it to say /home/chris, but obviously I can’t extract there.

i.e. I want to snapshot from /mnt/backup/home.2015-12-30/ but record /home/chris as the path in the snapshot metadata
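
The extract, snapshot, delete loop described above could be sketched roughly as follows. This is a minimal sketch, not a tested migration script. The staging directory /mnt/backup/stage, the function names, and the RESTIC override variable are hypothetical; it assumes the archives are named home.YYYY-MM-DD.tbz as in the listing, that the repository and password are supplied through the environment, and that midnight is an acceptable timestamp since the file names carry no time of day.

```shell
#!/bin/sh
# Sketch only: STAGE, RESTIC and the function names are illustrative assumptions.
RESTIC="${RESTIC:-restic}"            # override for a dry run, e.g. RESTIC=echo
STAGE="${STAGE:-/mnt/backup/stage}"   # hypothetical scratch directory

# Turn "home.2015-12-30.tbz" into "2015-12-30 00:00:00" for restic's --time flag.
archive_time() {
    name=$(basename "$1" .tbz)        # e.g. home.2015-12-30
    printf '%s 00:00:00\n' "${name#home.}"
}

# Extract one archive into a clean staging directory, snapshot it, then clean up.
import_archive() {
    rm -rf "$STAGE" || return 1
    mkdir -p "$STAGE" || return 1
    tar -xjf "$1" -C "$STAGE" || return 1
    "$RESTIC" backup --time "$(archive_time "$1")" "$STAGE" || return 1
    rm -rf "$STAGE"
    # Delete "$1" yourself only after verifying the new snapshot.
}

# Usage (uncomment to process every archive in the current directory, oldest first):
# for f in home.*.tbz; do import_archive "$f" || break; done
```

Every snapshot produced this way records the staging directory as its path, which is exactly the limitation this thread is about.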

Is it true that the directory not matching wouldn’t impact the efficiency of the deduplication and/or speed of snapshots?

I could deal with the path not matching in this case for my desktop backups, but keeping those correct will matter a lot more when I port the repo where I have snapshots of several paths across a dozen application servers.

Any tips or links to prior art on anything like this would be appreciated!


#2

It can impact the speed; if restic can locate a prior “parent snapshot” then it looks at file metadata (size, mtime) to decide if it needs to inspect the file contents for changes. If there is no parent snapshot, restic must chunk and hash each chunk to determine if the data changed. This is CPU-inefficient on the client, but does not change the amount of new data in the repository.

I have a similar problem: my backup scripts use LVM snapshots to create atomic backups, which is important for e.g. database servers, but I don’t want the backups to include the LVM snapshot mountpoint path. To get around this, I simply chroot into the snapshot path and run restic from inside the chroot.

You could do something similar, and create a minimal chroot (using e.g. debootstrap if on Debian) and then you can extract those backups to /path/to/chroot/home/chris and backup from inside the chroot.
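
A rough sketch of that chroot approach, under several assumptions: the chroot lives at the hypothetical path /srv/restic-chroot, a restic binary is installed inside it, the repository is reachable from inside it (a remote backend or a bind mount), the repository settings come from environment variables, the script runs as root, and the archive’s top level holds the contents of the home directory. None of these names come from the posts above.

```shell
#!/bin/sh
# Sketch only: CHROOT_DIR and the function names are illustrative assumptions.
CHROOT_DIR="${CHROOT_DIR:-/srv/restic-chroot}"   # hypothetical minimal chroot
TARGET="/home/chris"                             # path the snapshot should record

# Where the archive has to be unpacked on the real filesystem.
extract_dir() {
    printf '%s%s\n' "$CHROOT_DIR" "$TARGET"
}

# Unpack one archive under the chroot and back it up from inside it (needs root).
# $1 = archive, $2 = timestamp for --time, e.g. "2015-12-30 00:00:00"
backup_in_chroot() {
    dest=$(extract_dir)
    rm -rf "$dest" || return 1
    mkdir -p "$dest" || return 1
    tar -xjf "$1" -C "$dest" || return 1
    # Inside the chroot restic only sees /home/chris, so that is the path it records.
    chroot "$CHROOT_DIR" restic backup --time "$2" "$TARGET"
}
```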

See also: https://github.com/restic/restic/issues/2092


#3

Ohh, I didn’t think about making a chroot, that’s a good idea for a workaround. I might try that for my bigger migration.

I’m already about halfway through a compromise solution where I extract each tbz one at a time to /mnt/backup/home before snapshotting and manually pass in the parent snapshot (though I’m not sure that’s helping with the fresh extracts)

This at least gets all my old snapshots onto the same path, though it doesn’t match my current snapshots or tell me what the actual backed-up path on the host was.

I found this older issue tracking the same request:


#4

A parent snapshot is just a reference point used for optimization. In order for it to be useful, the paths of the files in the parent snapshot must exactly match the paths in the files being backed up – at least for the files that haven’t changed.

The only thing a parent snapshot does is make the backup process more efficient. The final result of the backup operation should be 100% identical regardless of which snapshot you use as the parent, or if you even use one at all. It’ll just vary the amount of time the backup takes.

Thanks for pointing that out. I guess my issue is a duplicate. :)


#5

I’ve got my script using a chroot environment now (via habitat) to convert backups for my simple pile of .tbz archives on my home desktop, preserving the correct /home/chris path and force-feeding the correct parent to make sure this takes as little time as possible

Thanks for the chroot suggestion! It’ll get the job done for now


#6

You’re right: restic will detect that the files aren’t the same, and will read them anyway. So you can save yourself the trouble of passing in a parent snapshot. :)


#7

That is a great workaround, simple and straightforward, I like it!


#8

It’s still kind of a mess though, I’ve had to muck with ownership and permissions a lot. This is going to be a pretty common use case for me I think (migrating snapshot setups to restic) and a CLI option to overwrite the prefix path would be immensely easier to script

The strip-components proposal in #555 seemed overly complicated to me. Whatever directory you target restic’s backup command at is the root tree of the snapshot; you only need to be able to manually override what gets recorded as the (canonical) path to that root. As a user I should be able to tell restic’s backup command “this tree is /home/chris” separately from the on-disk path I’m feeding to restic.


#9

You’re right, that use case isn’t supported well. Having an option to write the correct path into the snapshot (that’s what snapshots prints) would make using the forget rules easier. In the long term, we need to rework this.


#10

If you have time to elaborate, I’m curious how deep the needed reworking would go. Would it not just be a UI-level enhancement?


#11

I would assume part of the complexity is that restic stores metadata for directories, too. If you are stripping off a bunch of directories and then adding a new prefix, how do we come up with the metadata for the “virtual” prefixed directories that don’t actually exist on disk?


#12

Shouldn’t it just record whatever metadata is in the tree it reads from disk? The directories do actually exist on disk, they just might be mounted at a path that does not reflect what they are a snapshot of.

While in my example case of extracting tar archives the directory metadata would be irrelevant (but may still be the best thing available), I could just as easily be mounting an actual disk image/snapshot where directory metadata would reflect the snapshotted filesystem at a point in time


#13

In the case of the linked feature request, it provides the ability to transform an on-disk path like /a/b/c to /z/c – so where does the metadata for / and /z come from?


#14

Well, I don’t agree with the --strip-components syntax, as the ability to strip more or less than the root of the backup creates some ambiguities and I don’t see the use case for it.

If you’re just replacing the prefix with another prefix, are directory entries still created?

e.g., if I have two files in my home directory one.txt and two.txt and run restic backup /home/chris, are you saying the snapshot looks like this:

date: 2018-12-01 12:00:00
host: chris-desktop
path: /home/chris

home
└── chris
    ├── one.txt
    └── two.txt

Rather than just like this:

date: 2018-12-01 12:00:00
host: chris-desktop
path: /home/chris

.
├── one.txt
└── two.txt

#15

Yes, it looks like the top structure.

Incidentally, I just checked and metadata for / is not stored.


#16

Interesting. During restore, will restic overwrite any existing metadata for those prefix path components? i.e. if I’ve tweaked permissions on /home and then restore /home/chris, will the metadata on /home be modified, or is the metadata only used when the path does not exist?


#17

I don’t have much time right now (the days before Christmas are very busy here).

The path recorded for a snapshot is independent of what the structure within a snapshot looks like. The path is stored for several reasons:

  1. Try to find a previous snapshot on subsequent backups (what you can manually set with --parent)
  2. Group snapshots and treat different paths as different groups for the forget operation
  3. Display the origin of the data in a snapshot to users (e.g. with snapshots)

What you’re trying to solve (as far as I understood it) is mostly 2.

The problem is, even if you pass the right parent snapshot in manually, restic will likely detect the files as new because they are new (you’ve recently extracted them from a previous backup), so you don’t gain much from the “incremental” mode. You can check what restic prints with -v -v or by inspecting the number of “modified” vs “new” files restic prints at the end of a backup run. There’s a PR (#2047) which will add the --ignore-inode option to the backup command, which may help in this situation. If in doubt, restic will always opt to re-read data, because that’s the safe thing to do.
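
To see whether a manually supplied parent helps at all, one could wrap the call as below and compare the “new” and “modified” counts in the summary. This is only a sketch: the function name and the RESTIC override variable are hypothetical, and the snapshot ID and directory in the usage comment are taken from earlier posts purely as examples.

```shell
#!/bin/sh
# Sketch only: RESTIC and the function name are illustrative assumptions.
RESTIC="${RESTIC:-restic}"   # override for a dry run, e.g. RESTIC=echo

# Back up a directory against an explicit parent with extra verbosity.
# $1 = parent snapshot ID, $2 = directory to back up
backup_with_parent() {
    "$RESTIC" backup -v -v --parent "$1" "$2"
}

# Usage:
# backup_with_parent be494dfc /mnt/backup/home
```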

I think that in the long run, using the paths displayed to users for 1 and 2 is not sufficient, and we need to come up with something different.

Now you may ask yourself, “how can I influence the structure within the snapshot?” In earlier versions, restic would only use the last path component from the path passed to restic backup as the top-level path component. For example:

$ restic backup /home/user /mnt/srv/other/bar

would lead to /user and /bar being the top-level path elements in the snapshot. That led to all sorts of nasty problems (see #549 for a collection).

We changed this behavior in restic 0.9.0, so that it behaves like users expect it to behave, modeled after what tar does:

  • For absolute paths, restic recreates the same structure within the snapshot: restic backup /home/user/foo will build the structure /home/user/foo in the snapshot
  • For relative paths, restic will use the paths as they are passed to it: cd /home/; restic backup user/foo will build the structure /user/foo in the snapshot

Note that in both cases the path recorded in the snapshot will be /home/user/foo. When restic is run again with this (absolute) path, it doesn’t know the structure within the snapshot. If the previous run used the same (absolute) path it’ll work and restic will be able to skip unmodified files. If the previous run used the relative path (second case above), the structure in the snapshot is different and restic will re-read all files. That’s why I think that we need to come up with something else to detect unmodified files eventually.

In conclusion, you can influence the structure within a snapshot by changing the current directory and passing in relative paths to restic backup.
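
Applied to extracted archives, the relative-path behavior described above might be used like this. It is a sketch under assumptions: the staging directory /mnt/backup/stage, the function names, and the RESTIC override variable are hypothetical, and the archive’s top level is assumed to hold the contents of the target directory.

```shell
#!/bin/sh
# Sketch only: STAGE, RESTIC and the function names are illustrative assumptions.
RESTIC="${RESTIC:-restic}"            # override for a dry run, e.g. RESTIC=echo
STAGE="${STAGE:-/mnt/backup/stage}"   # hypothetical scratch directory

# Strip the leading slash so the path can be handed to restic as a relative one.
relative_target() {
    printf '%s\n' "${1#/}"
}

# $1 = archive, $2 = path to recreate inside the snapshot, e.g. /home/chris
backup_relative() {
    rel=$(relative_target "$2")
    rm -rf "$STAGE" || return 1
    mkdir -p "$STAGE/$rel" || return 1
    tar -xjf "$1" -C "$STAGE/$rel" || return 1
    # Run from the staging root so restic builds home/chris inside the snapshot.
    ( cd "$STAGE" && "$RESTIC" backup "$rel" )
}
```

The structure inside the snapshot then mirrors /home/chris, while the path recorded for the snapshot is still the absolute staging path.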


#18

Wow, thanks for all the thoughtful information! This helps my understanding a lot. I had no idea restic supported multiple paths in the same restic backup call.

That’s pretty great, so then separately from that would it be relatively simple to implement a command line option to override just the path recorded for the snapshot so I don’t need to use a chroot environment to trick it?

Then I could basically do:

cd /mnt/snapshots/chris--home/2018-01-01
restic backup . --path /home/chris

and get the same thing as if I had run on that day:

cd /home/chris
restic backup .

Regarding losing the benefit of the incremental mode: Currently I’m migrating the tarball snapshots I have of my home directory on my workstation as a trial run, and I don’t expect to be able to get the benefits of incremental scanning here. After this though I’m going to do the same for all my servers where I have a combination of whole-disk snapshots and hardlink+rsync incremental nightly snapshots. Both of those will have inodes consistent across snapshots


#19

Wow, I don’t know how I missed this possibility. If that’s the case, this can simplify a lot of my scripts by eliminating the need to chroot at all.


#20

Almost: restic does not support --path, so you’ll end up with snapshots saying you saved /mnt/snapshots/chris--home/2018-01-01, but the structure within the snapshot will be as if you’ve run restic within /home/chris.

If you then run restic forget, you’ll need to tell it to ignore the paths by setting --group-by host. If you have several different sets of backups, you need to work with tags and set e.g. --group-by host,tags and tag your migrated backups accordingly. Let me know if you need further help with that.

Except that there’s no --path option, so you still have the real, absolute directory recorded in the snapshot. But you can work with tags around that for forget.
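
A sketch of the tag-based workaround for forget, with assumptions stated up front: the tag name “migrated”, the function names, the RESTIC override variable, and the keep-monthly policy are all illustrative and not something from the posts above. The preview uses --dry-run so nothing is removed.

```shell
#!/bin/sh
# Sketch only: the tag name, RESTIC and the function names are illustrative assumptions.
RESTIC="${RESTIC:-restic}"   # override for a dry run, e.g. RESTIC=echo
MIGRATED_TAG="migrated"      # hypothetical tag for backfilled snapshots

# Tag migrated snapshots when creating them. $1 = directory to back up
backup_migrated() {
    "$RESTIC" backup --tag "$MIGRATED_TAG" "$1"
}

# Preview a retention policy grouped by host and tags instead of by path.
# $1 = number of monthly snapshots to keep (example policy only)
forget_migrated_preview() {
    "$RESTIC" forget --dry-run --group-by host,tags --tag "$MIGRATED_TAG" --keep-monthly "$1"
}
```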