Naming snapshots and the benefits thereof

damoclark · August 30, 2024, 8:45am

Michael:

I’m currently rather hesitant to add anything further to the roadmap for now. There’s already enough queued for 0.18 to probably require two releases and 0.19 also has the chance to spiral out of control.

Understandably, there is jostling to have one’s own feature requests added to the front of the queue of work being proposed. An unenviable position, having to make prioritisation decisions.

What follows is an argument for prioritising an issue that I have long advocated addressing. I propose that, if added to the 0.18 roadmap, it would not only help solve 3 existing items in the 0.18 roadmap (as given below), but also lay the groundwork for resolving a long-standing underlying issue that makes restic more difficult to use than it need be: its tight coupling with absolute paths.

So what’s the Issue?

Back in 2018, fd0 asked an important question:

fd0:

… it surprised me that backing up the same data, to the same destination, but from a different path, didn’t use a parent snapshot (and no using parent snapshot abc… was printed). Actually, when the backup finished, it printed two separate lists of snapshots…

I can understand why it’s unexpected for you as a user. For me as a developer, I need to ask (honest question): How would restic detect that the files which were previously located in /original/path/mydata are the same that are now mounted at /usbdisk/mydata? Sure, we could try matching the file names and sizes and inodes and device IDs, but all that is only approximate and not exact. Therefore, at the moment, we err on the side of caution and always read data that we’re not sure where it comes from. That’s what users can (and should) expect from a backup program: Saving the correct data.

We’ve had several reports now with users reporting their surprise and non-understanding of why restic re-reads the files. Do you have any idea on how to improve this situation from a user’s point of view?

I do. Specifically, my idea aims to:

Reduce confusion for users by changing the current, nebulous conceptual model of Restic backup snapshots;
Make Restic more reliably match --parent snapshots;
Make all backups relative to one or more defined paths (or ‘/’ by default as currently hard-coded) on command line like other popular backup tools (reducing confusion between absolute & relative backups); and,
Make it easier to support backups of removable media, and mounted file system snapshots (LVM, ZFS, etc) without fudging paths and/or complicated chroot set ups

What’s the new conceptual model?

There are three dimensions to every backup snapshot:

Where it was taken (host);
When it was taken (time); and,
What was taken (path/s).

Except, the path or paths can change between snapshots and, unlike the host and time, cannot be overridden at backup time. The new model would represent the three dimensions of every backup snapshot as:

Where it was taken (host);
When it was taken (time); and,
What was taken (Backup Set Name).

To achieve this, my idea proposes two changes:

Explicitly name each backup snapshot that is taken, with all snapshots sharing the same name on the same host being defined as a “Backup Set”, rather than implicitly through the absolute path/s on the host computer. This user-chosen Backup Set Name defines what is backed up, rather than the absolute path/s.

Explicitly defining the ‘what’ with a user chosen name, means path/s can change, and restic will still know which parent to use.

And because the absolute paths would no longer be required to identify the parent, the second change comes into play:

All backup snapshots are relative. They can remain relative to ‘/’ as they are now, but can also be relative to the source paths that are given, much like is typical of existing backup utilities such as tar, using its -C semantics. Thus, you construct your own arbitrary filesystem structure in your backup snapshot, rather than it being tethered to /, including Windows (i.e. /c:/path)

What about tags?

While restic supports tags, I do not consider them in the three-dimension model. This is because every snapshot has a time, a host, and path. However, a snapshot can have many tags, one tag, or no tags associated with them. Thus, tags are not “exact” enough to assist with identifying parent snapshots. In the above model, consider tags as ‘pointers’ to one or more snapshots - a means of grouping snapshots together, arbitrarily or otherwise, and only optionally.

The proposed new conceptual model

The following diagram of a restic repository attempts to illustrate these three dimensions in the new model being proposed. A similar diagram in the user documentation might help users understand how their snapshots are logically arranged within a repository, using the three-dimension model that we intuitively rely upon.

Questions for readers:

What might your existing restic repository look like using this new model?
Can you identify all your existing backup-sets, as defined by their one or more paths?
Can you represent them in a similar diagram, substituting your own where, when & what?
Does it make the contents of your repository clearer (or not)?

These are sincere questions. Please share the good and the bad - what works and doesn’t.

In essence, a ‘backup-set’ prefigures ‘what’ is being backed up. By defining this backup-set with a name, we can logically and sensibly group together ‘like’ snapshots, even when the path to the files changes, and whether tags are used or not.

And we can construct our own internal filesystem structure if and when it makes sense to. Consider the backup-set ‘photos’ on host ‘desktop’ in the example above. Pay attention to the proposed restic backup command, using the -C syntax of tar.

-C c:\users fred\photos mildred\photos

This says: make the backup relative to path c:\users and back up only fred\photos and mildred\photos from that relative path, resulting in the following internal snapshot directory structure:

/
├── fred
│   └── photos
│       ├── photo1.jpg
│       ├── photo2.jpg
│       ├── photo3.jpg
│       ├── photo4.jpg
│       └── photo5.jpg
└── mildred
    └── photos
        ├── photo1.jpg
        ├── photo2.jpg
        ├── photo3.jpg
        ├── photo4.jpg
        └── photo5.jpg

The path c:\users is not present.
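As a rough illustration of the proposed -C semantics, here is a minimal Python sketch of how source files could map to paths inside the snapshot. The function name snapshot_paths and its arguments are hypothetical and exist only for this example; restic has no -C option today, and nothing here reflects how restic is implemented.

```python
from pathlib import PureWindowsPath


def _is_under(path, parent):
    """Return True if path is located at or below parent."""
    try:
        path.relative_to(parent)
        return True
    except ValueError:
        return False


def snapshot_paths(base, members, files):
    """Map absolute source files to their paths inside the snapshot.

    base    -- directory given to the proposed (hypothetical) -C option
    members -- directories below base that should be backed up
    files   -- absolute paths of the files found on disk
    """
    base_path = PureWindowsPath(base)
    selected = [base_path / member for member in members]
    result = []
    for name in files:
        path = PureWindowsPath(name)
        # Keep only files that live under one of the selected members.
        if any(_is_under(path, s) for s in selected):
            # Strip the -C base; the remainder is the in-snapshot path.
            result.append("/" + path.relative_to(base_path).as_posix())
    return result
```

Called with base c:\users and members fred\photos and mildred\photos, a file such as c:\users\fred\photos\photo1.jpg would be stored as /fred/photos/photo1.jpg, while anything outside the two members is skipped.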

Here are some further example restic commands and proposed syntax that use this new model.

For removable media:


# Unix-like

restic backup -r repo --name 'my external hdd' -C /var/media/<username>/myfiles

# Windows

restic backup -r repo --name 'my external hdd' -C X:\

# MacOS

restic backup -r repo --name 'my external hdd' -C /Volumes/myfiles

For filesystem snapshots:


# Mount all snapshots relative to /mnt/backup/root

mount /dev/vg00/root_snap /mnt/backup/root

mount /dev/vg00/home_snap /mnt/backup/root/home

mount /dev/vg00/var_snap /mnt/backup/root/var

restic backup -r repo --name fullbackup -C /mnt/backup/root

umount -R /mnt/backup/root

# Or mounted separately, and reconstructed using `-C` syntax

mount /dev/vg00/root_snap /mnt/backup/root

mount /dev/vg00/home_snap /mnt/backup/home

mount /dev/vg00/var_snap /mnt/backup/var

# -C /mnt/backup/root will back up everything relative to that path

# -C /mnt/backup home var will back up only directories home and var, but relative to /mnt/backup

# thus, reconstructing the original filesystem mount layout

restic backup -r repo --name rootfs -C /mnt/backup/root -C /mnt/backup home var
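One way to read the repeated -C syntax is that each -C starts a new group: the directory that follows is the base, and any further arguments up to the next -C are the members to include relative to it. The Python sketch below shows that reading. The function parse_c_groups is hypothetical, and treating an empty member list as "." (everything under the base) is my assumption about the intended behaviour.

```python
def parse_c_groups(args):
    """Split arguments of the proposed form '-C base [member ...]' into groups.

    Returns a list of (base, members) tuples. A group without members is
    taken to mean "everything below base" and is represented as ["."].
    """
    groups = []
    i = 0
    while i < len(args):
        if args[i] == "-C":
            if i + 1 >= len(args):
                raise ValueError("-C requires a directory argument")
            # Start a new group with the directory that follows -C.
            groups.append((args[i + 1], []))
            i += 2
        else:
            if not groups:
                raise ValueError("member given before any -C: " + args[i])
            # Attach the member to the most recent -C group.
            groups[-1][1].append(args[i])
            i += 1
    return [(base, members or ["."]) for base, members in groups]
```

For the arguments -C /mnt/backup/root -C /mnt/backup home var this yields two groups: everything under /mnt/backup/root, plus home and var relative to /mnt/backup.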

The disquisition

For the curious, details can be explored through the disquisition in the following github issue:

Let’s name our snapshots and define them collectively as a ‘backup-set’ - a means of clearly identifying the ‘what’, with the ‘where’ and ‘when’ in our restic repositories

Note that ideas evolve throughout the discussion.

Transition

How can such a substantial conceptual change be made smoothly, without requiring users to make immediate adjustments to their scripts and automations?

Presently, parent snapshots are identified by triangulating the three dimensions of host, time and path/s. For existing snapshots in a repository (and existing working environments under the current paradigm), a normalised representation of their path/s (remembering that there can be more than one path for a snapshot) could be declared as the default backup-set name. This means that --name can initially be defined as an optional parameter. If no --name argument is provided (in other words, restic is executed without using the above new syntax), then matching of parents would continue as-is, and thus will not break any existing automation.

However, if you change paths, you will in effect implicitly create a brand new backup-set.

To use the -C function, you would need to specify --name.

This addresses the input interface compatibility layer. What about metadata outputs from restic?

A new output representation (e.g. snapshots command) from restic would thereafter refer to backup-sets, rather than paths, where existing snapshots would use the normalised form of the paths as the backup-set name. So the new model would emerge as intended, and over time, users could ‘rename’ their backup sets (a new command) from the existing normalised path/s to a new given name, then invoke backup snapshots using the new explicit name with the --name argument. Or quote the normalised path as the name, if the path is a good name anyway.

A normalised path might be represented as the sorted concatenation of one or more paths provided to the restic backup command, separated by a comma. An escape character might be needed to represent a literal comma and the literal escape character.
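A minimal Python sketch of such a normalised name follows. The choice of backslash as the escape character and the function names are assumptions made for this example only; the proposal leaves both open.

```python
ESCAPE = "\\"     # assumed escape character
SEPARATOR = ","   # separator suggested in the proposal


def escape_path(path):
    """Escape literal escape characters and commas inside one path."""
    # Escape the escape character first, so that the backslashes added
    # for commas are not themselves escaped a second time.
    return path.replace(ESCAPE, ESCAPE + ESCAPE).replace(SEPARATOR, ESCAPE + SEPARATOR)


def default_backup_set_name(paths):
    """Sorted, comma-separated concatenation of the escaped paths."""
    return SEPARATOR.join(escape_path(p) for p in sorted(paths))


def split_backup_set_name(name):
    """Inverse of default_backup_set_name: recover the list of paths."""
    paths, current, i = [], [], 0
    while i < len(name):
        ch = name[i]
        if ch == ESCAPE and i + 1 < len(name):
            # An escaped character is taken literally.
            current.append(name[i + 1])
            i += 2
        elif ch == SEPARATOR:
            paths.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    paths.append("".join(current))
    return paths
```

Sorting makes the name independent of the order in which the paths were given on the command line, so /home/b /home/a and /home/a /home/b would fall into the same default backup-set.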

Then a future release could mark the --name argument as mandatory, and simplify the entire restic interface in service of the new model. Every snapshot is named and matched against existing parents; if the name and host combination is new, it is flagged as a new backup-set and a full scan is performed without a parent.
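Under this model, parent selection could reduce to a lookup on host and backup-set name. The Python sketch below is only meant to show the idea; the Snapshot record and find_parent function are hypothetical and do not mirror restic's snapshot format or code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Snapshot:
    id: str
    host: str        # where it was taken
    time: datetime   # when it was taken
    name: str        # what was taken (hypothetical backup-set name)


def find_parent(snapshots: Iterable[Snapshot], host: str, name: str) -> Optional[Snapshot]:
    """Return the newest snapshot of the same backup-set on the same host.

    None means the (host, name) combination is new, so the backup would
    start a new backup-set and scan everything without a parent.
    """
    candidates = [s for s in snapshots if s.host == host and s.name == name]
    return max(candidates, key=lambda s: s.time, default=None)
```

Note that the paths of the snapshots play no part in the lookup, which is what allows the source data to move between backups.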

Why is this so important?

The user forums are awash with sensible use-cases that require overriding the defaults for each of these three dimensions. Yet only two can be overridden, with --time and --host; the path cannot.

In terms of usability, and in consideration of the broad range of user skills, from novices to system administrators, challenges and/or confusion appear again and again in the user forums, as evidenced by a simple search: Search results for ‘path parent’ - restic forum

This is a sampling, based on roughly half the results returned from the search - there were more. The earliest sample dates back to the beginning of 2018 circa fd0’s question. The most recent example of confusion was posted only last month:

Parent content identification bug

Clearly, a great deal of time has been spent by community members and developers responding to the inability of restic to cope with data appearing at different paths between backup snapshots, and the complexities of aligning parent snapshots. To me, the evidence suggests this is a high-priority issue, worthy of the 0.18 roadmap.

Next Steps

There is still more detail to flesh out with this approach, and design decisions to make.

A summary of my argument is:

this work, while substantial, may reduce the number of forum posts from users confused by Restic’s snapshot model, and thus, give valuable time back to the broader community; and,
it will also solve (or substantially contribute to) three items already listed for the 0.18 roadmap

So, the questions become: is there broad support for exploring this model change further? And if so, as a priority for 0.18?

fd0 · August 30, 2024, 9:04am

I’ve moved this post to a new thread, it’s too much for the roadmap thread

MichaelEischer · August 30, 2024, 10:21am

I definitely want to look into this soonish, although I first have to finish the fs code refactoring which will still take a few weeks. I still see the --name and -C options as two somewhat separate problems, although it’s probably better to discuss both and their interaction before deciding on what to implement in 0.18.

I’ve updated the placeholder in the roadmap Restic 0.18 roadmap · GitHub to also include your issue.

Would the snapshot then still include paths and if yes, which ones? Currently, at least when using absolute paths (but not for relative paths), the paths of a snapshot give a rough indication of the paths at which a snapshot contains its files.

We definitely need a way to include files from multiple volumes in a single snapshot. And I’d like to avoid something horrible like restic backup -C \\?\ C: D: . (The colons would also cause problems during restore).

I usually just create a full system backup with a few excludes, so it won’t make much of a difference for me.

alexweiss · August 30, 2024, 12:44pm

As some input / possibility to test the concept:

rustic features two extensions:

A label which can be set for a snapshot using e.g. rustic backup --label <LABEL>. This is basically what @damoclark named “Backup Set Name”, --name. If you group by host,label, you already get the right parent for your backup runs. Note that the label is also included in the filter-rules, here the option --filter-label.
The possibility to back up relative paths: rustic backup relpath/. This is almost the same as the proposed -C option, but you first have to cd into the right parent path before running rustic.
There is actually also the option --as-path which is able to rewrite paths, so this could also be used to emulate the proposed -C.

So, feel free to test everything out using rustic and also feel free to take a look at the implementation details (but those are actually no rocket science)!
And, @MichaelEischer if you decide to add such functionality to restic, please use the name label in the snapshot file, so restic and rustic stay compatible. Thanks!

MichaelEischer · August 30, 2024, 1:25pm

IMO tags and labels feel too similar. Why is there only a single label, but multiple tags? What’s even the difference between tags and labels? And why not use something like backup_name or so that is more descriptive?

That’s also the status quo with restic.

What does rewrite mean? Does it only change the paths in a snapshot? Or also the structure within the snapshot?

alexweiss · August 30, 2024, 1:56pm

alex-dev@latitude:~/rust/rustic$ ~/restic -r /tmp/repo backup src/ --quiet
enter password for repository:
alex-dev@latitude:~/rust/rustic$ ~/restic -r /tmp/repo snapshots
enter password for repository:
repository 7272fa99 opened (version 2, compression level auto)
ID        Time                 Host        Tags        Paths                           Size
--------------------------------------------------------------------------------------------------
38550d9d  2024-08-30 15:53:10  latitude                /home/alex-dev/rust/rustic/src  207.099 KiB
--------------------------------------------------------------------------------------------------
1 snapshots

But I just checked: The tree of that snapshot starts with src, which I wasn’t aware of. I thought the tree would match the snapshot path…

Both. rustic creates snapshots where the snapshot path matches the tree structure.

MichaelEischer · August 30, 2024, 7:49pm

The paths in the snapshot metadata are currently absolute paths as relative paths alone are meaningless. If there’s another way to identify backup sets, then we can obviously revisit that behavior.

alexweiss · August 31, 2024, 3:51am

I personally don’t find relative paths meaningless. I think if there are paths stored, the users should be able to control if these are relative or absolute ones.

One of the problems @damoclark mentioned was people having their backup source under changing paths and struggling with the parent detection. Relative paths can solve this problem - as does using some “label/name/whatever you call it” instead of paths for the parent detection.

rawtaz · August 31, 2024, 10:20pm

I fail to see the point of introducing a separate “field” or “storage point” for the suggested “name”. What you’re essentially suggesting here is that restic should be able to identify a snapshot by a “name”. So, just use a tag for that. In other words:

The first restic backup --name 'Backup Set Name' .. will add a tag named “Backup Set Name” to the snapshot.
The next restic backup --name 'Backup Set Name' .. will then find the latest snapshot having a tag “Backup Set Name” on it, and use that as the parent snapshot.

This ought to meet your suggestion while keeping things simple in the sense that we don’t introduce an additional field just to store the Backup Set Name you propose. Also, it doesn’t matter if the user adds additional tags to their snapshots, it has no relevance or bearing on the --name feature you propose.
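For comparison with a dedicated name field, the tag-based lookup described here could be sketched in Python as follows. The dictionary keys and the function find_parent_by_tag are hypothetical, and matching on the host as well is my assumption that the host would still be considered as it is today.

```python
from datetime import datetime


def find_parent_by_tag(snapshots, host, tag):
    """Return the newest snapshot on host that carries the given tag.

    snapshots -- iterable of dicts with the (hypothetical) keys
                 'id', 'host', 'time' and 'tags'
    Any other tags on a snapshot are ignored; only membership of the
    requested tag matters. Returns None if no snapshot matches.
    """
    candidates = [
        s for s in snapshots
        if s["host"] == host and tag in s["tags"]
    ]
    return max(candidates, key=lambda s: s["time"], default=None)
```

The only structural difference from a single name field is that the value is looked up among a set of tags rather than compared with one attribute.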

damoclark · September 1, 2024, 4:16am

Thanks to everyone for contributing to this discussion. I’d like to start with the big-picture item first.

rawtaz has raised an important area of discussion - tags. I said:

While restic supports tags, I do not consider them in the three-dimension model. This is because every snapshot has a time, a host, and path. However, a snapshot can have many tags, one tag, or no tags associated with them. Thus, tags are not “exact” enough to assist with identifying parent snapshots.

Rawtaz said:

I fail to see the point of introducing a separate “field” or “storage point” for the suggested “name”. What you’re essentially suggesting here is that restic should be able to identify a snapshot by a “name”. So, just use a tag for that. In other words:

The first restic backup --name ‘Backup Set Name’ … will add a tag named “Backup Set Name” to the snapshot.

The next restic backup --name ‘Backup Set Name’ … will then find the latest snapshot having a tag “Backup Set Name” on it, and use that as the parent snapshot.

This ought to meet your suggestion while keeping things simple in the sense that we don’t introduce an additional field just to store the Backup Set Name you propose. Also, it doesn’t matter if the user adds additional tags to their snapshots, it has no relevance or bearing on the --name feature you propose.

This point has been raised before in the github issue and has been addressed. Reading on from this comment will catch people up on the alternative proposed solutions and rebuttals. Note that at that stage, discussion was in relation to a ‘label’, rather than a name. Using a name rather than a label emerged later, but the concept remains the same.

Philosophy

I’ve worked in the IT Industry for 30 years (and higher education for 25). I’ve found a handful of philosophies by far smarter people than I, that have served me well. A relevant one here is:

Make the simple things easy, and complicated things possible.

To break this down to the current conundrum, simple and complicated relate to user tasks, while easy and possible are our solutions from the perspective of a user. I contend that performing routine backups where restic matches the parent (for optimisation) is a simple thing from a user perspective. It should just happen. It doesn’t, and that is why the user forums are awash with confusion.

The approach you describe makes it possible, but I disagree that using tags as described to match parent snapshots is easy, at least from the user perspective. It puts an additional impost on users to strictly apply the strategy, and is not supported through defaults (e.g. requiring --group-by). It’s optional, not baked-in, and thus, not intrinsic to the snapshot model of restic itself. I see two big issues here:

There is a cognitive load for users to overcome, especially novice users, to understand (and remember) how tags work and how to lay out their “Backup Sets” with a tag as you describe, and apply --group-by; and,
because, as I said, “a snapshot can have many tags, one tag, or no tags associated with them … [they] are not ‘exact’ enough to assist with reliably identifying parent snapshots”. In other words, it is highly susceptible to human error. What if you forget to add the tag, or use the wrong tag (typo)?

By introducing a name to backup sets, and eventually requiring one, we can substantially reduce human error and remove the cognitive load of tags for novice users who just want to backup their stuff, and rely on sensible and intuitive defaults. And I personally think introducing the concept of “Backup Sets” and naming them makes sense as a core component of Restic’s snapshot model. Is naming things not an intrinsic human behaviour? To me, this is much simpler than what you propose, especially for novice users. And the evidence of this is revealed in the number of forum posts on the issue.

Detractors

Still, I recognise that this change introduces cognitive load to existing expert users such as yourself to reconceptualise how restic works.

See my comment that summarises this and other detractors to naming backup-sets, specifically:

Some cognitive-load for existing expert users to adjust to the new conceptual framework proposed (i.e. backup-sets). For example, users who currently use tags to name their backup-sets, as described by @RayZ0rr.

This of course is a very valid practice, likely common, and is obviously serving these people well.

It will require adjusting to the new conceptual framework proposed, both in terms of thinking and implementation (see point 1). This understandably will be a nuisance.

But the switch to using backup-sets will still support the same practice (there is no loss of functionality); people can adapt to such changes; and everyone will benefit from the aforementioned advantages that backup-sets provide.

That comment also highlights a great many details that need to be considered, along with some additional ones discussed above, to which I will turn my attention next.

Damien.

damoclark · September 1, 2024, 4:57am

fd0:

I’ve moved this post to a new thread, it’s too much for the roadmap thread

Yeah, sorry about that. If I’m less verbose, it means more questions. But admittedly, more verbose, and well… The elusive sweet spot exists I’m sure.

Michael:

I definitely want to look into this soonish, although I first have to finish the fs code refactoring which will still take a few weeks.

No problem Michael.

Michael continues:

I still see the --name and -C options as two somewhat separate problems, although it’s probably better to discuss both and their interaction before deciding on what to implement in 0.18.

Yes, I agree. Analyse/Design/Plan both initially, but perhaps implement them in two steps. I understand that 0.17 was too much in one release.

And of course, --name enables a -C type implementation.

MichaelEischer:

damoclark:
restic backup -r repo --name rootfs -C /mnt/backup/root -C /mnt/backup home var
Would the snapshot then still include paths and if yes, which ones? Currently, at least when using absolute paths (but not for relative paths), the paths of a snapshot give a rough indication of the paths at which a snapshot contains its files.

Yes, great question. I’m thinking that the snapshot would still record, as metadata, all the -C paths and included relative directories therein, for each backup snapshot. It just wouldn’t be relied upon for parent matching. Just useful context as a record for the user when the snapshot was taken and how.

Yes, that is horrible.

Which makes me wonder how restic on windows copes now, with restoring a snapshot that was taken using a command like:

restic backup -r repo C:\ D:\

I don’t have a Windows computer to experiment with, so I can’t just test it out myself. A restore happens in a target directory. How does restic now reconcile the drive letters in the absolute path of the backup to restoration targets? I am guessing that the restore replaces ‘C:’ and ‘D:’ with ‘C’ and ‘D’ as directories. E.g.

restic restore -r repo -t c:\ <snapshot id>
C:\
│   
├── C
│   ├── dir1
│   │   ├── file1
│   │   └── file2
│   └── dir2
│       ├── file1
│       └── file2
└── D
    ├── dir1
    │   ├── file1
    │   └── file2
    └── dir2
        ├── file1
        └── file2

Or does restic do something different?

In any case, what I proposed in this forum post is slightly different to what I proposed in the github issue. I am thinking now it might be a mistake.

The github issue proposes that restic backups are always relative to the current working directory (cwd), just like common command line archive tools like tar, zip etc.

This means that if you wish to make a backup, and it is unclear or not easy to set the cwd before running restic (invoked through some other tool), then that is where you use -C syntax to override the current working directory to make the backup relative to some place else. Or if you wish to backup from multiple disparate relative directories, then multiple uses of -C.

But, all backups would be relative to somewhere - the cwd by default, unless specified with -C.

And this, in effect, is changing the internal snapshot root path/s.

Would this solve the issue for windows as described? It would mean that the drive letter would never be included in the internal directory structure of a snapshot.

I think Alex is talking about internal representation for compatibility. This is the challenge with early adoption - Alex implemented the idea while the idea was still evolving.

My initial idea of labels for snapshot backup-sets was inspired by the concept of volume labels. But after further discussion in the github issue, I think Michael is right, especially for novice users, for whom a volume label isn’t a particularly meaningful concept. To me, naming backup-sets is the most meaningful ‘term’ to use for the most people.

alexweiss · September 1, 2024, 6:22am

As @damoclark mentioned, this has also been discussed in the github issue. The main points are:

You are perfectly right that, in theory, everything that can be solved using this label/… could also be solved using tags.
Note that your proposed tag solution also needs modifications to restic; what you are describing is not possible with the current version.
If you propose that some tag is special in the sense that it does identify the “what” (as @damoclark called it) and others don’t, wouldn’t it be simpler to just give it also a different name?