Restic incremental backup report confusing

I just did my first incremental restic backup to an existing repository. A few days ago I did the first full backup to this same repository. In the meantime I had changed about 300 files. Why zero changed?
See pic. Why does it say “No parent snapshot found”? Why did it take only 31min to create 1.1TB worth of files?!
I have a feeling this incremental backup didn’t work, but the Restic report is not clear. Anyone know what’s going on here?


Here’s the terminal record from the initial full backup: (43 hours to do)

advait@advait-Bravo-15-A4DDR:/media/advait/Santhana2TB$ sudo restic init --repo /media/advait/Santhana2TB/rr
enter password for new repository: 
enter password again: 
created restic repository 7a28d66a1d at /media/advait/Santhana2TB/rr

Please note that knowledge of your password is required to access
the repository. Losing your password means that your data is
irrecoverably lost.
advait@advait-Bravo-15-A4DDR:/media/advait/Santhana2TB$ sudo restic -r /media/advait/Santhana2TB/rr --verbose backup ~/AUFs ~/Downloads ~/VirtualBoxVMs ~/snap
open repository
enter password for repository: 
repository 7a28d66a opened successfully, password is correct
created new cache in /root/.cache/restic
lock repository
load index files
start scan on [/home/advait/AUFs /home/advait/Downloads /home/advait/VirtualBoxVMs /home/advait/snap]
start backup on [/home/advait/AUFs /home/advait/Downloads /home/advait/VirtualBoxVMs /home/advait/snap]
scan finished in 5.346s: 217439 files, 1.012 TiB

Files:       217443 new,     0 changed,     0 unmodified
Dirs:        21925 new,     0 changed,     0 unmodified
Data Blobs:  927263 new
Tree Blobs:  21427 new
Added to the repo: 933.059 GiB

processed 217443 files, 1.012 TiB in 43:17:50
snapshot cc72d13f saved
advait@advait-Bravo-15-A4DDR:/media/advait/Santhana2TB$ l

Great question! First, deduplication is working just fine, the second backup only added ~215MiB to the repository.

To decide whether it needs to re-read a file it has already seen, restic first needs to find a snapshot that contains the file (that’s the parent snapshot mentioned above). We’re still working on that function, but for now you need to pass exactly the same list of files/dirs again. Otherwise restic will fall back to the safe default of re-reading everything.

When you ran restic the second time, it did not find a snapshot of exactly the list of directories you passed to it. On the first run, the list was:

~/AUFs ~/Downloads ~/VirtualBoxVMs ~/snap

When you ran restic the second time, it was:

~/AUFs ~/Downloads ~/snap ~/Applications

So restic opted for the safe default and re-read all data. That took a long time (~31 minutes), but afterwards restic decided that it only needed to add a little bit of data that was not in the repository yet.

You can check with restic snapshots, it’ll also print the list of target directories.
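As a rough sketch, that check could look like the following. The repository path is the one from the terminal log earlier in the thread; the small `run` helper is purely illustrative (not part of restic) and only prints the command unless you set RUN=1.

```shell
#!/bin/sh
# Repository path copied from the terminal log earlier in this thread.
REPO=/media/advait/Santhana2TB/rr

# Hypothetical helper: prints the command by default, executes it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# List all snapshots; the output includes the paths each snapshot was created from.
run sudo restic -r "$REPO" snapshots
```

Comparing the paths shown for each snapshot should make it obvious that the two runs used different directory lists.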

When you run restic again now with the same set of directories, it’ll be much faster, and it will also report how many files have changed.

Please be aware that the retention policies applied by restic forget will group snapshots according to the list of directories they contain, so right now you will end up with two groups (for the two lists of dirs). If you don’t want that, you need to tell restic to only group by hostname using --group-by host (the default is --group-by host,paths).


Thanks for the details. What is the group by host command?
sudo restic something something --group by host
Could you give me the command I need to type to activate ‘group by host’? Thx.
I did a Restic docs search for “group-by host” but didn’t find anything useful.

So, IIUC, group-by host means that if I change the directories I’m backing up, Restic will incorporate those changes into the existing snapshot repo? (that sounds like what I want rather than a new snapshot chain.)

The --group-by option is for when you want to delete snapshots, not create them. As long as you have the disk space, there isn’t any rush to think about this now; get comfortable with the restic backup process first, and then start learning about snapshot deletion and --group-by in the restic documentation on removing backup snapshots (restic forget).
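For when you do get there, here is a minimal sketch of where --group-by host would go. The --keep-last 5 policy is just a placeholder I made up for illustration, the `run` helper is hypothetical (it only prints the command unless RUN=1), and --dry-run asks restic to show what it would remove without removing anything.

```shell
#!/bin/sh
# Repository path copied from the terminal log earlier in this thread.
REPO=/media/advait/Santhana2TB/rr

# Hypothetical helper: prints the command by default, executes it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Group snapshots by host only (ignoring their path lists) when applying
# the retention policy. "--keep-last 5" is an example policy, not a recommendation.
run sudo restic -r "$REPO" forget --group-by host --keep-last 5 --dry-run
```

Note that this belongs to restic forget, not restic backup; it has no effect on how snapshots are created.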

The challenge you’re facing right now is only about optimizing your backup process. It needs some additional info if you want it to run as fast as possible.

The presence of a parent snapshot speeds up a backup because it gives restic something to compare against, so it can quickly find the changes that need saving. The relevant option (--parent) is listed in the output of restic backup --help. To be clear, your repo will be in the same state after a backup is run, irrespective of whether a parent was found automatically, or manually specified, or not used at all (and irrespective of which snapshot is used as a parent); this is just about backup speed.

In most use cases restic will automatically find an appropriate parent, so there’s nothing you need to specify. In your case, as @fd0 says above, it’s failing to find one automatically because you’ve changed your list of target directories to back up (which is a perfectly OK thing to do). So you have two choices:

  1. Keep doing what you’re currently doing, and don’t stress that the first backup using a different combination of directories is slower. Subsequent backups with that combination will be faster.
  2. Optimize things as much as possible from the start by manually specifying a parent snapshot whenever you want to back up a different combination of directories. Choose your desired parent from the list provided by restic snapshots, then pass the chosen snapshot ID to the backup command with --parent.
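
A minimal sketch of the second option, assuming you pick the snapshot from your first backup (cc72d13f in your log) as the parent and back up the second directory list mentioned above. The `run` helper is hypothetical and only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Repository path and snapshot ID copied from the terminal log earlier in this thread.
REPO=/media/advait/Santhana2TB/rr
PARENT=cc72d13f

# Hypothetical helper: prints the command by default, executes it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Back up a changed directory list, but tell restic which existing snapshot
# to compare against so that unchanged files need not be re-read.
run sudo restic -r "$REPO" --verbose backup --parent "$PARENT" \
    ~/AUFs ~/Downloads ~/snap ~/Applications
```

Either way the resulting snapshot is the same; the explicit parent only affects how long the run takes.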

OK, thanks. So my understanding is if I add or remove directories from my backup list, Restic will then create a new parent snapshot. And if the directories don’t change, child snapshots (and data blobs) will just keep getting added onto that parent snapshot. Sound about right? Let me know if I’m misunderstanding. Thx.

The speed of incremental backups is not a concern to me. I know they’ll be pretty fast unless I change A LOT of files in such a way that very little dedup happens.

Suggestion: If a newbie like me changes the directories like I did, add the appropriate message in the Restic job log report. That way newbies like me won’t get confused (or will get less confused). Basically explaining why ‘no parent snapshot was found’ and saying ‘relax, it’s not a problem and here’s why…’.

Hmmm… nope, not quite there yet! But I think I understand your confusion.

I think the concept of a restic parent can be confusing/misleading just because of its name: the term “parent” is used widely in data science, where it is usually fundamental to the structure and arrangement of the information. That’s not what we’re talking about here.

A restic parent is just an existing snapshot that can be looked at while making a new snapshot, with the sole purpose of speeding up that task. There’s probably a better/less confusing term than “parent”, but I can’t think of one right now. It’s really like one aspect of a cache, but don’t use that term because restic’s cache is something else!

As soon as the backup is completed, the parent/“child” relationship is forgotten. If you were to study the restic repository format definition for example, you’ll find no reference to parents (and definitely no children!).

So, back to the specifics of your question. No, the concept of “creating a parent snapshot” makes no sense in restic, and a snapshot is never “added” onto another snapshot (remember, there are no “full” or “incremental” backups in restic; all snapshots are created equal). The “parent” of a particular snapshot is just a temporary pointer to an existing “similar” snapshot, which helps restic speed up the creation of a new snapshot and is then completely forgotten about.


As for your suggestion to explain ‘no parent snapshot was found’ in the job log: putting this kind of contextual info in the log may be one step too far, but given the potential for misunderstanding this particular term, it could benefit from a clearer definition in the docs. Ideally, RTFM would have been the answer to your question!

Thanks for all the details. I’ll keep using Restic and I’m sure I’ll slowly come to grasp the subtleties of how it works. I like the capability of going back in time to get earlier versions if needed.

Could someone mark this as Solved? Thx.