Exclude vs include behavior

jamesfm · September 4, 2019, 2:32am

I am setting up some restic repos for the first time, and I was surprised by some of the behavior of exclusion vs. inclusion.

Here is the scenario: I have a directory sample-files that contains 3 sample directories. In both cases, I want to start with 2 of the directories included, and then add in all directories.

Exclusion:
restic -r b2:<testing-repo>:exclude backup --verbose sample-files/ --exclude sample-files/sample1/

Inclusion:
restic -r b2:<testing-repo>:include backup --verbose sample-files/sample2/ sample-files/sample3/

(In my real use case I was using the --files-from option, but my understanding is that the behavior is basically the same.)

At this point, these repos are the same. What changes is when I add in the missing directory and point the next backup to sample-files with no inclusion or exclusion.

What I see then is two snapshots in both repos. When I go to clean up both repos with restic forget -r b2:<testing-repo>:<repo-name> --keep-daily 1, the exclude repo forgets all but 1 snapshot, whereas the repo created by the include method, which has different paths, keeps both snapshots around.

My current use case is that I am selectively deciding what files to include, as I have a lot of junk on my NAS. It is far easier to take a whitelist approach and include files than it is to exclude, due to the amount of data. The issue is that snapshots stick around after forget operations, since the set of paths changes as I add additional files. I can go in and manually clean up the snapshots, but that is less than ideal.

TLDR: Is there a way to selectively include files but keep a single path so that snapshots get forgotten properly?

ProactiveServices · September 4, 2019, 2:05pm

It’s important to include the actual commands you’ve run, in case the command line you used works differently than you expect. Output from forget would also be helpful; run it with --verbose --dry-run and it’ll give a reason why a snapshot is being kept (feel free to redact and elide long lists of snapshots it would remove).

Restic considers snapshots with different paths separately when picking candidates for forgetting. You can use the --group-by option to alter this behaviour.

You could run a backup with a single source, backed by whatever exclusions you want, but use the --tag option to group similar snapshots. This should make future forgets work in the way you wish.
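To illustrate why tags help here, below is a minimal Python sketch of the grouping idea. It is a toy model, not restic's implementation: the snapshot records, the `samples` tag, the `nas` host name, and the `group_snapshots` helper are all made up for illustration, and treating the path list as order-insensitive is an assumption.

```python
from collections import defaultdict

# Hypothetical snapshot records; only the fields relevant to grouping are modeled.
snapshots = [
    {"id": "snap-a", "host": "nas",
     "paths": ["sample-files/sample2", "sample-files/sample3"], "tags": ["samples"]},
    {"id": "snap-b", "host": "nas",
     "paths": ["sample-files"], "tags": ["samples"]},
]

def group_snapshots(snaps, group_by):
    """Bucket snapshot IDs by the chosen fields, loosely mimicking --group-by."""
    groups = defaultdict(list)
    for snap in snaps:
        key = tuple(
            # Strings are used as-is; lists are sorted (assumption: order is ignored).
            snap[field] if isinstance(snap[field], str) else tuple(sorted(snap[field]))
            for field in group_by
        )
        groups[key].append(snap["id"])
    return dict(groups)

by_paths = group_snapshots(snapshots, ["host", "paths"])
by_tags = group_snapshots(snapshots, ["host", "tags"])

print(len(by_paths))  # 2: different path lists land in different groups
print(len(by_tags))   # 1: a shared tag puts both snapshots in one group
```

With the snapshots in one group, a keep policy is evaluated across both of them, which is what makes the older one eligible for removal.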

cdhowie · September 4, 2019, 2:44pm

The way restic forget groups snapshots by default is by host and paths. Excluded paths are not remembered and so aren’t part of this grouping. The workflow you’re describing isn’t really something restic was designed for.

The only thing I can think of is to use tags to differentiate the different backups and use restic forget --group-by host,tags.

Alternatively, just back up each folder individually as this will allow restic forget to work properly by default, and it gives you more visibility into what happened with restic snapshots.

Keep in mind that deduplication happens across snapshots, even for different hosts or paths. Having a bunch of extra snapshots around until you get your exclude rules 100% right isn’t really that big of a deal.
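As a rough picture of why extra snapshots cost little, here is a toy content-addressed store in Python. It is only a sketch under simplifying assumptions: each file is stored as a single blob keyed by its SHA-256, whereas restic splits files into variable-sized chunks. The `BlobStore` class, `take_snapshot` helper, and file contents are hypothetical.

```python
import hashlib

class BlobStore:
    """Toy content-addressed store: each unique blob is kept once, keyed by SHA-256."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        blob_id = hashlib.sha256(data).hexdigest()
        # setdefault stores the data only if this ID has not been seen before.
        self.blobs.setdefault(blob_id, data)
        return blob_id

def take_snapshot(store, files):
    """Return a snapshot as a mapping of file name -> blob ID."""
    return {name: store.put(content) for name, content in files.items()}

store = BlobStore()

# First backup covers two directories; the second adds a third.
first = take_snapshot(store, {
    "sample2/a.txt": b"alpha",
    "sample3/b.txt": b"beta",
})
second = take_snapshot(store, {
    "sample1/c.txt": b"gamma",
    "sample2/a.txt": b"alpha",
    "sample3/b.txt": b"beta",
})

print(len(store.blobs))  # 3: the two unchanged files were not stored again
```

The second snapshot references five... no, three files but adds only one new blob, so keeping both snapshots around costs little beyond the snapshot metadata in this model.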

jamesfm · September 7, 2019, 9:23pm

The output from the forget shows the previous snapshot isn’t forgotten due to the different path.

Using group-by with tags makes sense to me.

jamesfm · September 7, 2019, 9:35pm

Okay, your and ProactiveServices’ suggestions make sense.

I think what confuses me is that, even with the default forget grouping of host,paths, forget seems to operate on the combined set of paths rather than on each included path.

Applying Policy: keep the last 7 daily, 5 weekly, 6 monthly snapshots
snapshots for (host [backup-restic], paths [/backup/path3, /backup/path1, /backup/path2]):
keep 1 snapshots:
ID        Time                 Host           Tags        Reasons           Paths
--------------------------------------------------------------------------------------------------------
7f79e88d  2019-09-03 16:07:50  backup-restic              daily snapshot    /backup/path3
                                                          weekly snapshot   /backup/path1
                                                          monthly snapshot  /backup/path2
--------------------------------------------------------------------------------------------------------
1 snapshots

snapshots for (host [backup-restic], paths [/backup/path1, /backup/path2]):
keep 1 snapshots:
ID        Time                 Host           Tags        Reasons           Paths
-------------------------------------------------------------------------------------------------------
7d7a808e  2019-09-03 15:57:56  backup-restic              daily snapshot    /backup/path1
                                                          weekly snapshot   /backup/path2
                                                          monthly snapshot
-------------------------------------------------------------------------------------------------------
1 snapshots

So in this case, restic doesn’t operate on the fact that path1 and path2 are included in the latest snapshot along with path3. Instead it considers path1 + path2 to be a unique grouping and path1 + path2 + path3 to be another unique grouping.

At this point I think I get how it works. It’s just odd to me how it works, though that gets into feature request territory.
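The behaviour in the output above can be approximated with a short Python sketch. This is a simplified model rather than restic's code: `keep_daily` is a hypothetical helper that keeps the newest snapshot per day within each group, and the snapshot IDs, times, host, and paths are copied from the forget output.

```python
from collections import defaultdict
from datetime import datetime

# The two snapshots from the forget output above.
snapshots = [
    {"id": "7d7a808e", "time": datetime(2019, 9, 3, 15, 57, 56),
     "host": "backup-restic",
     "paths": ("/backup/path1", "/backup/path2")},
    {"id": "7f79e88d", "time": datetime(2019, 9, 3, 16, 7, 50),
     "host": "backup-restic",
     "paths": ("/backup/path1", "/backup/path2", "/backup/path3")},
]

def keep_daily(snaps, group_fields, n):
    """Simplified policy: per group, keep the newest snapshot of each of the n latest days."""
    groups = defaultdict(list)
    for snap in snaps:
        groups[tuple(snap[f] for f in group_fields)].append(snap)

    kept = []
    for members in groups.values():
        seen_days = []
        # Walk each group newest-first so the first snapshot seen per day wins.
        for snap in sorted(members, key=lambda s: s["time"], reverse=True):
            day = snap["time"].date()
            if day not in seen_days and len(seen_days) < n:
                seen_days.append(day)
                kept.append(snap["id"])
    return sorted(kept)

print(keep_daily(snapshots, ["host", "paths"], 1))  # ['7d7a808e', '7f79e88d']
print(keep_daily(snapshots, ["host"], 1))           # ['7f79e88d']
```

Under host,paths grouping each snapshot is alone in its group, so each is the newest in its group and both are kept. Grouping by host only puts them together, and the older one becomes removable.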

jamesfm · September 7, 2019, 10:10pm

Quick follow-up question, to make sure I am not going to screw up uploading GBs of data.

Let’s say I have 10 individual paths and I want to start uploading paths individually due to the size of all the combined paths. So I start by uploading path1, and then go back and upload just path2, etc.

After uploading all 10 paths, can I then back up the root path, rely on deduplication, clean up all snapshots except the latest one of the root directory, and end up with a backup that has all 10 paths?

moritzdietz · September 7, 2019, 11:11pm

This should work. Just don’t run any prune commands which would remove data that is not referenced in any snapshots.

cdhowie · September 7, 2019, 11:51pm

@moritzdietz I think they are saying that they are going to complete multiple backups, one of each path, then create another backup that encompasses all of the same folders. If this is true, running prune won’t matter because the data is still referenced by one of the single-folder snapshots.

@jamesfm Yes, this should work exactly how you’ve described.
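The reasoning about prune can be sketched as a reachability check in Python. This is a toy model under stated assumptions: blob IDs are invented, each path is represented by a single blob, three paths stand in for the ten, and `referenced_blobs` and `prune` are hypothetical helpers rather than restic functions.

```python
def referenced_blobs(snapshots):
    """Union of all blob IDs still referenced by any remaining snapshot."""
    refs = set()
    for blobs in snapshots.values():
        refs |= blobs
    return refs

def prune(stored, snapshots):
    """Keep only the stored blobs that at least one snapshot references."""
    return stored & referenced_blobs(snapshots)

# Hypothetical blob IDs: one per path for brevity.
stored = {"blob-path1", "blob-path2", "blob-path3"}
snapshots = {
    "path1-only": {"blob-path1"},
    "path2-only": {"blob-path2"},
    "path3-only": {"blob-path3"},
    "root": {"blob-path1", "blob-path2", "blob-path3"},
}

# Forget the per-path snapshots, keeping only the root snapshot.
for name in ["path1-only", "path2-only", "path3-only"]:
    del snapshots[name]

print(prune(stored, snapshots) == stored)  # True: the root snapshot references everything
```

In this model, data only becomes removable once no remaining snapshot references it, so the order matters: the root snapshot has to exist before the per-path snapshots are forgotten and pruned.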