Exclude syntax confusion

Hello everyone,

I’m using restic 0.9.4 on CentOS 7.x to back up a nontrivial amount of data (several terabytes) to B2. The process is started via cron and is meant to take a snapshot of our NFS share every night at midnight.

Our data typically falls into two categories: user-created and therefore “precious”, and machine-generated and therefore “trivial”.

For simplicity, all the ‘trivial’ data lives in a folder we call ‘footage’. Typically, the data footprint of the footage folder (which we don’t want to back up) dwarfs all the other data. Here’s an example:

[gene@tws09 cmn]$ sudo du -sh /projects/my_project
2.2T	/projects/my_project

[gene@tws09 cmn]$ cd /projects/my_project
[gene@tws09 cmn]$ sudo du -sh *
7.3G	assets
3.0M	deliverables
784M	edit
2.2T	footage
68K	onset
68K	supers

As you can see, the data I’m interested in backing up on this project adds up to roughly 8.3G [as the 2.2T footage folder is being skipped].

Now for the fun part:

[gene@tws09 pxx_010]$ ps -ef | grep restic
root      24431  95350 33 Mar23 ?        07:05:43 restic --exclude "**/footage/**" -r b2:my-bucket-2019 backup /projects/my_project

As you can see, this process has been running for over 7 hours.
Our internal network is 10GbE, and we have a 1Gb fiber internet uplink which we know to be reliable. While it’s true that B2 is not the fastest platform, I suspect that something else might be going on here.

Now, for the odd part: looking at the files that are open/being accessed, I noticed that restic seems to be processing files that I’m explicitly asking it to ignore.

[gene@tws09 pxx_010]$ sudo lsof -c restic
restic  24431 root   24r      REG     0,42 27344764928 266637223 /projects/my_project/cmn/footage/raw/my_giang_27G_file.

Am I using the exclude flag correctly? Should restic be looking at this giant ‘trivial’ file under the footage folder?

If anyone has any insight, it would be greatly appreciated.

Best,

Gene

It’s odd that the file is open. I do know that the scanner will often descend into ignored directories when the exclude pattern ends with a glob: it will descend into a directory called “footage” even though you and I can reason that nothing inside it needs backing up. Restic isn’t that smart – it crawls everything and tests each path against the exclude patterns.
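A rough illustration of the idea – this is just bash pattern matching, not restic’s actual matcher – showing why the footage directory itself escapes a pattern with globs on both ends:

```shell
# Not restic's real matcher -- just a sketch of why a pattern like
# */footage/* forces a scanner into the directory: the directory's own
# path doesn't match the pattern, only the paths underneath it do.
match() {
  case "$1" in
    */footage/*) echo "excluded: $1" ;;
    *)           echo "kept: $1" ;;
  esac
}
match "cmn/footage"             # kept -> a per-path matcher must descend
match "cmn/footage/raw/big.mov" # excluded, but only after being visited
```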

However, I don’t know why restic would open this file.

On repeated runs of lsof, does the same file show as open?

Thank you for the quick response cdhowie, it’s much appreciated.

Repeated runs of lsof do show that the file is still open (it’s one of several largeish files held open by restic). The files are rather large (27G for the one in question), and iftop shows network traffic hovering around 10MB/s in both directions [which I’m interpreting as reads from the NFS mount versus upstream uploads to B2]. CPU usage is trivial.

Do you know how big the repository is? Has it grown much past the 8.3GB you expected to be added?

Good question – I’m unable to determine that at the moment, as this is the initial write for this project. Once the process completes, I will get an email report with the amount of data added to the bucket.

Currently we write all the data created ‘this year’ into a yearly bucket, so we expect to end up with nightly snapshots for all projects in the ‘2019 bucket’ – I’m not certain whether this is best practice either, to be honest.

You should be able to check with rclone size or using the B2 web console.
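For example, assuming you have an rclone remote named b2 configured for the same account (and the repository sits at the bucket root), something like this would report the total object count and size:

```shell
# Hypothetical invocation -- "b2" here is an rclone remote name, not restic's
# backend prefix; append a path after the bucket if the repo uses one.
rclone size b2:my-bucket-2019
```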

Please bear with me – I’m a bit of a restic noob.

When I query restic for the snapshots using restic snapshots -r b2:my_bucket2019, I do not yet see an entry for the current project, as it’s still running. I have no target at which to point rclone size at this stage – unless I’m misunderstanding.

Sure you do – the B2 bucket along with the prefix of the repository (if there is a prefix). There isn’t a final snapshot yet, but restic has presumably been pushing data continuously, and that data exists as data packs in B2. (This works regardless of the backend. For example, when backing up to a local directory, you can run du against that directory even while a backup is running.)

Note that each snapshot is not a single file; it’s potentially thousands of individual ~4-8MB “pack” files. Restic assembles a pack locally and, once it gets close to 8MB, pushes it to the backend storage and starts on the next pack.

The packs can additionally be shared between multiple snapshots (this is how the deduplication mechanism works).
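To make that concrete, here’s a toy sketch – tiny placeholder files standing in for real packs – of where packs live in a repository’s data/ directory and how you could count them mid-backup:

```shell
# Toy layout only -- the placeholder files stand in for real ~4-8MB packs.
# With an actual local repository, point du/find at its data/ directory.
repo=$(mktemp -d)
mkdir -p "$repo/data/ab" "$repo/data/cd"
head -c 4096 /dev/zero > "$repo/data/ab/0123abcd"
head -c 4096 /dev/zero > "$repo/data/cd/4567cdef"
find "$repo/data" -type f | wc -l   # number of packs pushed so far
du -sh "$repo/data"                 # total pack size
```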

Apologies for the reply spaghetti, and the deleted posts. I suspect I know where things went sideways in relation to being able to debug this.

Instead of creating a repo per project, I naively created a single repo at the root level of the bucket, into which I’m dumping all the projects. This may or may not come back to haunt me [gut check: I feel that it will] – it likely means that I can’t determine the data footprint of the packs transferred so far. I will likely need to reorganize this at some point.

[edit: with this new info I’ve decided to go back and adjust the backup script to first attempt to create a project-specific repo in the bucket – I’ve created new buckets for this purpose, so in theory, once my new upload has completed (of all the projects, ugh!), I should be able to simply delete the old buckets. It feels better organized – @cdhowie thank you for the tip.]

Having said that, is there anything else worth considering as to why these (large) files are being looked at by restic? Is it safe to assume this is the likely cause of the slowdown?

[post-post edit: this experiment definitely highlighted that something is amiss. The first project being archived shows as follows:

[gene@tws09 cmn]$ du -sh *
74M deliverables
58M edit
127G footage
68K onset
68K supers

(expecting about 132MB of data to be written)

and now that I’m able to see the individual project repos in the web interface, the upload is reporting 35GB uploaded to B2 thus far.]

I think if you pass -v to the backup command, it will give you a list of the files it’s uploading. You can use this to troubleshoot your exclude patterns. (Maybe try backing up to a local filesystem repository to test, until you have the patterns working the way you want.)
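Something along these lines, for instance (the local repo path is just a placeholder, adjust to taste):

```shell
# Hypothetical local test run -- /tmp/restic-test is only an example path.
restic init -r /tmp/restic-test
restic backup -v --exclude footage -r /tmp/restic-test /projects/my_project
```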

I set up a not-dissimilar configuration, with all but one small directory excluded, and then removed the exclusions progressively as I was ready for additional chunks of data to be backed up.

While I didn’t monitor the filesystem directly, restic definitely didn’t read the entire contents of the excluded directories, as the initial backup finished faster than would otherwise have been possible. I can’t speak to whether restic walked the directory structure.

What is the exact exclude command you’re using?

Below is the command string I’m using:

restic --exclude "**/footage/**" -r b2:my-bucket-2019 backup /projects/my_project

The simplified directory structure is:

19-0564_my_project
└── cmn
    ├── deliverables
    ├── edit
    ├── footage
    │   ├── graded
    │   ├── raw
    │   └── ungraded
    ├── onset
    └── supers

I’m hoping to exclude all the files inside/under the footage directory, but I’m clearly mangling the exclude string. Any insight would be most appreciated.

@gened Can you try just --exclude footage and see if that works?

Thanks @cdhowie

Here’s a twist I did not expect: both --exclude footage and --exclude **/footage/** work. The quotes seem to be what’s breaking it.

Thank you muchly for talking me through this.

G

That absolutely should not be the case, unless your shell does something really weird with double quotes.

What shell are you using? Do you see any difference between the output of these two commands?

echo **/footage/**
echo "**/footage/**"

Using bash on CentOS 7.4:

[gene@tws09 ~]$ echo **/footage/**
**/footage/**
[gene@tws09 ~]$ echo "**/footage/**"
**/footage/**

Sorry, I should have specified – can you run those commands from within the working directory that restic is run from?

[gene@tws09 19-0557_myproject]$ echo **/footage/**
cmn/footage/From_client/footage/graded cmn/footage/raw cmn/footage/ungraded
[gene@tws09 19-0557_myproject]$ echo "**/footage/**"
**/footage/**

Okay, so when you use --exclude **/footage/** in that directory, it’s expanding to:

--exclude cmn/footage/From_client/footage/graded cmn/footage/raw cmn/footage/ungraded

… which is NOT what you want. This excludes the first directory (cmn/footage/From_client/footage/graded) but then passes the other two as additional arguments to backup, which includes them.
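The expansion difference is easy to reproduce in a throwaway directory (bash needs globstar enabled for ** to match across directory levels):

```shell
# Reproduces the quoting difference in a scratch directory.
# globstar is needed for ** to recurse in bash; without it ** acts like *.
shopt -s globstar
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p cmn/footage/raw cmn/footage/graded

unquoted=$(echo **/footage/**)   # the shell expands this before restic sees it
quoted=$(echo "**/footage/**")   # passed through to restic literally

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```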

I would suggest that you stick with --exclude footage.

@fd0 Are you aware of any issue with leading ** in an exclude pattern?

ps shouldn’t show any quotes there, so it looks like you’re excluding files that actually have double quotes in their names. Usually you would get something like this if you used something like \"**/footage/**\" or '"**/footage/**"'. You briefly mentioned a backup script – can you paste it (with credentials redacted, of course)?
