Exclude syntax confusion

Apologies for the reply spaghetti, and the deleted posts. I suspect I know where things went sideways in relation to being able to debug this.

Instead of creating a repo per project, I naively created a single repo at the root level of the bucket, into which I’m dumping all the projects. This may or may not come back to haunt me [gut check, I feel that it will] – it likely means that I won’t be able to determine the data footprint of the packs transferred so far. I will likely need to reorganize this at some point.

[edit: with this new info I’ve decided to go back and adjust the backup script to first attempt to create a project-specific repo in the bucket – I’ve created new buckets for this purpose, so in theory once my new upload (of all the projects, ugh!) has completed, I should be able to simply delete the old buckets. It feels better organized – @cdhowie thank you for the tip.]

Having said that, is there anything else worth considering with regard to why these (large?) files are being looked at by restic? Is it safe to assume that this is a likely cause of the slowdown?

[post-post edit: this experiment definitely highlighted the fact that there is something amiss. The first project being archived shows as follows:

[gene@tws09 cmn]$ du -sh *
74M deliverables
58M edit
127G footage
68K onset
68K supers

(expecting about 132MB of data to be written)

and now that I’m able to see the individual project repos with the web interface, the upload is reporting 35GB uploaded to B2 thus far.]

I think if you pass -v to the backup command, it will give you a list of the files it’s uploading. You can use this to troubleshoot your exclude patterns. (Maybe try backing up to a local filesystem repository to test, until you have the patterns working the way you want.)

I set up a not-dissimilar configuration with all but one small directory excluded, and I then removed the exclusions progressively as I was ready for additional chunks of data to back up.

While I didn’t monitor the file system directly, restic definitely didn’t read the entire contents of excluded directories as the initial backup finished faster than would be possible. I can’t speak to whether restic wandered the directory structure.

What is the exact exclude command you’re using?

Below is the command string I’m using:

restic --exclude "**/footage/**" -r b2:my-bucket-2019 backup /projects/my_project

The simplified directory structure is:

19-0564_my_project
└── cmn
    ├── deliverables
    ├── edit
    ├── footage
    │   ├── graded
    │   ├── raw
    │   └── ungraded
    ├── onset
    └── supers

I’m hoping to exclude all the files inside/under the footage directory, but I’m clearly mangling the exclude string. Any insight would be most appreciated.

@gened Can you try just --exclude footage and see if that works?

Thanks @cdhowie

Here’s a twist I did not expect. Both --exclude footage and --exclude **/footage/** work. The quotes seem to be what’s breaking it.

Thank you muchly for talking me through this.

G

That absolutely should not be the case, unless your shell does something really weird with double quotes.

What shell are you using? Do you see any difference between the output of these two commands?

echo **/footage/**
echo "**/footage/**"

Using bash on CentOS 7.4:

[gene@tws09 ~]$ echo **/footage/**
**/footage/**
[gene@tws09 ~]$ echo "**/footage/**"
**/footage/**

Sorry, I should have specified – can you run those commands from within the working directory that restic is run from?

[gene@tws09 19-0557_myproject]$ echo **/footage/**
cmn/footage/From_client/footage/graded cmn/footage/raw cmn/footage/ungraded
[gene@tws09 19-0557_myproject]$ echo "**/footage/**"
**/footage/**

Okay, so when you use --exclude **/footage/** in that directory, it’s expanding to:

--exclude cmn/footage/From_client/footage/graded cmn/footage/raw cmn/footage/ungraded

… which is NOT what you want. This excludes the first directory cmn/footage/From_client/footage/graded but then passes the other two as arguments to backup, which includes them.
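
To illustrate the split described above, here is a minimal Python sketch. It uses argparse as a hypothetical stand-in for a command-line parser with one --exclude option and positional backup targets. It is not restic's actual parser, and the parse function and the simplified argument list are assumptions made for the example.

```python
import argparse


def parse(argv):
    # Hypothetical stand-in CLI: --exclude takes exactly one value,
    # everything else is treated as a path to back up.
    parser = argparse.ArgumentParser()
    parser.add_argument("--exclude", action="append", default=[])
    parser.add_argument("targets", nargs="*")
    return parser.parse_args(argv)


# What the program receives after the shell has expanded the unquoted glob
# (paths taken from the echo output earlier in the thread).
expanded = [
    "--exclude",
    "cmn/footage/From_client/footage/graded",
    "cmn/footage/raw",
    "cmn/footage/ungraded",
    "/projects/my_project",
]

args = parse(expanded)
print("excluded:", args.exclude)
print("backed up:", args.targets)
```

Only the first expanded path ends up as the exclude value; the remaining two land among the backup targets, next to the path that was meant to be backed up.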

I would suggest that you stick with --exclude footage.

@fd0 Are you aware of any issue with leading ** in an exclude pattern?

ps shouldn’t show any quotes there so it looks like you’re excluding files actually having double quotes in the name. Usually, you would get something like this if you use something like \"**/footage/**\" or '"**/footage/**"'. You briefly mentioned a backup script, can you paste it (with credentials redacted of course)?


Very good catch, I totally missed that!

Thanks @cdhowie, and @Julian for helping me resolve this issue.

Perhaps out of habit, I put the exclude parameter in quotes (trying to avoid shell expansion of the expression).

I called out my mistake in the post above.

Once again, thank you both.

G

Right, but we’re saying that --exclude="**/footage/**" should work because the shell erases the quotes. However, based on the ps output it looks like the quotes were actually passed to restic.

Were you doing something like --exclude='"**/footage/**"'?

We’re just trying to understand what happened.

I hear ya @cdhowie.

I don’t think the shell erases the quotes. I think the shell will make substitutions if the pattern falls into the category of something it knows about, for example:

If I call du -sh *c*, the shell replaces *c* with the matching names it finds – I say this because invoking ps on the du process shows:

[gene@tws09 ~]$ ps -ef | grep du
gene     41185  38406  6 11:49 pts/4    00:00:00 du -sh cal cmn

Instinctively (and wrongly in this case) I put the expression in quotes, hoping to prevent the shell from getting involved in the globbing process.

Hope that clarifies.

G

Yes, it should. Compare the output of these commands:

chris@liz:~$ echo --test=foo
--test=foo
chris@liz:~$ echo --test="foo"
--test=foo

Do you get different output?

That’s the strange thing – what you did was actually right. The quotes should be erased by the shell.

The only guess I have is that you are calling restic from a script, and the shebang line in the script invokes a shell that doesn’t erase quotes, and that’s a long shot.

Ha. That’s awesome – I did not know that the shell behaves this way with quotes. Thanks for the schooling. I will watch out for that in the future.
And yes, you are 100% correct as to the script. I’m invoking restic from a python subprocess, which probably trips this behaviour up.

import subprocess

# Split the command string on spaces and run restic directly from Python.
cmd = 'restic -o b2.connections=20 --exclude footage -r b2:projects-{}:{} backup {}'.format(proj_creation_year, project, proj_path)
lcmd = cmd.split(' ')
ps = subprocess.Popen(lcmd, env=restic_env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

Well – the token says footage now, but it used to say "**/footage/**", which is where this whole thing went south for me! :wink:


Ah, yup, there it is! subprocess.Popen() doesn’t invoke a shell, it takes your arguments array and passes them to the target program as-is. This means that quotes and * characters alike are untouched, so using **/footage/** without quotes here is safe.

And, of course, using quotes means they get passed to restic, which will dutifully ignore any paths that start and end with a " character, and have /footage/ somewhere in them. :slight_smile:
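
A small sketch of the difference, assuming a simplified version of the original command string (the repository option is left out, and the child process is just a Python one-liner that echoes its first argument, not restic). It compares a plain split on spaces with shlex.split, which follows POSIX shell quoting rules, and then shows what a program started from a list actually receives.

```python
import shlex
import subprocess
import sys

# Simplified command string with the quoted pattern, as it used to be in the script.
cmd = 'restic --exclude "**/footage/**" backup /projects/my_project'

# Plain string split: the double quotes stay inside the token.
naive = cmd.split(' ')

# shlex.split applies POSIX shell quoting rules: the quotes are removed,
# and no glob expansion takes place.
shell_like = shlex.split(cmd)

print(naive[2])       # the pattern with literal double quotes around it
print(shell_like[2])  # the bare pattern

# Start a child process from a list (no shell involved) and have it report
# the argument it received.
child = [sys.executable, "-c", "import sys; print(sys.argv[1])", naive[2]]
received = subprocess.run(
    child, capture_output=True, text=True, check=True
).stdout.strip()
print(received)       # the quotes arrive in the child unchanged
```

If the command has to stay a single string, splitting it with shlex.split instead of split(' ') is one way to keep shell-style quoting habits working. Building the argument list by hand avoids the question altogether.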

Thanks for satisfying my curiosity.


Oh and look, you aren’t the first person to trip up here. I knew this sounded familiar… I helped someone else out with this exact issue!


Heh, well, at least I’m not alone.

Thank you for your patience in getting to the bottom of this. TIL.

On a side note, I would not be surprised to see the " character in a file name one day… It will be a dark day. :wink:
