Too many open files...?

Hmm… I recently made a 10TB exFAT volume on macOS, and a repo for use with both Windows and Mac. I decided to run a prune job on it and just received this:

repository 6a4c65bd opened successfully, password is correct
loading indexes…
loading all snapshots…
finding data that is still in use for 227 snapshots
[7:47] 100.00% 227 / 227 snapshots
searching used packs…
collecting packs for deletion and repacking
List(data) returned error, retrying after 552.330144ms: fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data/2e: too many open files
List(data) returned error, retrying after 1.080381816s: fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data/2e: too many open files
List(data) returned error, retrying after 1.31013006s: fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data: too many open files
List(data) returned error, retrying after 1.582392691s: fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data: too many open files
List(data) returned error, retrying after 2.340488664s: fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data: too many open files
List(data) returned error, retrying after 4.506218855s: fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data: too many open files
List(data) returned error, retrying after 3.221479586s: fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data: too many open files
List(data) returned error, retrying after 5.608623477s: fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data: too many open files
List(data) returned error, retrying after 7.649837917s: fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data: too many open files
List(data) returned error, retrying after 15.394871241s: fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data/2e: too many open files
[5:26] 95.30% 569793 / 597881 packs processed
fcntl /Users/smitmark/tmp/7/Backup/.restic-ohsu/data: too many open files
smitmark@RJHB595 ~ %

I'm wondering if I need to use a larger file allocation unit (I used 128k), or if this is a general exFAT issue, or what. In any case, it terminated before it could complete, and now I'm unsure what to do.

Edit: May have found an answer here:

A simple fix for the "too many files open" limitation of macOS is to use the "ulimit -n" command. Curiously, the value of n appears to be critical to whether or not this command is accepted by macOS.

I've found that ulimit -n 10240 (the default is 256) works, but higher values of n do not. 10240 files is probably more than enough for most users.

. . .

Adding the "ulimit -n 10240" statement to your bash profile (nano ~/.bash_profile) makes it permanent.
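In practice that boils down to something like this (a minimal sketch; 10240 is the cap mentioned above, and use ~/.zshrc instead if your shell is zsh, as it is on recent macOS):

# check the current per-process limit (the macOS default is 256)
ulimit -n

# raise it for the current shell session only
ulimit -n 10240

# persist it for future shells (bash example)
echo 'ulimit -n 10240' >> ~/.bash_profile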

Exactly. It's a matter of the limits your shell runs under at the time; it's outside of restic and not something restic does. Here's an article on the same subject: How to fix 'Too Many Open Files' in Linux

Did it work for you when you increased it to a higher value?


Currently running a restic check to ensure there isn’t any corruption. Should I do a rebuild-index anyway before attempting a prune, you think?

Given the output you showed, it doesn't look like prune did any writing; it was still just collecting information about your repository. So unless there was more to the output than what you showed above, I wouldn't worry, and I wouldn't feel a need to rebuild the index.

Cool, yeah, it did just say it was collecting packs for deletion and repacking, but I wasn't certain whether that meant collecting packs for deletion AND repacking as it goes, haha. Figured a check wouldn't hurt regardless. I'll try a prune afterward and report back!

Also, would there be any reason that exFAT would specifically do this where HFS+ or APFS wouldn't? I had just cloned this database over from an HFS+ volume, where it worked perfectly fine; I've never seen this error before. I don't typically put restic databases on exFAT, because it's so darn slow with many small files, but this is a 3TB database and I really wanted Windows/macOS interoperability. I was wondering whether the overhead from 128k allocation units was the culprit (it's a 10TB volume), or if it really just comes down to how many files are in a directory, making it file-system independent and just a coincidence that I finally hit the limit.
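For what it's worth, here's a rough way to see how many pack files each of the repository's data/ subdirectories holds (the path is my repo from the log above; adjust as needed):

for d in /Users/smitmark/tmp/7/Backup/.restic-ohsu/data/*/; do
  # print each subdirectory alongside its file count
  printf '%s %s\n' "$d" "$(ls "$d" | wc -l)"
done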

In any case, there is no need to run rebuild-index to ensure that prune doesn't do any harm: if there is anything missing from the index, prune will complain and abort without changing the repo.

On the other hand, if the index is fine, rebuild-index is very fast and doesn’t do any harm :wink:
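So if you want to be extra careful, a safe sequence would look something like this (a sketch; the repository path is a placeholder, and restic will prompt for the password):

restic -r /path/to/repo check
restic -r /path/to/repo rebuild-index   # optional; fast if the index is already fine
restic -r /path/to/repo prune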

@rawtaz It worked like a charm! I'm still very curious whether this is an exFAT issue or not… out of curiosity, I'm going to clone my repo to an APFS volume, undo the ulimit change, and see what it does. Thanks for helping me work through this "out loud", haha.

@alexweiss Hmm, in the past, if prune messed up or couldn't continue, I've often had to run rebuild-index. But I think that's because I was using pCloud as a backend, and it didn't discard partially uploaded files (so restic would think they were complete files).

So, hypothetical situation: my computer loses power in the middle of a prune operation. Is there any point in the prune process at which a rebuild-index would be necessary to recover?

Hitting the limit of open file descriptors sounds a bit like there’s a bug somewhere. Restic should only use a relatively low two-digit number of file descriptors at a time. Which restic version are you using?

That depends a bit on the restic version. In principle it shouldn't be necessary, although there are a few corner cases that are not handled properly yet. The next restic release will take care of a few of those (by using atomic cache and local backend writes). But in general, unless prune complains, there's no need to run rebuild-index.

@MichaelEischer I'm on restic 0.12.1 compiled with go1.16.6 on darwin/amd64, on macOS Catalina.

If you can reproduce the problem, please either use lsof -p <pid of restic> or the macOS Activity Monitor (select restic, open the process information, and switch to "Open Files and Ports"). Which files are reported there?
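For example, a quick way to sample the descriptor count once per second (a rough sketch that assumes exactly one restic process is running):

while pgrep -x restic > /dev/null; do
  # count the open files/ports reported for the restic process
  lsof -p "$(pgrep -x restic)" | wc -l
  sleep 1
done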

I did what Michael suggested and got at most 18 files in use at the same time, e.g.:

COMMAND   PID   USER   FD     TYPE             DEVICE  SIZE/OFF        NODE NAME
restic  15674 USER  cwd      DIR                1,4       384     8452342 /Users/USER/bin
restic  15674 USER  txt      REG                1,4  21198408     6582174 /Users/USER/go/bin/restic
restic  15674 USER  txt      REG                1,4     21024 12894749047 /Library/Preferences/Logging/.plist-cache.vEpy0Jc7
restic  15674 USER  txt      REG                1,4    972368 12892568910 /usr/lib/dyld
restic  15674 USER    0u     CHR               16,0 0x9d093d4         655 /dev/ttys000
restic  15674 USER    1u     CHR               16,0 0x9d093d4         655 /dev/ttys000
restic  15674 USER    2u     CHR               16,0 0x9d093d4         655 /dev/ttys000
restic  15674 USER    3     PIPE 0x9e1f98cac6661da9     16384             ->0x9e1f98cac66608a9
restic  15674 USER    4     PIPE 0x9e1f98cac66608a9     16384             ->0x9e1f98cac6661da9
restic  15674 USER    5u  KQUEUE                                          count=0, state=0xa
restic  15674 USER    6     PIPE 0x9e1f98cac6660129     16384             ->0x9e1f98cac6660de9
restic  15674 USER    7     PIPE 0x9e1f98cac6660de9     16384             ->0x9e1f98cac6660129
restic  15674 USER    8r     CHR               14,1    0t8192         584 /dev/urandom
restic  15674 USER    9u    IPv4 0x9e1f98caee00b179       0t0         TCP 192.168.184.7:54766->1.2.3.4:https (ESTABLISHED)
restic  15674 USER   10r     DIR                1,4       224     1019022 /Users/USER/Documents/foo
restic  15674 USER   11r     DIR                1,4       224     1019022 /Users/USER/Documents/foo
restic  15674 USER   12u     REG                1,4      5109 12898844647 /private/var/folders/rx/s_h0zqcd2sj7r70299njxk180000gp/T/restic-temp-pack-367536779
restic  15674 USER   13u     REG                1,4     39746 12898844648 /private/var/folders/rx/s_h0zqcd2sj7r70299njxk180000gp/T/restic-temp-pack-863827566

This was during the entire process; I checked about once every second.

Hmm, I couldn't reproduce it with --dry-run. Unfortunately, I'd already run prune after adding ulimit -n 10240 to my .zshrc profile, so technically it's not really the same conditions, even without that command. I have added a large ~400GB snapshot by rcloning it, which introduced 199GB of dupes that --dry-run said it would remove.

Going to try it without --dry-run and see what happens.

to repack: 872718 blobs / 412.713 GiB
this removes 358561 blobs / 198.619 GiB
to delete: 0 blobs / 1.327 GiB
total prune: 358561 blobs / 199.946 GiB
remaining: 8319142 blobs / 3.250 TiB
unused size after prune: 0 B (0.00% of remaining size)

Nope. Couldn't reproduce it. I'll keep the ulimit line commented out in my .zshrc profile and be sure to log my next prune.
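Something like this should capture the full output next time (a sketch; the repo path and log file name are just examples):

restic -r /path/to/repo prune 2>&1 | tee ~/restic-prune-$(date +%Y%m%d).log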

Thanks for the screenshot. I’m pretty sure I’ve fixed the bug in local: Fix fd leak when encountering files directly inside data/ by MichaelEischer · Pull Request #3568 · restic/restic · GitHub


[Screenshot: Screen Shot 2021-11-16 at 4.27.37 PM]

Yeah, I swear this has something to do with this being an exFAT volume. I’ve copied 24TB with RapidCopy before with no issues. Lots of small files, restic databases, etc.

Going to try to move it to another volume that’s HFS+ instead and see what it does.

Figured it out. The problem was twofold.

So there's apparently a bug with AppleDouble "._*" files: you can't always access them properly on an exFAT volume. restic, rclone, rsync, RapidCopy, and even cp were all failing. After deleting them, both rsync and RapidCopy functioned properly.

In addition, exFAT folder enumeration on a Mac gets EXTREMELY SLOW once you have several thousand files in a folder; it was taking a couple of minutes to copy each blob. After removing the AppleDouble files, which also took forever, the drive began functioning much more normally. Of course, they'll be regenerated over time, so I have a RapidCopy sync going that excludes "._*" (see the sketch below).
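For reference, the cleanup and the exclusion look roughly like this (a sketch; /Volumes/ExFAT is a stand-in for my volume, and dot_clean is Apple's built-in tool for dealing with AppleDouble files):

# delete the existing AppleDouble files on the exFAT volume
find /Volumes/ExFAT -name '._*' -type f -delete

# or merge/strip them with the built-in macOS utility
dot_clean -m /Volumes/ExFAT

# rsync equivalent of the "._*" exclusion, if you're not using RapidCopy
rsync -a --exclude '._*' /Volumes/ExFAT/repo/ /Volumes/HFS/repo/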

Some of the errors I had gotten had me thinking the drives were failing, but no: SMART checks out, and it's behaving normally now that each folder holds about half the files it used to.

I nearly have the whole thing copied over to an HFS+ disk instead. Lesson learned!


@akrabu Can you please mark the answer you think is the most applicable one as the solution to this thread? I’m not sure if it’s Michael’s or your last one :3


Haha, I'm not sure either! I haven't tried the PR yet and kind of went my own way with it. I did have two corrupted blobs after everything was said and done, so I backed up again; it re-added the blobs, and everything appears to be fine now.

Buuuut I think I also accidentally rsynced /data to /data/data, and that was the initial issue. So I'll mark his as the solution. But it's also good to note that exFAT may not play nicely with large restic repos and AppleDouble files on a Mac!