Why is walkdir + regex so much slower than glob?

Hi!

A coworker and I were both writing Julia code that did essentially the same thing as a specific GNU find command.

For my implementation, I used a regex for the filename and walkdir to find the files, something like this:

for (root, dirs, files) in walkdir(".")
    # This is how I skipped directories with misleading contents that I didn't want to parse later
    deleteat!(dirs, findall(x -> x == "MisleadingDirectoryName", dirs))
    for file in files
        # occursin returns a Bool; match returns nothing when there is no match
        if occursin(MATCH_REGEX, file)
            # do input
        end
    end
end

Whereas my coworker did

using Glob  # glob comes from the Glob.jl package

files = glob("MATCH_GLOB", ".")
for file in files
    # skip anything under the misleading directory
    occursin("MisleadingDirectoryName", file) && continue
    # do input
end

On a huge directory structure, his code finishes in about 20 seconds, whereas mine takes about 20 minutes. He has been coding in Julia a lot longer, so his code is way more elegant, but my question is: why does it run so much faster? Is regex just that bad? Or is walkdir slow?

Thanks!


walkdir is probably just written badly for handling nested directory trees. I may fix this.
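To check whether the traversal or the regex dominates, one option is to time the two separately. This is only a rough sketch: `pattern` and `root_dir` are hypothetical stand-ins for the `MATCH_REGEX` and directory from the question, and a single `@elapsed` run includes compilation time, so treat the numbers as indicative only.

```julia
# Hypothetical stand-ins for the pattern and root directory used in the question.
pattern = r"\.txt$"
root_dir = "."

# Time the traversal alone: collect every file name without matching anything.
names = String[]
t_walk = @elapsed for (root, dirs, files) in walkdir(root_dir)
    append!(names, files)
end

# Time the regex alone over the names that were already collected.
t_regex = @elapsed count(name -> occursin(pattern, name), names)

println("walkdir: ", t_walk, " s, regex: ", t_regex, " s over ", length(names), " names")
```

If `t_walk` accounts for nearly all of the time, the regex is not the problem.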


I merged https://github.com/JuliaLang/julia/pull/36856 just now, so I’d be happy to hear how much that improved your case. They are still a bit different in that glob doesn’t recursively visit directories (more like readdir), so it may still do more work, depending on what your coworker did to get around that.
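To illustrate the difference in how much work is done, here is a small sketch using only Base functions. The temporary tree and its file names are made up for the example, and `readdir` stands in for the non-recursive behaviour described above; it is not a statement about how Glob.jl is implemented.

```julia
# Build a small throwaway tree: top/a.txt and top/sub/b.txt (hypothetical names).
top = mktempdir()
mkdir(joinpath(top, "sub"))
touch(joinpath(top, "a.txt"))
touch(joinpath(top, "sub", "b.txt"))

# readdir lists only the entries directly inside `top`; it does not descend.
flat = readdir(top)

# walkdir descends into every subdirectory, yielding (root, dirs, files) each time.
recursive = String[]
for (root, dirs, files) in walkdir(top)
    for file in files
        push!(recursive, joinpath(root, file))
    end
end

println(flat)       # entries of the top level only
println(recursive)  # full paths of every file in the tree
```

On a deep tree, the recursive version touches every directory, so the two approaches are only comparable if both end up visiting the same set of directories.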

You can download the same binary as CI used by running the command from Buildbot
