Searching files on disk running 28 times faster in Python than in Julia. What am I doing wrong?

The following code walks the whole disk (~12000 files) to find all files with certain extensions. Running the script below takes around 5.034 secs, whereas running the equivalent Python code takes just 0.176 secs.

using YAML
const imgExts = r"jpg|jpeg|png|gif"
const data::Dict = YAML.load(open("config.yaml"))

function listDir(dir_path::String)
    try
        for file in readdir(dir_path)
            if startswith(file, '.') || (joinpath(dir_path, file) in data["EXCLUDE"])
                continue
            end
            file = joinpath(dir_path, file)
            if isdir(file)
                listDir(file)
            else
                for ext in data["IMG_EXT"]
                    if endswith(file, ext)
                        println(file)
                    end
                end
            end
        end
    catch e
        if isa(e, Base.IOError)
            println("Operation not permitted in ", dir_path)
        else
            println(e)
        end
    end
end

for dir::String in data["PATHS"]::Vector{String}
    listDir(dir)
end

PS: @btime on the main loop gives,
94.326 ms (136748 allocations: 9.90 MiB)

  1. What is the Python script you are comparing it to?
  2. What version of Julia are you using and what operating system are you using (the output of versioninfo())?
  3. Have you considered using walkdir?
  1. The Python Script:
import os
import yaml

with open("config.yaml", 'r') as yaml_reader:
    cfg = yaml.load(yaml_reader, Loader=yaml.Loader)

def listDir(dir_path: str):
    try:
        files = [os.path.join(dir_path, file) for file in os.listdir(dir_path) if not file.startswith('.')]
        files = [file for file in files if file not in cfg["EXCLUDE"]]
    except PermissionError:
        print(f"Operation not permitted in {dir_path}")
        return

    [print(file) for file in files if file.endswith(tuple(cfg["IMG_EXT"]))]

    for file in files:
        if os.path.isdir(file):
            listDir(file)

for dir in cfg["PATHS"]:
    if os.path.isdir(dir):
        listDir(dir)
  2. Output of versioninfo():
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, westmere)
  Threads: 1 on 8 virtual cores
  3. Yes, but walkdir stops when it encounters an error, and I don’t want that to happen. As you can see in the code above, I added a try-catch block for that reason.
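
For what it’s worth, walkdir accepts an onerror keyword, so an unreadable directory does not have to abort the walk. Below is a minimal sketch, assuming a hypothetical helper find_images and an extension list passed in by the caller (neither is from the original script). Unlike the original, it does not skip hidden directories or honor the EXCLUDE list.

```julia
# Hypothetical helper (not from the original script): collect files under
# `root` whose names end in one of `exts`, skipping hidden files.
function find_images(root::AbstractString, exts::Vector{String})
    found = String[]
    # `onerror` receives the IOError raised for an unreadable directory;
    # reporting it instead of rethrowing lets the walk continue.
    report(err) = println(stderr, "Skipping: ", sprint(showerror, err))
    for (dir, _, files) in walkdir(root; onerror = report)
        for file in files
            startswith(file, '.') && continue
            if any(ext -> endswith(file, ext), exts)
                push!(found, joinpath(dir, file))
            end
        end
    end
    return found
end
```

Calling find_images on a root directory with a list such as [".jpg", ".png"] returns the matching paths as a Vector{String}, which can then be printed or filtered further.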

I’m not sure exactly what YAML.load is returning but Dict is an abstract type. Dict{String,Any} would be a concrete type.

It would be better to pass in data["IMG_EXT"] as an argument to the function rather than refer to it like this. I’m guessing that the type returned from data["IMG_EXT"] is not predictable at compile time, so this is going to cause dynamic dispatch.
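
A rough sketch of that suggestion, assuming the config values are lists of strings. The function name list_images! and the stand-in data dictionary are made up for illustration; in the real script data would come from YAML.load.

```julia
# Hypothetical rewrite: the extension list and exclusion set are converted to
# concrete types once and passed in, so the function never touches the global.
function list_images!(out::Vector{String}, dir_path::String,
                      exts::Vector{String}, exclude::Set{String})
    try
        for name in readdir(dir_path)
            startswith(name, '.') && continue
            path = joinpath(dir_path, name)
            path in exclude && continue
            if isdir(path)
                list_images!(out, path, exts, exclude)
            elseif any(ext -> endswith(path, ext), exts)
                push!(out, path)
            end
        end
    catch e
        e isa Base.IOError || rethrow()
        println("Operation not permitted in ", dir_path)
    end
    return out
end

# Stand-in for the parsed config (assumption: same keys as the original file).
data = Dict{Any,Any}("IMG_EXT" => Any[".png", ".jpg"], "EXCLUDE" => Any[])

# Convert once, outside the hot loop.
exts = Vector{String}(data["IMG_EXT"])
exclude = Set{String}(data["EXCLUDE"])
```

Collecting into a vector instead of printing inside the recursion is a design choice of this sketch; it also makes it easy to print everything in one go afterwards.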

Another question is how are you timing everything? Were you counting compilation time earlier?

I admit I haven’t run the script, but I would imagine this kind of stuff is largely bound by disk operation speed, not the language you use to ask for the filename …

  • Are you saying that the unpredictable data type prevents proper optimization on the first call? If so, note that I am not comparing against a compiled language, where such things might be a bottleneck, but against Python.
  • I am timing the whole script with: time julia search.jl
  • How should I “count” the compilation time?
  1. The script is just listing the files, not reading them.
  2. The read/write speed of the disk is on the order of 400 MB/s, so I don’t think this should be an issue.
  3. Even if this were a bottleneck in some remote case, the same would be true for the Python code.

Right. That’s the main issue. The operation itself does not take very long. We’re probably going to spend more time loading YAML and compiling the function than anything.
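
As a rough illustration of what counting compilation time means, the first call to a function pays for JIT compilation and later calls do not. The helper count_matches and the sample names below are made up for this sketch; actual timings will vary by machine.

```julia
# Hypothetical helper: count names ending in one of the given extensions.
count_matches(names, exts) = count(n -> any(e -> endswith(n, e), exts), names)

names = ["a.png", "b.txt", "c.jpg"]
exts = [".png", ".jpg"]

first_call  = @elapsed count_matches(names, exts)  # includes compiling count_matches
second_call = @elapsed count_matches(names, exts)  # reuses the compiled code
println("first call: ", first_call, " s, second call: ", second_call, " s")
```

Timing a whole script with the shell's time command, as in time julia search.jl, lumps startup, package loading, and this first-call cost together with the actual directory scan.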

We actually do not have enough to execute this ourselves, since we have neither the YAML file nor the directories being scanned.

Try adding @time as follows so we can see where the time is being spent. That will also report the compilation time and garbage collection time.

@time using YAML
const imgExts = r"jpg|jpeg|png|gif"
@time const data::Dict = YAML.load(open("config.yaml"))

function listDir(dir_path::String)
    try
        for file in readdir(dir_path)
            if startswith(file, '.') || (joinpath(dir_path, file) in data["EXCLUDE"])
                continue
            end
            file = joinpath(dir_path, file)
            if isdir(file)
                listDir(file)
            else
                for ext in data["IMG_EXT"]
                    if endswith(file, ext)
                        println(file)
                    end
                end
            end
        end
    catch e
        if isa(e, Base.IOError)
            println("Operation not permitted in ", dir_path)
        else
            println(e)
        end
    end
end

for dir::String in data["PATHS"]::Vector{String}
    @time listDir(dir)
end

So YAML.load itself is taking 3.27 secs. Is there any way to reduce the YAML.load time?

Could you show the actual output from the timings? The full information would be very useful.

  1. using YAML
     0.066841 seconds (44.15 k allocations: 3.698 MiB, 10.99% compilation time)
  2. const data::Dict = YAML.load(open("config.yaml"))
     3.272388 seconds (2.98 M allocations: 187.058 MiB, 2.17% gc time, 99.43% compilation time)

What about the @time mkitti asked for in the main loop? Looking at the 3.34s chunk of the 5s runtime you provided so far, most of it (3.26s) is compilation, especially in the costly 2nd line. You do a couple calls in the 2nd line so it’s not clear whether YAML.load(::IOStream) or open(::String) is going through more compilation.

Minor annotation details:

  1. you don’t need to annotate data::Dict =, the const would automatically fix the type of data to the type of the assigned value, whether it’s annotated or not. Dict is abstract anyway so if that overrode const somehow, inference would worsen.
  2. the only thing the right-hand annotation data["PATHS"]::Vector{String} would do is throw an error if the type didn’t match. I really doubt the PATHS column in your file wouldn’t provide strings, and you don’t need to check for strings in the Python version.
  3. for dir::String in is an assignment, so it attempts a type conversion instead of just checking for matching types. Since the iterable is a known vector of strings, you don’t need the conversion.
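
A small sketch of these three points, using a hand-written dictionary in place of the YAML result. The name cfg and its contents are assumptions for illustration only.

```julia
# `const` alone pins the binding to the concrete type of the assigned value.
const cfg = Dict{String,Any}("PATHS" => ["/tmp", "/var"])
println(typeof(cfg))                        # the concrete type, no annotation needed
println(isconcretetype(Dict))               # false: the type parameters are unspecified
println(isconcretetype(Dict{String,Any}))   # true

# Converting once up front gives the loop a known element type, so neither
# `dir::String` nor a trailing `::Vector{String}` assertion is needed.
paths = Vector{String}(cfg["PATHS"])
for dir in paths
    println(dir)
end
```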

You could use PackageCompiler.jl to build an app. This will compile everything ahead of time for you.

It would be nice if the YAML contributors used PrecompileTools.jl to precompile common uses of their functions. That could be you if you are volunteering.

Is this a common workflow for you or part of something larger?


Another angle is that Python buffers printing by default; in Julia you need to do it yourself or use BufferedStreams.jl (JuliaIO/BufferedStreams.jl, “Fast composable IO streams”), but see issue #81 in that repository, “UX for flushing buffer”.
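
If per-line printing does turn out to matter, one way to do the buffering by hand is to accumulate the output in an IOBuffer and write it once. This is only a sketch using Base, with a hypothetical helper print_buffered; it is not the BufferedStreams.jl API.

```julia
# Hypothetical helper: gather all lines in memory, then emit them with a
# single write instead of one write per line.
function print_buffered(io::IO, paths)
    buf = IOBuffer()
    for p in paths
        println(buf, p)
    end
    write(io, take!(buf))
    return nothing
end

print_buffered(stdout, ["a.png", "sub/b.jpg"])
```

The trade-off is that nothing appears until the end, so for very long scans it may be preferable to flush the buffer every so often.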
