The following code walks the whole disk (~12,000 files) to find all files with certain extensions. Running the script below takes around 5.034 s, whereas running similar Python code takes just 0.176 s.
using YAML

const imgExts = r"jpg|jpeg|png|gif"
const data::Dict = YAML.load(open("config.yaml"))

function listDir(dir_path::String)
    try
        for file in readdir(dir_path)
            if startswith(file, '.') || (joinpath(dir_path, file) in data["EXCLUDE"])
                continue
            end
            file = joinpath(dir_path, file)
            if isdir(file)
                listDir(file)
            else
                for ext in data["IMG_EXT"]
                    if endswith(file, ext)
                        println(file)
                    end
                end
            end
        end
    catch e
        if isa(e, Base.IOError)
            println("Operation not permitted in ", dir_path)
        else
            println(e)
        end
    end
end

for dir::String in data["PATHS"]::Vector{String}
    listDir(dir)
end
PS: @btime on the main loop gives 94.326 ms (136748 allocations: 9.90 MiB)
import os
import yaml

with open("config.yaml", 'r') as yaml_reader:
    cfg = yaml.load(yaml_reader, Loader=yaml.Loader)

def listDir(dir_path: str):
    try:
        files = [os.path.join(dir_path, file) for file in os.listdir(dir_path) if not file.startswith('.')]
        files = [file for file in files if file not in cfg["EXCLUDE"]]
    except PermissionError:
        print(f"Operation not permitted in {dir_path}")
        return
    [print(file) for file in files if file.endswith(tuple(cfg["IMG_EXT"]))]
    for file in files:
        if os.path.isdir(file):
            listDir(file)

for dir in cfg["PATHS"]:
    if os.path.isdir(dir):
        listDir(dir)
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Apple M2
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, westmere)
Threads: 1 on 8 virtual cores
Yes, but it stops as soon as it encounters an error, which I don’t want. As you can see in the code above, I added a try-catch block for that reason.
I’m not sure exactly what YAML.load is returning, but Dict without type parameters is not a concrete type. Dict{String,Any} would be a concrete type.
It would be better to pass in data["IMG_EXT"] as an argument to the function rather than refer to it like this. I’m guessing that the type returned from data["IMG_EXT"] is not predictable at compile time, so this is going to cause dynamic dispatch.
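A minimal sketch of that suggestion, not the original poster's code: `cfg` is a hypothetical stand-in for whatever YAML.load returns (its real type may differ), and `list_images` is an illustrative name. The extensions are converted once and passed in as a `Vector{String}`, so the compiler knows their type inside the function. For brevity the sketch collects paths instead of printing them and omits the try-catch and EXCLUDE handling.

```julia
# Hypothetical stand-in for the parsed config; the real return type of YAML.load may differ.
const cfg = Dict{String,Any}(
    "IMG_EXT" => Any[".jpg", ".png"],
)

# Dict alone is not concrete; a fully parameterized Dict is.
@show isconcretetype(Dict)
@show isconcretetype(Dict{String,Any})

# The extensions arrive as an argument with a concrete type, so the
# endswith calls in the loop can be resolved at compile time.
function list_images(dir_path::String, exts::Vector{String},
                     found::Vector{String}=String[])
    for name in readdir(dir_path)
        startswith(name, '.') && continue
        path = joinpath(dir_path, name)
        if isdir(path)
            list_images(path, exts, found)
        elseif any(ext -> endswith(path, ext), exts)
            push!(found, path)
        end
    end
    return found
end

# Convert once at the call site, then hand the typed vector to the function.
exts = Vector{String}(cfg["IMG_EXT"])
```

Calling `list_images(some_dir, exts)` then returns the matching paths; the lookup into the `Any`-valued dictionary happens only once, outside the hot loop.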
Another question is how are you timing everything? Were you counting compilation time earlier?
I admit I haven’t run the script, but I would imagine this kind of stuff is largely bounded by disk operation speed, not the language you use to ask for the filename …
Are you saying this might be because unpredictable data types prevent sufficient optimization on the first call? If that’s the case, note that I am not comparing against a compiled language, where all these things might be a bottleneck, but against Python.
Right. That’s the main issue. The operation itself does not take very long. We’re probably going to spend more time loading YAML and compiling the function than anything.
We actually do not have enough to execute this ourselves since we have neither the YAML nor the directories being scanned.
Try adding @time as follows so we can see where the time is being spent. That will also report the compilation time and garbage collection time.
@time using YAML

const imgExts = r"jpg|jpeg|png|gif"
@time const data::Dict = YAML.load(open("config.yaml"))

function listDir(dir_path::String)
    try
        for file in readdir(dir_path)
            if startswith(file, '.') || (joinpath(dir_path, file) in data["EXCLUDE"])
                continue
            end
            file = joinpath(dir_path, file)
            if isdir(file)
                listDir(file)
            else
                for ext in data["IMG_EXT"]
                    if endswith(file, ext)
                        println(file)
                    end
                end
            end
        end
    catch e
        if isa(e, Base.IOError)
            println("Operation not permitted in ", dir_path)
        else
            println(e)
        end
    end
end

for dir::String in data["PATHS"]::Vector{String}
    @time listDir(dir)
end
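To separate compilation from run time without the real config or directories, one option is to time the same call twice. The sketch below uses a toy function, `count_matches`, as a hypothetical stand-in for the directory scan; the first call includes JIT compilation for these argument types, while the second reuses the compiled method.

```julia
# Toy workload standing in for the directory scan (hypothetical function).
count_matches(names, ext) = count(n -> endswith(n, ext), names)

names = ["a.jpg", "b.png", "c.jpg"]

# First call: includes compiling count_matches for these argument types.
t_first = @elapsed count_matches(names, ".jpg")

# Second call: the compiled method is reused, so this is mostly run time.
t_second = @elapsed count_matches(names, ".jpg")

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

If the first number dominates, the script is spending its time compiling rather than scanning, which is what the @btime figure in the original post already hints at.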
What about the @time mkitti asked for in the main loop? Looking at the 3.34s chunk of the 5s runtime you provided so far, most of it (3.26s) is compilation, especially in the costly 2nd line. You do a couple calls in the 2nd line so it’s not clear whether YAML.load(::IOStream) or open(::String) is going through more compilation.
Minor annotation details:
you don’t need to annotate data::Dict =, the const would automatically fix the type of data to the type of the assigned value, whether it’s annotated or not. Dict is abstract anyway so if that overrode const somehow, inference would worsen.
the only thing the right-hand annotation data["PATHS"]::Vector{String} would do is throw an error if the type didn’t match. I really doubt the PATHS column in your file wouldn’t provide strings, and you don’t need to check for strings in the Python version.
for dir::String in is an assignment, so it attempts a type conversion instead of just checking for matching types. Since the iterable is a known vector of strings, you don’t need the conversion.
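The three annotation points can be tried in isolation. This is only a sketch with made-up names (`settings`, `typed_assignment`) and a hand-built dictionary in place of the YAML file; it uses a plain typed assignment to show the conversion behaviour that the post attributes to the typed loop variable.

```julia
# const fixes the binding's type to that of the assigned value; no annotation needed.
const settings = Dict{String,Any}("PATHS" => ["/tmp"])
@show typeof(settings)

# A right-hand type assertion only checks the type and passes the value through.
paths = settings["PATHS"]::Vector{String}

# On a mismatch the same assertion throws a TypeError.
failed = try
    Any["/tmp"]::Vector{String}
    false
catch err
    err isa TypeError
end

# A typed variable declaration converts on assignment instead of just checking.
function typed_assignment()
    x::Float64 = 1      # the Int 1 is converted to Float64
    return x
end
```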
You could use PackageCompiler.jl to build an app. This will compile everything ahead of time for you.
It would be nice if the YAML contributors used PrecompileTools.jl to precompile common uses of their functions. That could be you if you are volunteering.
Is this a common workflow for you or part of something larger?