I would like to write a parallel for loop over the files in a directory tree that opens each file and does some long computation. I started by writing something like this:
@threads for (root, dirs, files) in walkdir(path)
    open_and_process(files)
end
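This does not work because length(::Channel{Any}) does not exist, which is understandable: walkdir returns a Channel, and @threads needs to know the length of what it iterates over. It is easy to work around this with @spawn: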
for (root, dirs, files) in walkdir(path)
    @spawn open_and_process(files)
end
and then wait for the tasks to complete in a second step. This second solution, however, does not seem to take the JULIA_NUM_THREADS variable into account, in contrast to the @threads macro. Limiting the number of available threads is essential in my application: I’m running on a cluster with dozens of cores, and using all of them causes “too many open files” errors.
I see two ways of fixing this: 1) get a list of the files beforehand, store them in an array, and go back to using @threads; 2) implement a bit of logic to @spawn a subset of the tasks and wait for them to complete in an intermediate step before starting to spawn again.
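For illustration, option 2 could be sketched roughly as follows. This is only a sketch under assumptions: process_file is a hypothetical stand-in for the real per-file work (here it just returns the file size), and process_in_batches and batchsize are names made up for this example. The batch size caps how many tasks, and therefore how many open files, can be in flight at once.

```julia
using Base.Threads: @spawn

# Hypothetical stand-in for the real per-file work (assumption, not from the post).
process_file(path) = filesize(path)

# Spawn at most `batchsize` tasks, wait for all of them, then continue walking.
function process_in_batches(path; batchsize = 8)
    results = Int[]
    batch = Task[]
    for (root, dirs, files) in walkdir(path)
        for file in files
            fullpath = joinpath(root, file)
            t = @spawn process_file(fullpath)
            push!(batch, t)
            if length(batch) >= batchsize
                # Intermediate step: block until the current batch is done.
                append!(results, fetch.(batch))
                empty!(batch)
            end
        end
    end
    # Collect whatever is left in the last, possibly partial, batch.
    append!(results, fetch.(batch))
    return results
end
```

One drawback of this approach is that every batch waits for its slowest task, so some threads may sit idle near the end of each batch.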
Is there a simpler solution I’m not considering here?
(Julia version 1.4.1)
EDIT: a remark on option 1: since the file list is long and the disk I’m working on is rather slow, gathering the files into a list takes quite a bit of time. Not being able to start tasks as the files come in makes this option somewhat less efficient.
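One possible way to start work as the files come in while still bounding concurrency is to feed paths through a Channel to a fixed number of worker tasks. The following is a sketch under assumptions, not a tested solution for the cluster setup described above: process_file is again a hypothetical stand-in for the real work, and process_streaming and nworkers are names invented for this example.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical stand-in for the real per-file work (assumption, not from the post).
process_file(path) = filesize(path)

function process_streaming(path; nworkers = nthreads())
    # Producer: walks the tree and pushes full paths into a bounded channel.
    # The channel is closed automatically when the producer task finishes.
    paths = Channel{String}(2 * nworkers) do ch
        for (root, dirs, files) in walkdir(path), file in files
            put!(ch, joinpath(root, file))
        end
    end
    # Consumers: exactly `nworkers` tasks, each taking paths until the channel closes,
    # so at most `nworkers` files are being processed at any time.
    workers = map(1:nworkers) do _
        @spawn begin
            local_results = Int[]
            for p in paths
                push!(local_results, process_file(p))
            end
            local_results
        end
    end
    return reduce(vcat, fetch.(workers))
end
```

With this layout the directory walk and the processing overlap, and the number of simultaneously open files is tied to nworkers rather than to the number of files found. The order of the results is not deterministic.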