I would like to write a parallel for loop over the files in a directory tree that opens each file and runs a long computation on it. I started with something like this:
@threads for (root, dirs, files) in walkdir(path)
    open_and_process(files)
end
which does not work because length(::Channel{Any}) does not exist. That is understandable: @threads needs the length of the collection up front, and walkdir returns a Channel. It is easy to work around this with @spawn:
for (root, dirs, files) in walkdir(path)
    @spawn open_and_process(files)
end
and wait for the tasks to complete in a second step. This second solution, however, does not seem to respect the JULIA_NUM_THREADS variable, in contrast to the @threads macro. Limiting the number of threads in use is essential in my application: I’m running on a cluster with dozens of cores, and using all of them causes “too many open files” errors.
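For completeness, the spawn-then-wait version could look roughly like the sketch below. The open_and_process stub and the process_tree name are placeholders for illustration only, not the real code:

```julia
using Base.Threads: @spawn

# Placeholder for the real computation (assumption): it only counts the
# file names it receives instead of opening and processing them.
open_and_process(files) = length(files)

# Hypothetical helper: spawn one task per directory, then wait for all of them.
function process_tree(path)
    tasks = Task[]
    for (root, dirs, files) in walkdir(path)
        # Every task is created immediately, so nothing bounds how many exist at once.
        push!(tasks, @spawn open_and_process(files))
    end
    # Second step: block until each task has finished and collect its result.
    return map(fetch, tasks)
end
```

Nothing in this loop limits how many tasks are alive at the same time, which is presumably where the open-file limit gets hit.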
I see two ways of fixing this: 1) get the list of files beforehand, store it in an array and go back to using @threads; 2) implement a bit of logic to @spawn only part of the tasks and wait for them to complete in an intermediate step before starting to spawn again.
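To make option 2 concrete, here is a minimal sketch of the batching logic. The names process_in_batches and batch_size, as well as the stub open_and_process, are made up for illustration, and defaulting the batch size to nthreads() is just one possible choice:

```julia
using Base.Threads: @spawn, nthreads

# Placeholder for the real computation (assumption): it only counts the
# file names it receives instead of opening and processing them.
open_and_process(files) = length(files)

# Hypothetical helper: never keep more than `batch_size` tasks in flight.
function process_in_batches(path; batch_size = nthreads())
    results = Any[]
    tasks = Task[]
    for (root, dirs, files) in walkdir(path)
        push!(tasks, @spawn open_and_process(files))
        if length(tasks) >= batch_size
            # Intermediate step: wait for the current batch before spawning more.
            append!(results, map(fetch, tasks))
            empty!(tasks)
        end
    end
    # Collect whatever is left in the last, possibly partial, batch.
    append!(results, map(fetch, tasks))
    return results
end
```

This keeps at most batch_size tasks open at once while still starting work as walkdir produces entries, but each batch waits for its slowest task, so some threads may sit idle in between.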
Is there a simpler solution I’m not considering here?
(Julia version 1.4.1)
EDIT: a remark on option 1: since the file list is long and the disk I’m working on is rather slow, gathering the files into a list takes quite a bit of time. Not starting tasks as the files are discovered makes this option somewhat less efficient.