I am solving a problem that is highly parallel, and I am wondering how to make it as efficient as possible in Julia. I need to run a function (let’s call it `do_work`) on 1D slices of 3D arrays stored on disk. For example, if each array is structured as `arr1[nz, ny, nx]`, then `do_work` takes several 1D arrays as arguments, e.g.:

```julia
result[j, i] = do_work(arr1[:, j, i], arr2[:, j, i], arr3[:, j, i])
```
In practice, `do_work` also reads a few other arrays that are always 1D, plus 1D slices of some 4D arrays, but I think that is not too important right now. The run time of `do_work` is typically 10-100 ms.
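For concreteness, here is a hypothetical stub with the same shape as my real function (the body is just a placeholder; the real computation is what takes 10-100 ms):

```julia
# Hypothetical stand-in for the real do_work.
# Each argument is a 1D slice of length nz taken from one of the 3D arrays.
function do_work(col1::AbstractVector, col2::AbstractVector, col3::AbstractVector)
    # placeholder computation returning a single scalar per (j, i) pair
    return sum(col1) + sum(col2 .* col3)
end
```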
The challenge is to read the arrays from disk and pass them to `do_work` in an efficient and parallel way. These arrays are typically too large to fit in memory (each dimension is on the order of 500-2000). The problem is perfectly parallel in the x and y dimensions: it does not matter in which order the 1D slices are read and the results written to `result[j, i]`.
My initial implementation used `Threads` for the parallelism. Here’s a simplified snippet:
```julia
using HDF5

function my_work()
    # Read the full 3D input arrays into memory up front
    arr1 = read_var(IN_FILE, "arr1")
    arr2 = read_var(IN_FILE, "arr2")
    arr3 = read_var(IN_FILE, "arr3")
    nz, ny, nx = size(arr1)
    result = zeros(ny, nx)
    # Every (j, i) pair is independent, so thread over both dimensions at once
    Threads.@threads for ij in CartesianIndices((ny, nx))
        j, i = Tuple(ij)
        # Note: each arr[:, j, i] slice allocates a fresh 1D array
        result[j, i] = do_work(arr1[:, j, i], arr2[:, j, i], arr3[:, j, i])
    end
    h5write(OUTPUT_FILE, "result", result)
end
```
I find this generally works ok, but the speedup from `Threads` is far from perfect scaling (serial time / number of threads). Looking at CPU usage and timings, it seems that with more than 8 threads there are periods when the computation gets “stuck”: each thread sits at 100% CPU, but progress is very slow.
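One thing I have not ruled out is garbage-collection pressure from the per-slice copies (each `arr[:, j, i]` above allocates a new array). A quick check, assuming the `my_work` from the snippet above, is something like:

```julia
# Fraction of wall time spent in garbage collection; a large value would
# suggest the allocations from slicing are hurting thread scaling.
stats = @timed my_work()
println("GC fraction: ", stats.gctime / stats.time)
```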
As for reading the files, I’ve tried two approaches: `Mmap` (reading only as needed) and HDF5 (reading whole arrays at once with `h5read`). The `Mmap` option is much slower, but often I cannot read the entire arrays into memory; `h5read` also incurs a long upfront overhead to read the full arrays.
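To illustrate the two approaches, here is a minimal sketch (`raw_file.bin` is a placeholder, and the `Mmap` version assumes the array is stored as a raw column-major binary blob with known dimensions):

```julia
using Mmap, HDF5

# Approach 1: memory-map a raw binary file; pages are read only on access
arr1 = open("raw_file.bin") do io              # placeholder filename
    Mmap.mmap(io, Array{Float64,3}, (nz, ny, nx))
end

# Approach 2: read the entire dataset into memory up front via HDF5
arr1 = h5read(IN_FILE, "arr1")
```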
I would really appreciate some advice on how to make this type of workload more efficient. Currently I am using HDF5 for both input and output files, but switching to any other format is possible if it works better. Other suggestions for simple parallelism besides `Threads` are also very welcome!