Advice on speeding up highly parallel and I/O intensive code

I am solving a problem that is highly parallel, and wondering how to make it as efficient as possible in Julia. I need to run a function (let’s call it do_work) on 1D slices of 3D arrays written on disk. For example, if each array is structured as arr1[nz, ny, nx], the function do_work will take as arguments several 1D arrays, e.g.:

result[j, i] = do_work(arr1[:, j, i], arr2[:, j, i], arr3[:, j, i])

In practice, the function do_work also reads a few other arrays that are always 1D and other 1D slices of 4D arrays, but I think that is not too important right now. The run time of do_work is typically 10-100 ms.

The challenge here is to read the arrays from disk and pass them to do_work in an efficient and parallel way. These arrays are typically too large to fit in memory (all dimensions are in the order of 500-2000). The problem is perfectly parallel in the x and y dimensions – it does not matter which order the 1D slices are read and results written to result[j, i].

My initial implementation to solve this was to use Threads for the parallelism. Here’s a simplified snippet:

function my_work()
    arr1 = read_var(IN_FILE, "arr1")
    arr2 = read_var(IN_FILE, "arr2")
    arr3 = read_var(IN_FILE, "arr3")
    nz, ny, nx = size(arr1)
    result = zeros(ny, nx)
    Threads.@threads for ij in CartesianIndices((ny, nx))
        j, i = Tuple(ij)
        result[j, i] = do_work(arr1[:, j, i], arr2[:, j, i], arr3[:, j, i])
    h5write(OUTPUT_FILE, "result", result)

I find this generally works ok, but the speedup from Threads is far from the perfect scaling (serial time / number of threads). Looking at the CPU usage and timings, it seems that when using more than 8 threads there are periods when the computation gets “stuck” (threads each using 100% CPU, but progress is very slow).

As for reading the files, I’ve tried two different approaches: using Mmap (reading only as needed) and with HDF5 (reading whole arrays at once with h5read). The Mmap option is much slower, but often I cannot read the entire arrays into memory. Using h5read also takes a long overhead to read the full arrays in the beginning.

I would really appreciate some advice on how to make this type of workload more efficient. Currently I am using HDF5 for both input and output files, but it should be possible to use any other format if better. And other suggestions for simple parallelism other than Threads are also very welcome!

I’m not sure about the best way to handle out-of-memory data, but you can at least try to avoid unnecessary extra allocations when calling do_work, by using views instead of slices:

result[j, i] = @views do_work(arr1[:, j, i], arr2[:, j, i], arr3[:, j, i])

Can you use mmap to load batches of data and then operating on them? Something like

mm1, mm2, mm3 = # setup memory mapping of data
arr1, arr2, arr3 = # initialize array of correct size
@views for j in axes(arr1, 3)  # the @views macro applies to the entire loop
    arr1 .= mm1[:, :, j]  # load data directly into arrays due to @views
    arr2 .= mm2[:, :, j]
    arr3 .= mm3[:, :, j]
    Threads.@threads for i in axes(arr1, 2)
        result[i, j] = do_work(arr1[:, i], arr2[:, i], arr3[:, i])  # @views macro applies here too.

It’s also quite important how do_work is implemented. If that is somehow inefficient, and especially if it allocates temporary memory, then threads scaling will be poor. Optimizing your single-threaded performance is normally the first step towards getting good parallel performance.

Thanks for the suggestion of @views. I agree that optimising the single-threaded performance is the first step, and I think in this case that was already done to a good extent. (And in any case is a separate question from this.) What I was really after was a good procedure to deal with these kinds of data accesses, ideally with out of memory data being handled in a parallel way.

Any tips on using Mmap with threads? Will this lead to contention in reading the file if many threads are accessing different parts of the file at the same time? Is there a better file format supported by Julia for this kind of parallel I/O?

I have no experience with that. But if each dimension is long, and do_work does some significant work, then threading the inner loop may be quite beneficial anyway.