Combining multi-processing and multi-threading

I have a complex numerical simulation which I want to parallelise on large machines. I struggle a bit with the optimal parallelisation strategy, and I am hoping to get some hints from the community on what to try and what to avoid.

To understand Julia’s multi-processing and multi-threading capabilities, I use a simplified workload which calculates exp.(X*Y) for random square matrices X and Y (of dimension 3072).
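For reference, the setup for that workload looks roughly like this (variable names are just illustrative):

n = 3072
X = rand(n, n)
Y = rand(n, n)
Z = zeros(n, n)        # output buffer filled by the functions below
Z_ref = exp.(X * Y)    # reference result used to check correctness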

My multi-threading implementation of that workload is

function mat_mult_multi_threaded!(Z, X, Y)
    Threads.@threads :static for iter in 0:prod(size(Z))-1
        i = (iter % size(Z, 1)) + 1  # row index from the linear index (column-major)
        j = (iter ÷ size(Z, 1)) + 1  # column index from the linear index
        z = 0.0
        for k in axes(X, 2)
            z += X[i, k] * Y[k, j]
        end
        Z[i, j] = exp(z)
    end
end

My multi-processing implementation of the workload is

@everywhere function mat_mult_chunk!(Z, chunk, X, Y)
    W = zeros(length(chunk)) # local work buffer; results are copied into Z in a second pass
    for (idx, iter) in enumerate(chunk)
        i = (iter % size(Z, 1)) + 1
        j = (iter ÷ size(Z, 1)) + 1
        z = 0.0
        for k in axes(X, 2)
            z += X[i, k] * Y[k, j]
        end
        W[idx] = exp(z)
    end
    for (idx, iter) in enumerate(chunk)
        i = (iter % size(Z, 1)) + 1
        j = (iter ÷ size(Z, 1)) + 1
        Z[i, j] = W[idx]
    end
end

function mat_mult_distributed!(Z, X, Y)
    n_iters = prod(size(Z))
    n_workers = nworkers()
    chunk_size = n_iters ÷ n_workers
    if n_iters % n_workers > 0
        chunk_size += 1
    end
    idx_chunks = Iterators.partition(0:n_iters-1, chunk_size)
    Z_ = SharedArray{Float64}(size(Z))
    @sync for (chunk, pid) in zip(idx_chunks, workers())
        @async remotecall_fetch(mat_mult_chunk!, pid, Z_, chunk, X, Y)
    end
    Z .= Z_
end
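For completeness, the distributed version assumes the usual setup has been done beforehand; roughly (the worker count is just an example):

using Distributed
addprocs(24)                      # start worker processes first
@everywhere using SharedArrays    # SharedArray must be known on all processes

with the @everywhere definition of mat_mult_chunk! coming after addprocs, so that the workers see it.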

I run the calculations on an AWS hpc6a.48xlarge instance. The machine is supposed to have 96 vCPUs (without hyperthreading).

Now, I see the following run times:

[figure "matmult": run time vs. number of threads/processes for both implementations]

1/ I find it surprising that multi-processing is ~10% faster than multi-threading for a small number of threads/processes. Any thoughts on why this is the case?

2/ Scaling of multi-processing deteriorates beyond 24 processes. Is this expected behaviour due to the increased communication overhead?

3/ Suppose I want to combine the better performance of multi-processing at small process counts with the scaling of multi-threading. Would it be a viable approach to add multi-threading inside mat_mult_chunk!, which is called via multi-processing?

Any comments and suggestions are appreciated.


I’d guess 2/ depends on memory bandwidth, cache size, and other architectural details. The short loop, with one or two instructions of computation and two memory accesses, will create considerable memory pressure. This will in general hurt performance as the number of threads increases. In addition, there could be communication overhead, as you suggest.

For 1/ I don’t know. It could be overhead in the creation of threads: your inner loop executes only a few thousand instructions, and thread management could involve a few hundred instructions. But I’m not sure. It could also be that threads are scheduled differently than processes (though on e.g. Linux there is typically no difference).

For 3/ If there is a memory bandwidth limitation, you will not gain much by combining threads and processes. It might be a better idea to alleviate the memory pressure by blocking the matrix operations. In general, use linear algebra libraries, either via the LinearAlgebra standard library or BLAS directly. These libraries are highly optimized for memory access, which can yield substantial performance improvements compared to a plain translation of formulas into code.
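As a rough sketch of what that could look like for this particular workload (the function name is made up):

using LinearAlgebra

function mat_mult_blas!(Z, X, Y)
    mul!(Z, X, Y)    # BLAS matrix product, written into Z in place
    Z .= exp.(Z)     # element-wise exp without an extra allocation
    return Z
end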


Since you didn’t post the scripts that run your functions, here are some basic questions:

  • Did you use ThreadPinning.jl?
  • Did you set the number of BLAS threads to 1 to disable threading there? Or check that it is 1 by default? If it is larger than 1 by default, that could explain why the multi-process curve starts below the threaded curve (uses more cores) and ends up above it (oversubscription)
  • What exactly is measured here? Only the execution time of the computation? Including or excluding compilation time?

Thanks a lot for your feedback!

Re using vectorised and BLAS routines: I agree, and I do use them in my actual simulation. But then I also get quite a lot of allocations, and garbage collection seems to slow things down with multi-threading. (But that’s a different challenge.)

Re ThreadPinning.jl: Thanks a lot for pointing me to this tool. This looks quite helpful. Will try this out.

Re BLAS threads: Yes, I do set export OMP_NUM_THREADS=1 and export OPENBLAS_NUM_THREADS=1. This should avoid oversubscription.

Re time measurement: I exclude compilation time by including a warm-up call that triggers compilation. Then I do the actual run and measure run time via @elapsed. My understanding is that this measures run time, which is what I am interested in.
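In essence, the measurement looks like this (simplified):

mat_mult_multi_threaded!(Z, X, Y)                  # warm-up call, triggers compilation
t = @elapsed mat_mult_multi_threaded!(Z, X, Y)     # timed run
println("run time: $t s")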

Here is also the script which I use to run the workload:

MatMult.jl (8.6 KB)

Again, thanks for your help!

This should be fine. I checked that spawned worker processes inherit the environment. To be extra sure you could just do

@everywhere begin
    using LinearAlgebra: BLAS
    BLAS.set_num_threads(1)
end

This is more of a general comment, perhaps redundant with what you already know. You can go lower level in the BLAS calls and try to use the in-place operations (mul! instead of *, for example). Then you will have to preallocate the resulting matrices. For parallelization, it is better to preallocate once per task, where a task comprises a chunk of many operations, and parallelization is done at the chunk level. Also, depending on your problem, it might even be useful to copy the input arrays for each task, so that tasks do not share memory accesses.
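A rough sketch of that pattern with plain Base threads, assuming the work can be expressed as a list of independent matrix products of equal shape (all names here are made up):

using LinearAlgebra

# Illustrative only: `ops` stands for whatever independent operations the
# simulation performs, here a vector of (A, B) matrix pairs. Each task
# processes one chunk of operations and preallocates its buffer once.
function process_chunks(ops, nchunks)
    results = Vector{Float64}(undef, length(ops))
    chunklen = cld(length(ops), nchunks)
    tasks = map(Iterators.partition(eachindex(ops), chunklen)) do chunk
        Threads.@spawn begin
            A1, B1 = first(ops)
            buf = zeros(size(A1, 1), size(B1, 2))   # one work buffer per task
            for i in chunk
                A, B = ops[i]
                mul!(buf, A, B)                     # in-place product, no allocation
                results[i] = sum(exp, buf)          # some reduction of the result
            end
        end
    end
    foreach(wait, tasks)
    return results
end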

OhMyThreads.jl and, at a somewhat lower level, ChunkSplitters.jl are packages that facilitate all of that.


Thanks for your comments. I would like to follow up on two points.

I understand that this is important for multi-processing. But are you saying this should be done for multi-threading as well?

I saw OhMyThreads.jl a while ago, but I am not sure I understand its purpose. Is it an alternative implementation of Base.Threads? Or is it an interface to simplify multi-threaded calculations?

It may be important. If the computation itself involves many accesses to shared arrays, parallelization may be impaired by concurrent access to the same memory region. If the cost of copying the data (once per task/chunk) is not significant, that may improve the scaling.

It is an interface to simplify the calculations. One thing it does, under the hood, is split the workload into chunks, and there are interfaces to facilitate the creation of task-local arrays (as many as the number of chunks).
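For instance, a task-local buffer can be declared with something like this (from memory, so check the current OhMyThreads.jl docs for the exact syntax; n_ops is a placeholder):

using OhMyThreads: @tasks, @local

@tasks for i in 1:n_ops                 # n_ops: number of independent operations
    @local buf = zeros(3072, 3072)      # allocated once per task, reused across its iterations
    # ... compute into buf in place and store the result for iteration i ...
end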

edit: The documentation of OhMyThreads.jl and ChunkSplitters.jl contains very interesting discussions and insights into optimal patterns for workload balancing and the use of shared buffers (contributions from Carsten Bauer).


Thanks again for pointing me to ThreadPinning.jl.

I repeated my calculations with pinthreads(:cores). This improved the multi-threading performance quite a bit:

[figure "matmult_w_pinning": run time vs. number of threads/processes with pinthreads(:cores)]

In particular, I now get better scaling up to all 96 available CPUs.

Pinning the threads also helped for my complex simulation.
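For reference, the pinning itself is just this, done once before the timed runs:

using ThreadPinning
pinthreads(:cores)    # pin each Julia thread to a separate physical core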
