How to utilize all cores when operating on large matrices


I have a relatively large matrix (approximate size 50000x50000) that I need to bulk divide by numbers in a separate matrix of the same shape. The code does something simple like

V = A ./ B

It seems to utilize only a single core. Is there any way to make it use some number of cores to speed things up?

I tried setting JULIA_NUM_THREADS to 4 but it doesn’t seem to do anything.


I am not sure you will get any good speedup, since an operation like this tends to be memory bound.
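To put a rough number on "memory bound", here is a back-of-envelope sketch. It assumes Float64 elements and a made-up sustained bandwidth of 20 GB/s; neither figure comes from the question, and the function name is only for illustration.

```julia
# Lower bound on run time if memory traffic alone limits V = A ./ B.
# elsize and bandwidth are assumptions, not measurements.
function min_seconds_memory_bound(n; elsize = sizeof(Float64), bandwidth = 20e9)
    bytes_per_matrix = n * n * elsize   # one n-by-n matrix
    traffic = 3 * bytes_per_matrix      # read A, read B, write V
    return traffic / bandwidth
end

min_seconds_memory_bound(50_000)        # 6e10 bytes / 2e10 bytes/s = 3.0 seconds
```

Under these assumptions each matrix is 20 GB, so the arrays have to fit in RAM in the first place, and adding cores only helps if they raise the usable bandwidth.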

You could perhaps play around with stuff like but that is only for multiplication, and from my testing, it seemed slower anyway.


IIUC (not an expert), this strongly depends on the architecture, and some machines have a memory bandwidth that scales with the number of cores.

edit: also the question is valid for more compute-bound kernels


See for threaded broadcast in particular.



For strongly memory-bound problems (like your previous post on array sum), a typical desktop may still give some speed-up (2-4x) if it has several (2-4) memory channels. That is usually not the case on a laptop. Also, a floating-point division takes several cycles, so multithreading could be worthwhile here.

In any case, I understood (from the previous link) that multithreading is not implicit for broadcast (which is a good thing, because it is difficult to anticipate whether your code will be nested in another multithreaded context).
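Since broadcast does not thread by itself, one option is to write the loop explicitly. This is a minimal sketch, not a tuned implementation; `threaded_divide!` is a hypothetical helper name, and the small matrices are just for the demo. Note that JULIA_NUM_THREADS has to be set before Julia starts, and `Threads.nthreads()` shows how many threads the session actually has.

```julia
# Sketch: explicit multithreaded elementwise division (hypothetical helper name).
function threaded_divide!(V::Array, A::Array, B::Array)
    size(V) == size(A) == size(B) || throw(DimensionMismatch("sizes must match"))
    # Each thread gets a contiguous block of linear indices.
    Threads.@threads for i in eachindex(V, A, B)
        @inbounds V[i] = A[i] / B[i]
    end
    return V
end

println("threads available: ", Threads.nthreads())

A = rand(1000, 1000)
B = rand(1000, 1000) .+ 1.0            # keep the divisors away from zero
V = threaded_divide!(similar(A), A, B)
```

Whether this beats `A ./ B` depends on the machine's memory bandwidth, so it is worth timing both on the real data.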

A possible solution, if you have a strong interest in defining A and B separately, would be to fuse the division with the algorithm that uses V. A lazy definition of V could be nice :wink:
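As an illustration of the fusion idea, the sketch below never stores V at all. The consumer here is a plain sum, which is only a stand-in for whatever actually uses V, and `sum_of_quotients` is a hypothetical name.

```julia
# Sketch: fuse the division into the consumer instead of materializing V = A ./ B.
function sum_of_quotients(A::Array, B::Array)
    size(A) == size(B) || throw(DimensionMismatch("A and B must have the same size"))
    total = zero(eltype(A)) / one(eltype(B))   # zero of the quotient's type
    @inbounds @simd for i in eachindex(A, B)
        total += A[i] / B[i]                   # divide and consume in one pass
    end
    return total
end
```

This reads A and B once and writes nothing, so it removes a third of the memory traffic of the unfused version. A generator such as `sum(a / b for (a, b) in zip(A, B))` is a lazier way to express the same thing.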

The curve corresponding to an MT+simd version of your sum:


And the corresponding snippet:

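function mysum_simd_threads(a::Vector)
    # Chunk bounds are reconstructed here (one chunk per thread is an assumption).
    n = length(a); nchunk = Threads.nthreads(); chunksize = n ÷ nchunk
    partials = zeros(eltype(a), nchunk)  # one slot per chunk: no shared accumulator
    Threads.@threads for c = 1:nchunk
        imin, imax = (c - 1) * chunksize + 1, c * chunksize
        stotal = zero(eltype(a))
        @simd for i = imin:imax
            @inbounds stotal += a[i]
        end
        partials[c] = stotal
    end
    total = sum(partials)
    for i = nchunk*chunksize+1:n; total += a[i]; end  # leftover elements
    return total
end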