Parallel computation of multiplication of large matrices

A common task in many programs is to multiply two large matrices, i.e. A = B*C. Is there a way to do this that utilizes all the processors? This is a task that is, in theory, very easy to parallelize over any number of threads.

I would appreciate any help on this.

Over threads? A = B*C. Try it!

A = rand(3000, 3000); B = rand(3000, 3000)  # two large random matrices
A*B  # BLAS multithreads this multiplication automatically

Open up your resource manager and see the usage. Notice that this is using OpenBLAS (or MKL) under the hood, which is already multithreaded. The number of threads matches the number of physical cores by default. You can override this with

using LinearAlgebra
BLAS.set_num_threads(i)  # i = the number of BLAS threads you want
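
For example, a quick (and unscientific) way to see the effect is to time the same multiplication with different thread settings; the values 1 and 4 below are just illustrative, pick whatever matches your machine:

using LinearAlgebra

A = rand(3000, 3000); B = rand(3000, 3000);

BLAS.set_num_threads(1)  # force single-threaded BLAS as a baseline
@time A*B;

BLAS.set_num_threads(4)  # e.g. one thread per physical core
@time A*B;               # should be noticeably faster on a multicore machine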

For distributed computations over multiple computers, you’ll need to use multiprocessing. For that, the simplest way to get a usable solution is to make A and B MPIArrays from MPIArrays.jl:
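
Something along these lines should work (this is a sketch based on my recollection of the MPIArrays.jl README, so double-check the MPIArray constructor, forlocalpart!, and sync against the package’s current documentation):

using MPI, MPIArrays, Random

MPI.Init()
N = 3000

# Distributed N×N matrices, partitioned across the MPI ranks
A = MPIArray{Float64}(N, N)
B = MPIArray{Float64}(N, N)

# Fill each rank's local block with random values
forlocalpart!(rand!, A)
forlocalpart!(rand!, B)
sync(A, B)  # make sure every rank finished initializing

C = A*B  # distributed matrix multiplication

MPI.Finalize()

You would launch this across processes/machines with something like mpirun -np 4 julia script.jl.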

Then A*B between two MPI arrays will be parallelized via MPI. And to complete this response, you can use GPUArrays.jl/CuArrays.jl to do

using CuArrays
_A = cu(A); _B = cu(B)  # copy A and B to the GPU
_A*_B                   # the multiplication runs on the GPU

to parallelize it on the GPU.
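
If you need the result back on the CPU afterwards (say, to save it or pass it to CPU-only code), converting the CuArray with Array copies it back to host memory:

_C = _A*_B     # result lives on the GPU
C = Array(_C)  # copy it back to a regular Matrix in host memory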


I thought OpenBLAS uses a maximum of 16 threads? (at least for the precompiled Julia versions)

Yes, there is the caveat that if you have more than 16 physical cores, the default will be capped at 16, so you’ll need to build Julia from source to go higher. I assume that’s not a very common concern.

Thanks, that was very very enlightening

Ooooh… I might have access to an ARM machine with 50 cores. I might not.
Might be fun to see this behaviour.

Actually I did try to build Julia from source on that machine and it might have fork-bombed…
Not sure if it was Julia or the CEPH I was trying to compile at the same time.