A common task in many programs is to multiply two large matrices, namely A = B*C. Is there a way to do it so that it utilizes all the processors? This is a task that is, theoretically, very easy to parallelize over any number of threads.
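For concreteness, something like this (the sizes are just illustrative):

using LinearAlgebra
B = rand(5000, 5000)   # illustrative size
C = rand(5000, 5000)
@time A = B * C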
Open up your resource monitor and watch the CPU usage: this already uses OpenBLAS (or MKL) under the hood, which is multithreaded. The number of BLAS threads matches the number of physical cores by default. You can override it with
using LinearAlgebra
BLAS.set_num_threads(i)   # i = number of threads you want BLAS to use
For distributed computations over multiple computers, you’ll need to use multiprocessing. For that, the simplest way to get a usable solution is to make A and B be MPIArrays from MPIArrays.jl:
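Roughly like this, going from memory of the MPIArrays.jl README, so treat it as a sketch and check the package docs:

using MPI, MPIArrays, Random
MPI.Init()
N = 5000                        # illustrative size
A = MPIArray{Float64}(N, N)     # matrices distributed across the MPI ranks
B = MPIArray{Float64}(N, N)
forlocalpart!(rand!, A)         # fill each rank's local block with random values
forlocalpart!(rand!, B)
sync(A, B)                      # wait until every rank has finished initializing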
Then A*B between two MPI arrays will be parallelized via MPI. And to complete this response, you can use GPUArrays.jl/CuArrays.jl to do the multiplication on a GPU:
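Something along these lines (a sketch; note that cu converts to Float32 by default, and the multiply dispatches to CUBLAS):

using CuArrays
gA = cu(A)        # copy to the GPU; cu converts the element type to Float32
gB = cu(B)
gC = gA * gB      # the multiply runs on the GPU via CUBLAS
C = Array(gC)     # copy the result back to host memory if you need it there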
Yes, there is the caveat that if you have more than 16 physical cores, the default will be capped at 16, so you'll need to build Julia from source to go higher. I assume that's not a very standard concern.
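If you want to check what you're actually getting (BLAS.get_num_threads is only available on newer Julia versions, as far as I remember):

using LinearAlgebra
BLAS.get_num_threads()   # current BLAS thread count (newer Julia versions)
Sys.CPU_THREADS          # logical CPU threads the machine reports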
Ooooh… I might have access to an ARM machine with 50 cores. I might not.
Might be fun to see this behaviour.
Actually I did try to build Julia from source on that machine and it might have fork bombed…
Not sure if it was Julia or the CEPH I was trying to compile at the same time.