A = B*C. Try it!
A = rand(3000,3000); B = rand(3000,3000); A*B
Open up your system's resource monitor and watch the CPU usage while it runs. Notice that this is using OpenBLAS (or MKL) under the hood, which is already multithreaded. By default, the number of threads matches the number of physical cores. You can override this default.
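A minimal sketch of changing the BLAS thread count, assuming Julia 1.6 or later (where `BLAS.get_num_threads` is available in the `LinearAlgebra` standard library). The value 2 and the 500×500 matrices are arbitrary choices for illustration, not recommendations:

```julia
using LinearAlgebra

# Report how many threads the BLAS backend is using right now.
println("BLAS threads before: ", BLAS.get_num_threads())

# Assumption: 2 is an arbitrary illustrative value; choose one that suits your machine.
BLAS.set_num_threads(2)
println("BLAS threads after: ", BLAS.get_num_threads())

# The multiplication itself is unchanged; only the thread pool behind it differs.
A = rand(500, 500); B = rand(500, 500)
C = A * B
```

Timing `A * B` with different thread counts (for example with `@time`) is a quick way to see whether the extra threads help for your matrix sizes.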
For distributed computations over multiple computers, you’ll need to use multiprocessing. For that, the simplest way to get a usable solution is to make A and B MPIArrays from MPIArrays.jl.
A*B between two MPI arrays will be parallelized via MPI. And to complete this response, you can use GPUArrays.jl/CuArrays.jl to do
_A = cu(A); _B = cu(B); _A*_B
to parallelize it on the GPU.