Hello,
I’m trying to do parallel matrix multiplication in a for loop. I’m using `Threads.@threads`, but this also applies to a GPU version of the code and maybe a distributed (MPI-like) version in the future.

Within my code I’m building a large matrix `A` that will be used for a bunch of stuff later. But I also build a large matrix `B` which is multiplied by a known vector `v`, resulting in a vector `b`, so holding that large matrix in memory is wasteful. All I need is `b`.
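Schematically, what I’m doing now looks something like this (with placeholder functions standing in for the real per-column computations):

```julia
# Toy sizes and placeholder column builders; the real ones are expensive.
n, m = 1_000, 2_000
compute_A_column(j) = rand(n)        # placeholder
compute_B_column(j) = rand(n)        # placeholder

v = rand(m)                          # known vector
A = Matrix{Float64}(undef, n, m)     # needed later, has to stay in memory
B = Matrix{Float64}(undef, n, m)     # only ever used to form B*v

for j in 1:m
    A[:, j] .= compute_A_column(j)
    B[:, j] .= compute_B_column(j)
end

b = B * v                            # all I actually need from B
```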
I can think of a few ways of doing this:

- Build `B` along with `A` and compute `b = B*v` at the end (essentially the schematic above): bad for memory management.
- Parallelize building `A` and build rows of `B` in each thread, multiplying them by `v` within the for loop and getting pieces of `b` (see the first sketch after this list): not super elegant, and not great if I want to do 2D parallelization on the GPU. Also not great because computing `A` by columns takes fewer calculations than by rows.
- Build `A` and, for each position, add to the elements of `b` using atomics (see the second sketch after this list): I’m hesitant to do this, as the documentation for CUDA.jl and for Julia in general says that atomics are a work in progress and kind of slow.
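To make the second option concrete, here is roughly what I have in mind (again with placeholder functions; in my case the row-wise builders are the expensive direction for `A`):

```julia
using Base.Threads, LinearAlgebra

n, m = 1_000, 2_000
compute_A_row!(row, i) = (row .= rand(m))   # placeholder
compute_B_row!(row, i) = (row .= rand(m))   # placeholder

v = rand(m)
A = Matrix{Float64}(undef, n, m)
b = Vector{Float64}(undef, n)

@threads for i in 1:n
    Brow = Vector{Float64}(undef, m)   # row i of B lives only inside this iteration
    compute_A_row!(view(A, i, :), i)   # fill row i of A (expensive by rows!)
    compute_B_row!(Brow, i)            # fill row i of B, never stored globally
    b[i] = dot(Brow, v)                # each iteration owns b[i]: no data race
end
```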
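And the third option would look something like this, if I understand `Threads.Atomic` correctly (with `CUDA.@atomic` presumably playing the analogous role in the GPU version):

```julia
using Base.Threads

n, m = 1_000, 2_000
compute_A_column!(col, j) = (col .= rand(n))   # placeholder
compute_B_column!(col, j) = (col .= rand(n))   # placeholder

v = rand(m)
A = Matrix{Float64}(undef, n, m)
b = [Atomic{Float64}(0.0) for _ in 1:n]        # one atomic cell per entry of b

@threads for j in 1:m
    Bcol = Vector{Float64}(undef, n)           # column j of B, never stored globally
    compute_A_column!(view(A, :, j), j)        # fill column j of A (the cheap direction)
    compute_B_column!(Bcol, j)
    for i in 1:n
        atomic_add!(b[i], Bcol[i] * v[j])      # every thread contends on the same b
    end
end

b_vec = getindex.(b)                           # unwrap to a plain Vector{Float64}
```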
Am I missing a nicer way to do this, or are these my only options, short of coding something really fancy?
Thanks a lot!