I’m trying to do parallel matrix multiplication in a `for` loop. I’m using `Threads.@threads`, but this also applies to a GPU version of the code and maybe a distributed (MPI-like) version in the future.
Within my code I’m building a large matrix `A` that will be used for a bunch of stuff later. But I also build a large matrix `B` which is multiplied by a known vector `v`, resulting in a vector `b`, so holding that large matrix in memory is wasteful. All I need is `b`.
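For concreteness, the structure I have now looks roughly like this minimal sketch (the sizes and the `a_column`/`b_column` functions are placeholders I made up for illustration, not my real code):

```julia
using Base.Threads

# Stand-ins for the real per-column computations (hypothetical names).
a_column(j, n) = fill(Float64(j), n)
b_column(j, n) = fill(1.0 / j, n)

n = 2_000
v = rand(n)                       # the known vector

A = Matrix{Float64}(undef, n, n)
B = Matrix{Float64}(undef, n, n)  # large, and only ever used for B * v

@threads for j in 1:n
    A[:, j] = a_column(j, n)      # A is naturally built column by column
    B[:, j] = b_column(j, n)
end

b = B * v                         # all I actually need from B
```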
I can think of a few ways of doing this:
- Parallelize building `A` and `B`, and at the end compute `b = B*v`: bad for memory management.
- Parallelize building `A` and build rows of `B` in each thread, multiplying them by `v` within the for loop and getting pieces of `b` (see the sketch after this list): not super elegant, and not great if I want to do 2D parallelization on the GPU. Also not great because computing `A` by columns takes fewer calculations than by rows.
- Parallelize building `A` and, for each position, add to elements of `b` using atomics: I’m hesitant to do this, as the documentation for CUDA.jl and Julia in general says that atomics are a work in progress and are kind of slow.
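To make the second option concrete, here is a minimal sketch of what I mean (again with made-up placeholder functions `a_row` and `b_row` standing in for the real computation):

```julia
using Base.Threads
using LinearAlgebra: dot

# Stand-ins for the real per-row computations (hypothetical names).
a_row(i, n) = fill(Float64(i), n)
b_row(i, n) = fill(1.0 / i, n)

n = 2_000
v = rand(n)

A = Matrix{Float64}(undef, n, n)
b = Vector{Float64}(undef, n)     # the result of B * v, filled piece by piece

@threads for i in 1:n
    A[i, :] = a_row(i, n)         # forces A to be built by rows here
    # Build row i of B, reduce it against v immediately, and discard it.
    b[i] = dot(b_row(i, n), v)
end
```

This never materializes `B`, but it ties the loop to rows, which is what makes building `A` more expensive and the 2D GPU version awkward.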
Am I missing a nicer way to do this, or are these the options I have, without having to code something really fancy?
Thanks a lot!