I’m trying to do parallel matrix multiplication in a `for` loop. I’m using `Threads.@threads`, but this also applies to a GPU version of the code and maybe a distributed (MPI-like) version in the future.
Within my code I’m building a large matrix `A` that will be used for a bunch of stuff later. But I also build a large matrix `B` which is multiplied by a known vector `v`, resulting in a vector `b`, so holding that large matrix in memory is wasteful. All I need is `b`.
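For concreteness, the structure I have now looks roughly like this minimal sketch (the sizes and the `a_column`/`b_column` functions are placeholders I made up for illustration, not my real code):

```julia
using Base.Threads

# Stand-ins for the real per-column computations (hypothetical names).
a_column(j, n) = fill(Float64(j), n)
b_column(j, n) = fill(1.0 / j, n)

n = 2_000
v = rand(n)                       # the known vector

A = Matrix{Float64}(undef, n, n)
B = Matrix{Float64}(undef, n, n)  # large, and only ever used for B * v

@threads for j in 1:n
    A[:, j] = a_column(j, n)      # A is naturally built column by column
    B[:, j] = b_column(j, n)
end

b = B * v                         # all I actually need from B
```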
I can think of a few ways of doing this:
- Parallelize building `A` and `B`, and at the end compute `b = B*v`: bad for memory management.
- Parallelize building `A` and build rows of `B` in each thread, multiplying them by `v` within the for loop and getting pieces of `b` (see the sketch after this list): not super elegant, and not great if I want to do 2D parallelization on the GPU. Also not great because computing `A` by columns takes fewer calculations than by rows.
- Parallelize building `A` and, for each position, add to elements of `b` using atomics: I’m hesitant to do this, as the documentation for CUDA.jl and Julia in general says that atomics are a work in progress and are kind of slow.
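To make the second option concrete, here is a minimal sketch of what I mean (again with made-up placeholder functions `a_row` and `b_row` standing in for the real computation):

```julia
using Base.Threads
using LinearAlgebra: dot

# Stand-ins for the real per-row computations (hypothetical names).
a_row(i, n) = fill(Float64(i), n)
b_row(i, n) = fill(1.0 / i, n)

n = 2_000
v = rand(n)

A = Matrix{Float64}(undef, n, n)
b = Vector{Float64}(undef, n)     # the result of B * v, filled piece by piece

@threads for i in 1:n
    A[i, :] = a_row(i, n)         # forces A to be built by rows here
    # Build row i of B, reduce it against v immediately, and discard it.
    b[i] = dot(b_row(i, n), v)
end
```

This never materializes `B`, but it ties the loop to rows, which is what makes building `A` more expensive and the 2D GPU version awkward.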
Am I missing a nicer way to do this, or are these the options I have, without having to code something really fancy?
Thanks a lot!