Nice way to do parallel matrix multiplication

I’m trying to do parallel matrix multiplication inside a for loop. I’m currently using Threads.@threads, but the question also applies to a GPU version of the code and possibly a distributed (MPI-like) version in the future.
Within my code I’m building a large matrix A that will be used for a bunch of stuff later. But I also build a large matrix B that is only ever multiplied by a known vector v, giving the vector b, so holding that large matrix in memory is wasteful. All I need is b.
I can think of a few ways of doing this:

  • Build B along with A, and at the end compute b = B*v: bad for memory.
  • Parallelize building A, and have each thread build rows of B, multiplying them by v inside the for loop to get pieces of b: not super elegant, and not great if I want to do 2D parallelization on the GPU. It’s also not ideal because computing A by columns requires fewer calculations than by rows.
  • Build A and, for each position, add to elements of b using atomics: I’m hesitant to do this because the documentation for CUDA.jl, and for Julia in general, says that atomics are a work in progress and can be slow.

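For reference, the second option above can be sketched roughly like this (a minimal threaded version; `build_B_entry` is a hypothetical stand-in for however the entries of B are actually computed):

```julia
# Hypothetical kernel producing entry B[i, j] on the fly.
build_B_entry(i, j) = 1.0 / (i + j)   # placeholder definition

function rowwise_b(n::Int, m::Int, v::AbstractVector)
    b = zeros(n)
    Threads.@threads for i in 1:n
        # Each thread owns row i of B, so there is no write
        # contention on b and no atomics are needed.
        acc = 0.0
        for j in 1:m
            acc += build_B_entry(i, j) * v[j]
        end
        b[i] = acc
    end
    return b
end
```

This is exactly the row-wise scheme I described: it never materializes B, but it forces the row-major traversal, which is what makes it awkward if building A column-by-column is cheaper.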
Am I missing a nicer way to do this, or are these the options I have, without having to code something really fancy?
Thanks a lot!