Nice way to do parallel matrix multiplication

Hello,
I’m trying to do parallel matrix multiplication inside a for loop. Right now I’m using Threads.@threads, but the question also applies to a GPU version of the code and maybe a distributed (MPI-like) version in the future.
Within my code I’m building a large matrix A that will be used for a bunch of stuff later. I also build a large matrix B that is only ever multiplied by a known vector v to produce a vector b, so holding that whole matrix in memory is wasteful; all I actually need is b.
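For concreteness, here’s a stripped-down sketch of what I’m doing now (entryA and entryB are placeholder stand-ins for my real element formulas):

```julia
# serial baseline: build both matrices, then do a single matvec
entryA(i, j) = Float64(i * j)   # placeholder
entryB(i, j) = Float64(i + j)   # placeholder

m, n = 1000, 1000
v = rand(n)

A = [entryA(i, j) for i in 1:m, j in 1:n]   # kept around, used later
B = [entryB(i, j) for i in 1:m, j in 1:n]   # large, but only needed for one product
b = B * v                                    # all I actually need
```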
I can think of a few ways of doing this:

  • Build B alongside A and compute b = B*v at the end (essentially the sketch above): bad for memory, since B is only ever needed for that one product.
  • Parallelize over rows of A and, in each thread, build the corresponding row of B and multiply it by v inside the loop, accumulating the pieces of b (first sketch after this list): not super elegant, and not great if I want 2D parallelization on the GPU. It’s also not ideal because computing A by columns takes fewer calculations than by rows.
  • Build A and, for each position, add the corresponding contribution to the elements of b using atomics (second sketch below): I’m hesitant to do this because the documentation for both CUDA.jl and Julia itself says atomics are a work in progress and can be slow.
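Here’s roughly what I mean by the second option, as a minimal threaded sketch (same placeholder entryA/entryB as above); each thread owns an entry of b, so there are no write conflicts:

```julia
using LinearAlgebra

entryA(i, j) = Float64(i * j)   # placeholder
entryB(i, j) = Float64(i + j)   # placeholder

m, n = 1000, 1000
v = rand(n)
A = Matrix{Float64}(undef, m, n)
b = Vector{Float64}(undef, m)

Threads.@threads for i in 1:m
    rowB = Vector{Float64}(undef, n)   # thread-local row buffer, never stored
    for j in 1:n
        A[i, j] = entryA(i, j)         # note: row-by-row fights Julia's column-major layout
        rowB[j] = entryB(i, j)
    end
    b[i] = dot(rowB, v)                # fold the row into b immediately
end
```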
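And the third option with atomics would look something like this on the CPU, going column-wise to match how A is cheapest to build (again just a sketch with the placeholder kernels; a GPU version would use CUDA.@atomic instead):

```julia
entryA(i, j) = Float64(i * j)   # placeholder
entryB(i, j) = Float64(i + j)   # placeholder

m, n = 1000, 1000
v = rand(n)
A = Matrix{Float64}(undef, m, n)

# one Atomic cell per entry of b, so concurrent adds from different columns are safe
b_acc = [Threads.Atomic{Float64}(0.0) for _ in 1:m]

Threads.@threads for j in 1:n
    for i in 1:m
        A[i, j] = entryA(i, j)
        Threads.atomic_add!(b_acc[i], entryB(i, j) * v[j])
    end
end
b = getindex.(b_acc)   # unwrap to a plain Vector{Float64}
```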

Am I missing a nicer way to do this, or are these my only options short of coding something really fancy?
Thanks a lot!