Efficient ways to implement a (distributed) matrix-matrix product?

I have a huge sparse matrix A and a tall-and-skinny matrix B (a set of multiple vectors). I want to partition them into pieces for a parallel, distributed-memory matrix-matrix (matrix-vectors) multiplication. Are there any well-known and efficient ways to do this in Julia?
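For concreteness, one common way to split this kind of product is a 1D row-block partition: each process owns a contiguous block of rows of A and computes the matching rows of C = A * B. Below is a minimal serial sketch of that idea, using only the standard libraries. The helper names `row_blocks` and `blocked_matmul` are placeholders made up for illustration and are not part of Distributed.jl, DistributedArrays.jl, or PartitionedArrays.jl. The loop over blocks stands in for what separate processes would do, and it assumes every process can see all of B, which a real distributed code would have to communicate.

```julia
using SparseArrays, LinearAlgebra, Random

# Split 1:n into `nparts` contiguous, nearly equal index ranges.
# (Hypothetical helper, not from any package.)
function row_blocks(n::Integer, nparts::Integer)
    bounds = [div(i * n, nparts) for i in 0:nparts]
    return [(bounds[i] + 1):bounds[i + 1] for i in 1:nparts]
end

# Compute C = A * B one row block at a time.
# Each iteration mimics the local work of one process that owns
# the rows `rows` of A and has access to all of B (an assumption).
function blocked_matmul(A::SparseMatrixCSC, B::AbstractMatrix, nparts::Integer)
    T = promote_type(eltype(A), eltype(B))
    C = zeros(T, size(A, 1), size(B, 2))
    for rows in row_blocks(size(A, 1), nparts)
        C[rows, :] = A[rows, :] * B   # local sparse-times-dense product
    end
    return C
end

# Small usage example: the blocked result should match the direct product.
Random.seed!(0)
A = sprand(1000, 1000, 0.01)   # sparse test matrix
B = rand(1000, 8)              # tall-and-skinny dense matrix
C = blocked_matmul(A, B, 4)
@assert C ≈ A * B
```

With this layout the output rows are independent, so no reduction across processes is needed. The cost that matters in a distributed run is getting the required rows of B to each process, which this sketch does not model.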

One candidate might be Distributed.jl together with DistributedArrays.jl, but this approach does not seem to work well, and I cannot get the expected scaling (see my other post with questions on matrix-matrix products using DistributedArrays).

Another candidate might be PartitionedArrays.jl. However, I am new to both Julia and MPI, and the test code in PartitionedArrays.jl is overwhelming for me to read. What’s more, the authors have opened an issue asking for help with implementing matrix-matrix products (see the issues waiting for help in PartitionedArrays.jl). So I wonder how difficult it is to implement a distributed matrix-vectors product using the package.

Could anyone offer me any advice? Or even an example?

I have a similar question, but about dense matrices rather than sparse ones.