Parallel sparse matrix vector product

Hello, guys!
Matrix-matrix and matrix-vector products are at the core of linear algebra.
For dense matrices, the underlying implementation uses OpenBLAS and benefits hugely from multithreading.

For sparse matrices, however, we only have sequential Julia implementations of matmat and matvec.
Are there any plans to develop parallel (multithreaded/multi-worker) matmat and matvec operations for sparse matrices?
I believe this is an important piece of functionality and should be included in stdlib/sparse.
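
As a stopgap I have been hand-rolling something like the sketch below (the function name is mine, and I make no claims about its performance). It threads over the columns of A while computing A' * x, because each column of A then contributes to exactly one entry of the result, so there are no write races:

```julia
using SparseArrays
using Base.Threads

# Multithreaded y = A' * x for a SparseMatrixCSC.
# Parallelising over the columns of A is race-free here, since column j
# of A only ever writes to y[j].
function threaded_At_mul_x(A::SparseMatrixCSC, x::AbstractVector)
    n = size(A, 2)
    T = promote_type(eltype(A), eltype(x))
    y = Vector{T}(undef, n)
    rows = rowvals(A)
    vals = nonzeros(A)
    @threads for j in 1:n
        s = zero(T)
        for k in nzrange(A, j)      # nonzeros stored in column j of A
            s += vals[k] * x[rows[k]]
        end
        y[j] = s
    end
    return y
end

# Sanity check against the sequential stdlib product:
A = sprand(10^5, 10^5, 1e-4); x = rand(10^5)
threaded_At_mul_x(A, x) ≈ A' * x
```

The same trick does not carry over directly to y = A * x, because several columns then update the same entries of y.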

Is there some kind of recommended replacement to stick to in the meantime?

Which kind of parallelism do you mean: shared memory (SMP) or distributed memory (DMP)? Did you check out DistributedArrays?
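
In case you haven't: the basic pattern is that `distribute` splits an array into chunks owned by different workers, and operations such as `map` then run on each local chunk. A minimal sketch (the worker count and sizes are arbitrary):

```julia
using Distributed
addprocs(4)                          # illustrative: one process per core/node
@everywhere using DistributedArrays

dx = distribute(rand(10^6))          # chunks of dx live on the 4 workers
dy = map(sin, dx)                    # each worker maps over its own chunk
```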

MKL has parallel matvec.
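
If I remember correctly, the easiest way to reach it from Julia is MKLSparse.jl, which (as I understand it) re-routes the stdlib sparse products through MKL's multithreaded kernels once the package is loaded. A hedged sketch:

```julia
using SparseArrays
using MKLSparse     # assumption: loading this makes A * x dispatch to MKL's
                    # multithreaded sparse BLAS instead of the sequential kernel

A = sprand(100_000, 100_000, 1e-4)
x = rand(100_000)
y = A * x           # same code as before, now (presumably) running in parallel
```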

It seems that in the end we need a combination of both:
The arrays storing the sparse-matrix data should be distributed across the available nodes, but shared between the cores within each node.
One could argue that distributing everything beforehand is enough, but, as I understand it, there is no substantial overhead in making memory shared for cores on the same node, and doing so can reduce the cost of exchanging information between those cores.
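
To make the intent concrete, here is a deliberately naive sketch of the distributed level only (the function name and the column-block splitting are mine; the shared-memory level within a node is left out). Each worker multiplies one column block of A by the matching slice of x, and the dense partial results are summed on the master:

```julia
using Distributed
addprocs(2)                          # stand-in for "one worker per node"
@everywhere using SparseArrays

function distributed_matvec(A::SparseMatrixCSC, x::AbstractVector)
    n = size(A, 2)
    blocks = collect(Iterators.partition(1:n, cld(n, nworkers())))
    partials = pmap(blocks) do cols
        # In a real implementation each block would already live on its worker
        # (e.g. as the local part of a DistributedArray); the closure here
        # ships A to every worker, which is only acceptable for a sketch.
        A[:, cols] * x[cols]
    end
    return reduce(+, partials)
end
```

Inside each worker the local product could in turn use threads (or MKL), which is exactly the two-level combination I have in mind.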

Is it possible to make the chunks of a DistributedArray be SharedArrays?

Does MKL support distributed computations?

Is it optimised for non-uniform memory access (NUMA)?