Multithreading with sparse arrays

I am solving a 2D earthquake rupture dynamics problem using a spectral element method.

Over the last few months, I have translated a matlab code into Julia, made some changes in the serial part of the code (changing linear system solvers and changing the assemby the stiffness matrix) and gained like ~50 times speedup (special thanks to IterativeSolvers.jl and this community in general). I think I have reached the limit of serial performance.

As I move on to bigger problems (~ 20-30 million degrees of freedom), I feel I should do this parallel thing more seriously.

I recall that the default BLAS default multithreading doesn’t work with SparseArrays (I might be wrong on this).

The key parts I am looking to optimize are:
1. Linear System Solver (I use AlgebraicMultigrd.jl as preconditioner with the cg from IterativeSolvers.jl).
2. SparseMatrix and dense Vector multiplication

I have tried the following (which doesn’t give me a very good speedup on my machine with 4 cores):

import Base: eltype, size

struct ThreadedMul{Tv,Ti}

function LinearAlgebra.mul!(y::AbstractVector, M::ThreadedMul, x::AbstractVector)
    @threads for i = 1 : M.A.n
         _threaded_mul!(y, M.A, x, i)

@inline function _threaded_mul!(y, A::SparseMatrixCSC{Tv}, x, i) where {Tv}
    s = zero(Tv)
    @inbounds for j = A.colptr[i] : A.colptr[i + 1] - 1
        s += A.nzval[j] * x[A.rowval[j]]
    @inbounds y[i] = s

eltype(M::ThreadedMul) = eltype(M.A)
size(M::ThreadedMul, I...) = size(M.A, I...)

using AlgebraicMultigrid, IterativeSolvers

# Stiffness matrix = K is in SparseMatrixCSC format
# rhs = F (some dense vector)

ml = ruge_stuben(K)
p = aspreconditioner(ml)

K = ThreadedMul(K)

while time < tmax
    # 1. Preconditioned conjugate gradient
    cg!(d2, K, rhs, Pl=p, tol=1e-6)    # d2 is a dense vector with some initial constant value

    # 2. Matrix multiplication
    mul!(a, K, d2)

    # Some other boundary stuff
end while

Is this a correct implementation of multithreading? Should I also create a struct for rhs? K is very big but my computer can handle sparse matrix upto 30 million degrees of freedom, but the computations become super slow.

I know that the standard approach would be to use Trillinos or PetSc but I am not very good with those and I wonder if there is a native Julia way to work out the speedup. I also have access to large memory clusters but I don’t know if they would help.

I am also open to looking at other approaches to solve this (DistributedArrays, etc.)