I am solving a 2D earthquake rupture dynamics problem using a spectral element method.

Over the last few months, I have translated a matlab code into Julia, made some changes in the serial part of the code (changing linear system solvers and changing the assemby the stiffness matrix) and gained like ~50 times speedup (special thanks to IterativeSolvers.jl and this community in general). I think I have reached the limit of serial performance.

As I move on to bigger problems (~ 20-30 million degrees of freedom), I feel I should do this parallel thing more seriously.

I recall that the default BLAS default multithreading doesn’t work with SparseArrays (I might be wrong on this).

The key parts I am looking to optimize are:

`1. Linear System Solver (I use AlgebraicMultigrd.jl as preconditioner with the cg from IterativeSolvers.jl). `

`2. SparseMatrix and dense Vector multiplication`

I have tried the following (which doesn’t give me a very good speedup on my machine with 4 cores):

```
import Base: eltype, size
struct ThreadedMul{Tv,Ti}
A::SparseMatrixCSC{Tv,Ti}
end
function LinearAlgebra.mul!(y::AbstractVector, M::ThreadedMul, x::AbstractVector)
@threads for i = 1 : M.A.n
_threaded_mul!(y, M.A, x, i)
end
y
end
@inline function _threaded_mul!(y, A::SparseMatrixCSC{Tv}, x, i) where {Tv}
s = zero(Tv)
@inbounds for j = A.colptr[i] : A.colptr[i + 1] - 1
s += A.nzval[j] * x[A.rowval[j]]
end
@inbounds y[i] = s
y
end
eltype(M::ThreadedMul) = eltype(M.A)
size(M::ThreadedMul, I...) = size(M.A, I...)
using AlgebraicMultigrid, IterativeSolvers
# Stiffness matrix = K is in SparseMatrixCSC format
# rhs = F (some dense vector)
ml = ruge_stuben(K)
p = aspreconditioner(ml)
K = ThreadedMul(K)
while time < tmax
# 1. Preconditioned conjugate gradient
cg!(d2, K, rhs, Pl=p, tol=1e-6) # d2 is a dense vector with some initial constant value
# 2. Matrix multiplication
mul!(a, K, d2)
# Some other boundary stuff
end while
```

Is this a correct implementation of multithreading? Should I also create a struct for rhs? K is very big but my computer can handle sparse matrix upto 30 million degrees of freedom, but the computations become super slow.

I know that the standard approach would be to use Trillinos or PetSc but I am not very good with those and I wonder if there is a native Julia way to work out the speedup. I also have access to large memory clusters but I don’t know if they would help.

I am also open to looking at other approaches to solve this (DistributedArrays, etc.)

Thanks,