I am solving a 2D earthquake rupture dynamics problem using a spectral element method.
Over the last few months, I have translated a matlab code into Julia, made some changes in the serial part of the code (changing linear system solvers and changing the assemby the stiffness matrix) and gained like ~50 times speedup (special thanks to IterativeSolvers.jl and this community in general). I think I have reached the limit of serial performance.
As I move on to bigger problems (~ 20-30 million degrees of freedom), I feel I should do this parallel thing more seriously.
I recall that the default BLAS default multithreading doesn’t work with SparseArrays (I might be wrong on this).
The key parts I am looking to optimize are:
1. Linear System Solver (I use AlgebraicMultigrd.jl as preconditioner with the cg from IterativeSolvers.jl).
2. SparseMatrix and dense Vector multiplication
I have tried the following (which doesn’t give me a very good speedup on my machine with 4 cores):
import Base: eltype, size
struct ThreadedMul{Tv,Ti}
A::SparseMatrixCSC{Tv,Ti}
end
function LinearAlgebra.mul!(y::AbstractVector, M::ThreadedMul, x::AbstractVector)
@threads for i = 1 : M.A.n
_threaded_mul!(y, M.A, x, i)
end
y
end
@inline function _threaded_mul!(y, A::SparseMatrixCSC{Tv}, x, i) where {Tv}
s = zero(Tv)
@inbounds for j = A.colptr[i] : A.colptr[i + 1] - 1
s += A.nzval[j] * x[A.rowval[j]]
end
@inbounds y[i] = s
y
end
eltype(M::ThreadedMul) = eltype(M.A)
size(M::ThreadedMul, I...) = size(M.A, I...)
using AlgebraicMultigrid, IterativeSolvers
# Stiffness matrix = K is in SparseMatrixCSC format
# rhs = F (some dense vector)
ml = ruge_stuben(K)
p = aspreconditioner(ml)
K = ThreadedMul(K)
while time < tmax
# 1. Preconditioned conjugate gradient
cg!(d2, K, rhs, Pl=p, tol=1e-6) # d2 is a dense vector with some initial constant value
# 2. Matrix multiplication
mul!(a, K, d2)
# Some other boundary stuff
end while
Is this a correct implementation of multithreading? Should I also create a struct for rhs? K is very big but my computer can handle sparse matrix upto 30 million degrees of freedom, but the computations become super slow.
I know that the standard approach would be to use Trillinos or PetSc but I am not very good with those and I wonder if there is a native Julia way to work out the speedup. I also have access to large memory clusters but I don’t know if they would help.
I am also open to looking at other approaches to solve this (DistributedArrays, etc.)
Thanks,