Is the linear system solver \ also multi threaded in Julia as in Matlab? And how to “multithread” it in Julia?

I am trying to compare speed and performance between Matlab and Julia. I am looking at a code that does topology optimization of a continuum structure subjected to a given load. The code I am looking at is the public code topopt88.m: https://www.topopt.mek.dtu.dk/Apps-and-software/Efficient-topology-optimization-in-MATLAB

Essentially it is an iterative algorithm where in every iteration a Ax=b system is solved (x=A\b), where the sparse matrix A depends on the structural design (it is the finite element stiffness matrix) and it is updated in every iteration.

In Julia the same code runs slower than Matlab. I have done some code optimization in Julia, declaring types in function definitions, using functions as much as possible, avoiding global variables, and implementing other tips I found in the internet. But Julia is still slower than the same Matlab code (same in the sense of conceptual steps).

My question: since Matlab system solve “” is multi threaded by default, is it true the same for Julia? If not, how to multi thread Julia’s \ operator, or to get speed-ups from parallelization similarly?

P.S.
The following is the current best solution that I found and that gives the best performance to me. Outside Julia in a terminal in Windows type set JULIA_NUM_THREADS=4 . Start Julia and use the Linear Algebra module this way: using LinearAlgebra BLAS.set_num_threads(1)

In this way there will be 4 threads, and by setting the BLAS threads to 1 it basically will not use its own pool that seems to conflict with Julia threads pool, or at least to slow the performance for me. I got some hint in this regard here: Julia Threads vs BLAS threads My current BLAS setting in Julia 1.5.1 is julia> BLAS.vendor() :openblas64

Using Pardiso.jl with the default MKL sparse solver didn’t help. I still need to figure out if there is a way to increase performance there too and how.

1 Like

Any chance you can post the code you’re using in Julia? Functions like \ have dozens of different possible methods that will be called depending on the type of arguments they receive. This it is much more predictive to talk about the performance of a specific piece of code than an algorithm in general.

1 Like

The full code is a bit too long I feel, but essentially the heart of the code is a while loop until convergence of the design update. In every while loop iteration the following system is solved:

#FE-ANALYSIS
sK = reshape(KE[:]*(Emin.+xPhys[:]'.^penal*(E0-Emin)),64*nelx*nely)
K = sparse(iK,jK,sK)
K = (K+K')/2
U[freedofs] = K[freedofs,freedofs]\F 

where freedofs are indices of rows and columns to be selected for the analysis. K is sparse, symmetric, and positive definite hence invertible; F an U are normal arrays.

Does this help?

You should probably try benchmarking your code, but it could be that the way you are using the sparse matrix is non-optimal… See here: https://docs.julialang.org/en/v1/stdlib/SparseArrays/
Note specially the following:

Indexing operations, especially assignment, are expensive, when carried out one element at a time. In many cases it may be better to convert the sparse matrix into (I,J,V) format using findnz , manipulate the values or the structure in the dense vectors (I,J,V) , and then reconstruct the sparse matrix.

2 Likes

Try

Symmetric(K[freedofs, freedofs]) \ F

even though the matrix is analytically symmetric, it may not be so numerically. You can check with issymetric(K).

2 Likes

Thank you @kristoffer.carlsson. The symmetry is ensured numerically (and not only analytically) by doing:

K = (K+K')/2

Do you think that Julia would prefer the Symmetric command? Doesn’t this slow down things more? I will try anyway, thanks.

2 Likes

Even if a matrix is numerically symmetric, telling Julia that often gives a 2x speedup.

1 Like

I removed
K = (K+K')/2 ,
and replaced
U[freedofs] = K[freedofs,freedofs]\F
with
U[freedofs] = Symmetric(K[freedofs,freedofs])\F
and gained in performance. Thanks!
I still think that there might be more to it. I am setting the number of threads outside Julia as 4
set JULIA_NUM_THREADS=4,
and in the script I set
using LinearAlgebra
BLAS.set_num_threads(1)
and in this way I get the best performance using openblas.
I feel that mutithreading and sparse matrix definition could be pushed more to gain more in computational speed.

Does your matrix have banded it blocked-banded structure? If so, using that could easily give you a factor of 10 speedup. Also, those structures are much easier to parallelize than CSC

The benchmark:

BenchmarkTools.Trial:
  memory estimate:  16.82 GiB
  allocs estimate:  29092
  --------------
  minimum time:     18.681 s (5.05% GC)
  median time:      18.681 s (5.05% GC)
  mean time:        18.681 s (5.05% GC)
  maximum time:     18.681 s (5.05% GC)
  --------------
  samples:          1
  evals/sample:     1

I was thinking more of benchmarking each individual call in your sample code, like the cost of

K = sparse(iK,jK,sK)
K = (K+K')/2

and

K[freedofs,freedofs]

and then compare that with the actual time for solving the system.

For large symmetric positive definite matrix you are also likely better of to use an interative solver like conjugate gradient. Would be a good idea to try that out:

3 Likes

How big is the matrix?

I have a mesh of 300x100 elements in 2D, with 2 degrees of freedom per node, which gives around 60800 dofs.

Then the matrix must be extremely dense, for some reason. I solve 2D heat conduction models with 1 million dofs in ~5seconds. Try to find the number of nonzeros, total and per row.

Is there a better way to impose boundary conditions in Julia than writing K[freedofs,freedofs]? Maybe this eats some time…

I do precisely this in Elfel.jl. No problem, it only takes about 0.1 sec.

OK thank you very much

I will try

Hey, just in case: check the type of the matrix when you get to solve. To see if it did not get converted to a dense matrix at some point…

1 Like