I am trying to compare speed and performance between Matlab and Julia on a code that does topology optimization of a continuum structure subjected to a given load. The code I am looking at is the public topopt88.m code ("Efficient topology optimization in MATLAB using 88 lines of code", from the TopOpt group).
Essentially it is an iterative algorithm where, in every iteration, an Ax=b system is solved (x = A\b). The sparse matrix A is the finite element stiffness matrix; it depends on the structural design and is updated in every iteration.
In Julia the same code runs slower than in Matlab. I have done some code optimization in Julia: declaring types in function definitions, using functions as much as possible, avoiding global variables, and applying other tips I found on the internet. But Julia is still slower than the same Matlab code (same in the sense of conceptual steps).
My question: since Matlab's system solve \ (mldivide) is multi-threaded by default, is the same true for Julia? If not, how do I multi-thread Julia's \ operator, or otherwise get similar speed-ups from parallelization?
P.S.
The following is the current best solution that I found and that gives the best performance for me. Outside Julia, in a terminal on Windows, type set JULIA_NUM_THREADS=4. Then start Julia and use the LinearAlgebra module like this: using LinearAlgebra followed by BLAS.set_num_threads(1).
In this way there will be 4 Julia threads, and by setting the BLAS threads to 1, BLAS basically will not use its own thread pool, which seems to conflict with Julia's thread pool, or at least slows down performance for me. I got a hint in this regard here: Julia Threads vs BLAS threads - #15 by alkorang. My current BLAS backend in Julia 1.5.1, reported by BLAS.vendor(), is :openblas64.
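To summarize, the setup that currently works best for me looks like this (a minimal sketch, assuming Windows, Julia 1.5.1 with OpenBLAS, and a 4-core machine):

```julia
# In a Windows terminal, before starting Julia:
#   set JULIA_NUM_THREADS=4
#   julia

using LinearAlgebra

Threads.nthreads()        # should report 4 Julia threads
BLAS.set_num_threads(1)   # keep OpenBLAS single-threaded so it does not
                          # compete with the Julia thread pool
BLAS.vendor()             # returns :openblas64 on my setup
```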
Using Pardiso.jl with the default MKL sparse solver didn’t help. I still need to figure out if there is a way to increase performance there too and how.
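For completeness, my Pardiso.jl attempt followed roughly the pattern sketched below. The random system is only a stand-in for K[freedofs,freedofs] and F, the thread count of 4 matches my machine, and MKLPardisoSolver, set_nprocs!, and solve are the Pardiso.jl entry points as I understand them:

```julia
using Pardiso, SparseArrays, LinearAlgebra

# Random SPD stand-in for the reduced stiffness system K[freedofs, freedofs] \ F
n = 1_000
A = sprand(n, n, 0.01)
A = A + A' + 2n * I          # symmetrize and make diagonally dominant (hence SPD)
b = rand(n)

ps = MKLPardisoSolver()      # MKL backend of Pardiso.jl
set_nprocs!(ps, 4)           # let MKL Pardiso use 4 threads (my core count)
x = solve(ps, A, b)
```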
Any chance you can post the code you're using in Julia? Functions like \ have dozens of different possible methods that will be called depending on the types of arguments they receive, so it is much more productive to talk about the performance of a specific piece of code than of an algorithm in general.
The full code is a bit too long, I feel, but essentially the heart of the code is a while loop that runs until convergence of the design update. In every iteration the following system is solved:
```julia
# FE-ANALYSIS
sK = reshape(KE[:] * (Emin .+ xPhys[:]' .^ penal * (E0 - Emin)), 64 * nelx * nely)  # SIMP-interpolated element stiffness values
K = sparse(iK, jK, sK)                    # assemble global stiffness matrix from triplets
K = (K + K') / 2                          # enforce exact symmetry
U[freedofs] = K[freedofs, freedofs] \ F   # solve the reduced system
```
where freedofs are the indices of the rows and columns to be kept for the analysis (the free degrees of freedom). K is sparse, symmetric, and positive definite, hence invertible; F and U are ordinary dense arrays.
You should probably try benchmarking your code, but it could be that the way you are using the sparse matrix is non-optimal… See here: Sparse Arrays · The Julia Language
Note especially the following:
Indexing operations, especially assignment, are expensive, when carried out one element at a time. In many cases it may be better to convert the sparse matrix into (I,J,V) format using findnz, manipulate the values or the structure in the dense vectors (I,J,V) , and then reconstruct the sparse matrix.
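As a toy illustration (not the actual FE assembly), constructing the matrix from (I, J, V) triplets in one sparse call, which your sparse(iK, jK, sK) line already does, is much cheaper than inserting entries one at a time into an existing sparse matrix:

```julia
using SparseArrays

# Toy 1D "stiffness" assembly from (I, J, V) triplets; duplicate entries are
# summed, exactly as in finite element assembly.
n = 10_000
Is, Js, Vs = Int[], Int[], Float64[]
for e in 1:n-1
    for (i, j, v) in ((e, e, 1.0), (e, e+1, -1.0), (e+1, e, -1.0), (e+1, e+1, 1.0))
        push!(Is, i); push!(Js, j); push!(Vs, v)
    end
end
K = sparse(Is, Js, Vs, n, n)

# The slow alternative: inserting entries one at a time keeps changing the
# sparsity pattern of an existing SparseMatrixCSC, e.g.
# K2 = spzeros(n, n)
# for e in 1:n-1
#     K2[e, e] += 1.0; K2[e, e+1] -= 1.0; K2[e+1, e] -= 1.0; K2[e+1, e+1] += 1.0
# end
```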
I removed K = (K+K')/2 ,
and replaced U[freedofs] = K[freedofs,freedofs]\F
with U[freedofs] = Symmetric(K[freedofs,freedofs])\F
and gained in performance. Thanks!
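If I understand the dispatch correctly, the Symmetric wrapper tells \ up front that the matrix is symmetric, so it can go straight to a CHOLMOD Cholesky/LDLt factorization instead of first probing the structure of a general sparse matrix. A self-contained sketch of the two calls, with a tridiagonal SPD matrix standing in for K[freedofs,freedofs]:

```julia
using SparseArrays, LinearAlgebra

# SPD tridiagonal stand-in for the reduced stiffness matrix
n = 20_000
A = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(2.0, n), 1 => fill(-1.0, n - 1))
b = rand(n)

x1 = A \ b              # generic sparse solve
x2 = Symmetric(A) \ b   # declares symmetry, enabling the Cholesky/LDLt path
@assert x1 ≈ x2
```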
I still think that there might be more to it. I am setting the number of threads outside Julia to 4 with set JULIA_NUM_THREADS=4, and in the script I call using LinearAlgebra followed by BLAS.set_num_threads(1); this way I get the best performance with OpenBLAS.
I feel that multithreading and the sparse matrix setup could be pushed further to gain more computational speed.
Does your matrix have banded or block-banded structure? If so, exploiting that could easily give you a factor of 10 speedup. Also, those structures are much easier to parallelize than CSC.
For a large symmetric positive definite matrix you are also likely better off using an iterative solver like conjugate gradients. It would be a good idea to try that out:
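For example with IterativeSolvers.jl (my package choice for this sketch; the 2D Laplacian below is only a stand-in for your stiffness matrix, and in practice a good preconditioner matters a lot):

```julia
using SparseArrays, LinearAlgebra, IterativeSolvers

# 2D Laplacian as an SPD stand-in for the reduced stiffness matrix
m = 200
T = spdiagm(-1 => fill(-1.0, m - 1), 0 => fill(2.0, m), 1 => fill(-1.0, m - 1))
A = kron(sparse(I, m, m), T) + kron(T, sparse(I, m, m))   # m^2 x m^2, SPD
b = rand(m^2)

x = cg(A, b)            # plain conjugate gradient; tolerances, iteration limits,
                        # and a preconditioner can be passed as keyword arguments
xd = A \ b              # direct solve, for comparison
norm(x - xd) / norm(xd) # relative difference between iterative and direct solutions
```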
Then the matrix must be extremely dense, for some reason. I solve 2D heat conduction models with 1 million dofs in ~5 seconds. Try to find the number of nonzeros, total and per row.
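A quick way to check (a sketch; the commented call at the end uses the K and freedofs names from your code):

```julia
using SparseArrays

# Print basic sparsity statistics for the matrix actually handed to the solver.
function sparsity_report(A::SparseMatrixCSC)
    n = size(A, 1)
    println("size             : ", size(A))
    println("total nonzeros   : ", nnz(A))
    println("nonzeros per row : ", round(nnz(A) / n, digits = 1))
end

# In the FE loop:
# sparsity_report(K[freedofs, freedofs])
```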