Ccall Fortran OpenBLAS multithreading issues with Julia 1.7

I am using Julia to drive a Fortran-based boundary element method (BEM) code. The Fortran code uses OpenMP and OpenBLAS for multithreading.

When I call into the BEM code from Julia 1.7, I now only get single-threaded performance.

On my previous computer with Julia 1.6, I was able to use ccall to call into the Fortran code and get the same multithreading performance as when I compiled a stand-alone Fortran driver/executable. I was not setting any environment variables; it just “worked”.
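For context, the call looks roughly like this (a minimal sketch; the library path, entry-point name, and argument list here are made up, not my actual interface):

const libbem = "/path/to/libbem.so"   # hypothetical path to the Fortran BEM library
n = Int32(1000)
x = zeros(Float64, n)
# Fortran expects arguments by reference; gfortran appends a trailing underscore to symbol names
ccall((:bem_assemble_, libbem), Cvoid, (Ref{Int32}, Ptr{Float64}), n, x)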

I know it's an OpenBLAS interaction, since if I add

using MKL

to the top of my Julia driver, then when I ccall into the Fortran code I get the same multithreading performance as a stand-alone Fortran driver.

What are the magic incantations (environment variables?) to tell my Fortran/OpenBLAS code to use all 48 threads? (In this case, Julia is only the driver, so I don't care if Julia is single-threaded, although it would be great if both could be multithreaded.)

When I compile OpenBLAS for the Fortran BEM library, I use

make USE_OPENMP=1 DYNAMIC_ARCH=1 NUM_THREADS=48

Is this the correct incantation for compiling OpenBLAS to “play nice” with Julia multithreading?
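One thing I can do to sanity-check the build is ask that OpenBLAS at runtime how many threads it thinks it has (a sketch; the path is a placeholder for wherever my libopenblas ends up):

using Libdl
h = dlopen("/path/to/my/libopenblas.so")   # hypothetical install path of the custom build
nth = ccall(dlsym(h, :openblas_get_num_threads), Cint, ())
println("custom OpenBLAS reports ", nth, " threads")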

In theory, should I try to link my Fortran code to the same OpenBLAS library Julia uses? Or is that a bad idea?

Thanks


Or maybe

using MKL

sets the environment variables and threading parameters “correctly”, so it's not necessarily an OpenBLAS interaction…

Googling “OpenBLAS” and Julia brings up the fascinating discussion of libblastrampoline, which gives me the impression that OpenMP, OpenBLAS, and BLAS libraries in general are really messy (especially when trying to support multiple hardware/OS combinations).
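One libblastrampoline-related thing I did learn: from the Julia side you can at least see which BLAS backend is currently behind LBT on 1.7 (and whether using MKL swaps OpenBLAS out):

using LinearAlgebra
BLAS.get_config()        # lists the BLAS/LAPACK libraries loaded behind libblastrampoline
BLAS.get_num_threads()   # thread count of the active backend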

So any advice or pointers appreciated…

Try setting OMP_NUM_THREADS=48.

Unfortunately, neither
ENV["OMP_NUM_THREADS"] = 48
at the top of the Julia driver nor
export OMP_NUM_THREADS=48
in the shell restores multithreading performance.
However, putting
using MKL
at the top of the Julia driver code does restore multithreading (top goes from 99% CPU utilization to 3980% utilization during the mostly-embarrassingly-parallel quadrature phase of the BEM code).
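One more data point I can collect: asking the OpenMP runtime in-process what it thinks the thread limit is (a sketch; I'm assuming the BEM library was built with gfortran, so libgomp is the OpenMP runtime and its soname on Linux is libgomp.so.1):

using Libdl
libgomp = dlopen("libgomp.so.1")   # soname is an assumption (Linux/gfortran)
maxthr = ccall(dlsym(libgomp, :omp_get_max_threads), Cint, ())
println("OpenMP max threads = ", maxthr)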

Any other suggestions for debugging this would be greatly appreciated. Using MKL is certainly a fine stopgap solution, but I'd like to figure out how to make OpenBLAS/OpenMP happy if possible.

Thanks

If your Fortran library only uses the standard BLAS/LAPACK API (and not OpenBLAS internals), you should be able to build it so that it simply links to the libblastrampoline.so in your Julia installation instead of OpenBLAS. Then threading should be handled consistently and (fairly) safely, and there would be other advantages. Let's ask @staticfloat if there are known problems with this approach.
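To find where that library lives, you can look at what is already loaded in the running Julia process; the link flags in the comment below are only an illustration of the idea, not an exact recipe:

using Libdl
filter(contains("blastrampoline"), dllist())
# then link the Fortran library against that path instead of OpenBLAS, roughly:
#   gfortran ... -L<julia>/lib/julia -lblastrampoline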

Linking directly against LBT is fine, but it mostly helps with being BLAS-vendor independent rather than with playing well with threads.

Can you try BLAS.set_num_threads() and see if changing that does anything? I wonder if the version of OpenBLAS we ship in 1.7 has different multithreading thresholds and just doesn't think your problem is worth splitting up into chunks. Can you try invoking a giant square GEMM or something like that just to ensure that it's not due to OpenBLAS not wanting to multithread a really skinny matrix?
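Something along these lines (sizes arbitrary, just big enough to make threading worthwhile) should light up every core if OpenBLAS threading is working at all:

using LinearAlgebra
BLAS.set_num_threads(48)
A = rand(8000, 8000); B = rand(8000, 8000)
@time A * B;   # watch top/htop while this runs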

BLAS.set_num_threads(48)
does not help.

For this Fortran driver, I'm not concerned about Julia BLAS multithreading; it's just that when I ccall() into the BEM Fortran library, I get only single-threaded performance in the Fortran library, unless I put
using MKL
at the top of my Julia driver file, in which case I get the same OpenMP/OpenBLAS multithreading performance as when I compile a stand-alone Fortran executable.

Since I'm linking the Fortran library to an OpenBLAS library I'm compiling from source, the only thing I can think of is that the Julia OpenBLAS/OpenMP implementation is fighting with the Fortran OpenBLAS/OpenMP threading…
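If it helps the diagnosis, I can dump which BLAS/OpenMP shared libraries actually end up loaded in the Julia process after the ccall (a sketch, filtering on names I expect to see):

using Libdl
foreach(println, filter(p -> occursin(r"openblas|gomp|iomp|mkl"i, p), dllist()))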
