Ccall Fortran OpenBLAS multithreading issues with Julia 1.7

I am using Julia to drive a Fortran-based boundary element method (BEM) code. The Fortran code uses OpenMP and OpenBLAS for multithreading.

When I call into the BEM code from Julia 1.7, I now only get single-threaded performance.

On my previous computer with Julia 1.6, I was able to use ccall to call into the Fortran code and get the same multithreading performance as when I compiled a stand-alone Fortran driver/executable. I was not setting any environment variables; it just “worked”.
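For context, the call looks roughly like this (a minimal sketch; the library path, entry-point name, and argument list here are made up, not my actual interface):

const libbem = "/path/to/libbem.so"   # hypothetical path to the Fortran BEM library
n = Int32(1000)
x = zeros(Float64, n)
# Fortran expects arguments by reference; gfortran appends a trailing underscore to symbol names
ccall((:bem_assemble_, libbem), Cvoid, (Ref{Int32}, Ptr{Float64}), n, x)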

I know it's an OpenBLAS interaction, since if I add

using MKL

to the top of my Julia driver, then when I ccall into the Fortran code I get the same multithreading performance as a stand-alone Fortran driver.

What are the magic incantations (environment variables?) to tell my Fortran/OpenBLAS code to use all 48 threads? (In this case, Julia is only the driver, so I don't care if Julia is single-threaded, although it would be great if both could be multithreaded.)

When I compile OpenBLAS for the Fortran BEM library, I use

make USE_OPENMP=1 DYNAMIC_ARCH=1 NUM_THREADS=48

Is this the correct incantation for compiling OpenBLAS to “play nice” with Julia multithreading?
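One thing I can do to sanity-check the build is ask that OpenBLAS at runtime how many threads it thinks it has (a sketch; the path is a placeholder for wherever my libopenblas ends up):

using Libdl
h = dlopen("/path/to/my/libopenblas.so")   # hypothetical install path of the custom build
nth = ccall(dlsym(h, :openblas_get_num_threads), Cint, ())
println("custom OpenBLAS reports ", nth, " threads")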

In theory, should I try to link my Fortran code to the same OpenBLAS library Julia uses? Or is that a bad idea?

Thanks


Or maybe

using MKL

sets the environment variables and threading parameters “correctly”, so it's not necessarily an OpenBLAS interaction…

Googling “OpenBLAS” and Julia brings up the fascinating discussion of libblastrampoline, which gives me the impression that OpenMP, OpenBLAS, and BLAS libraries in general are really messy (especially when trying to support multiple hardware/OS combinations).
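One libblastrampoline-related thing I did learn: from the Julia side you can at least see which BLAS backend is currently behind LBT on 1.7 (and whether using MKL swaps OpenBLAS out):

using LinearAlgebra
BLAS.get_config()        # lists the BLAS/LAPACK libraries loaded behind libblastrampoline
BLAS.get_num_threads()   # thread count of the active backend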

So any advice or pointers appreciated…

Try setting OMP_NUM_THREADS=48.

Unfortunately, neither
ENV["OMP_NUM_THREADS"] = 48
at the top of the Julia driver nor
export OMP_NUM_THREADS=48
in the shell restores multithreading performance.
However, putting
using MKL
at the top of the Julia driver code does restore multithreading (top goes from 99% CPU utilization to 3980% utilization during the mostly-embarrassingly-parallel quadrature phase of the BEM code).
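One more data point I can collect: asking the OpenMP runtime in-process what it thinks the thread limit is (a sketch; I'm assuming the BEM library was built with gfortran, so libgomp is the OpenMP runtime and its soname on Linux is libgomp.so.1):

using Libdl
libgomp = dlopen("libgomp.so.1")   # soname is an assumption (Linux/gfortran)
maxthr = ccall(dlsym(libgomp, :omp_get_max_threads), Cint, ())
println("OpenMP max threads = ", maxthr)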

Any other suggestions for debugging this would be greatly appreciated. Using MKL is certainly a fine stopgap solution, but I'd like to figure out how to make OpenBLAS/OpenMP happy if possible.

Thanks

If your Fortran library only uses the standard BLAS/LAPACK API (and not OpenBLAS internals), you should be able to build it so that it simply links to the libblastrampoline.so in your Julia installation instead of OpenBLAS. Then threading should be handled consistently and (fairly) safely, and there would be other advantages. Let's ask @staticfloat if there are known problems with this approach.
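To find where that library lives, you can look at what is already loaded in the running Julia process; the link flags in the comment below are only an illustration of the idea, not an exact recipe:

using Libdl
filter(contains("blastrampoline"), dllist())
# then link the Fortran library against that path instead of OpenBLAS, roughly:
#   gfortran ... -L<julia>/lib/julia -lblastrampoline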

Linking directly against LBT is fine, but it mostly helps with being BLAS-vendor independent rather than with playing well with threads.

Can you try BLAS.set_num_threads() and see if changing that does anything? I wonder if the version of OpenBLAS we ship in 1.7 has different multithreading thresholds and just doesn't think your problem is worth splitting up into chunks. Can you try invoking a giant square GEMM or something like that just to ensure that it's not due to OpenBLAS not wanting to multithread a really skinny matrix?
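Something along these lines (sizes arbitrary, just big enough to make threading worthwhile) should light up every core if OpenBLAS threading is working at all:

using LinearAlgebra
BLAS.set_num_threads(48)
A = rand(8000, 8000); B = rand(8000, 8000)
@time A * B;   # watch top/htop while this runs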

BLAS.set_num_threads(48)
does not help.

For this Fortran driver, I'm not concerned about Julia BLAS multithreading; it's just that when I ccall() into the BEM Fortran library, I get only single-threaded performance in the Fortran library, unless I put
using MKL
at the top of my Julia driver file, in which case I get the same OpenMP/OpenBLAS multithreading performance as when I compile a stand-alone Fortran executable.

Since I'm linking the Fortran library to an OpenBLAS library I'm compiling from source, the only thing I can think of is that the Julia OpenBLAS/OpenMP implementation is fighting with the Fortran OpenBLAS/OpenMP threading…
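If it helps the diagnosis, I can dump which BLAS/OpenMP shared libraries actually end up loaded in the Julia process after the ccall (a sketch, filtering on names I expect to see):

using Libdl
foreach(println, filter(p -> occursin(r"openblas|gomp|iomp|mkl"i, p), dllist()))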
