How do I use multithreaded BLAS in each MPI process?

I would like to launch one MPI process per node and run multithreaded BLAS inside each process, the same setup as tested here and discussed at combining-distributed-computing-multithreading.

Specifically, I am trying to use Elemental.jl:

using MPI, MPIClusterManagers, Distributed
using BenchmarkTools

# One MPI worker per CPU core
np = 20
man = MPIManager(np = np)
addprocs(man)
println("Added procs $(procs())")

@everywhere using LinearAlgebra, Elemental, MPI

# Allocate the distributed operands on the MPI workers
@mpi_do man begin
    A = Elemental.DistMatrix{Float64}
    B = Elemental.uniform(A, 50000, 50000)
    C = Elemental.uniform(A, 50000)
    D = Elemental.uniform(A, 50000)
end

# Time the distributed matrix-vector product D = B * C
@btime @mpi_do man begin
    mul!(D, B, C, 1.0, 0.0)
end

However, the example above is the worst way to run this: it launches as many MPI processes as there are CPU cores, which incurs far too much communication overhead.

I noticed that addprocs has an enable_threaded_blas option. However, I still don’t know how to combine it with the MPI setup. How can I achieve this for the example code above, or is there another library in Julia’s ecosystem that supports it?
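For reference, this is roughly how I understand the option to work with plain Distributed workers, without MPI. It is only a minimal sketch: the worker count (2) and the BLAS thread count (4) are arbitrary values I picked for illustration, and I have not found the equivalent for MPIManager.

```julia
using Distributed

# Start 2 local workers and leave threaded BLAS enabled on them.
# (With the default enable_threaded_blas = false, workers are limited
# to a single BLAS thread.)
addprocs(2; enable_threaded_blas = true)

@everywhere using LinearAlgebra

# Explicitly request a BLAS thread count on every process.
# 4 is an arbitrary value; in practice it would be the cores per node.
@everywhere BLAS.set_num_threads(4)

# Report the BLAS thread count each worker actually ended up with.
for p in workers()
    n = remotecall_fetch(() -> BLAS.get_num_threads(), p)
    println("worker $p: $n BLAS threads")
end
```

What I am missing is how to get the same per-process BLAS threading when the workers are MPI ranks running Elemental.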

Thanks for your time and help!