Slowdown vs C++/Python for Eigen matrix multiplication via ccall

I’m hoping someone can explain the performance degradation that I’m seeing with Eigen matrix multiplication (and other methods in the Eigen C++ library).

I have a simple example here:

The C++ file has a simple example of a function that multiplies two 3000x3000 Eigen matrices together (with the BLAS preprocessor directive EIGEN_USE_BLAS), which is compiled into a library libccall.so as well as an executable ccall_ex. The library surfaces the function (matmulcpp) for use by Python and Julia.

void matmulcpp() {
    auto t0 = std::chrono::high_resolution_clock::now();
    const int N = 3000;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(N, N);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(N, N);
    Eigen::MatrixXd C = A * B;
    auto t1 = std::chrono::high_resolution_clock::now();
    double elapsed_ms = duration_cast<std::chrono::microseconds>(t1 - t0).count() / 1000.0;
    std::cout << "Elapsed: " << elapsed_ms << " ms\n";
}

Running the C++ executable outputs the timings (e.g.):

> ./build/ccall_ex
Elapsed: 604.718 ms
Elapsed: 573.624 ms
Elapsed: 585.415 ms
Elapsed: 559.289 ms
Elapsed: 576.903 ms

I get similar performance calling the function from Python using ctypes (e.g.):

> python3 ccall.py
Elapsed: 588.029 ms
Elapsed: 600.39 ms
Elapsed: 580.682 ms
Elapsed: 615.011 ms
Elapsed: 585.849 ms

But in Julia, using ccall, I get significantly worse performance (e.g.):

> julia ccall.jl
Elapsed: 1437.66 ms
Elapsed: 1405.58 ms
Elapsed: 1389.82 ms
Elapsed: 1394.7 ms
Elapsed: 1395.29 ms

Can anyone explain why I’m seeing this performance degradation? Thanks for the help!

Are you using the same BLAS library? The same number of BLAS threads?

That’s all that really matters for large matrix multiplication; the language impact should be tiny because all the work is done in the BLAS library. We see these kinds of posts over and over, and the differences invariably come down to the BLAS configuration.

Ah yes, if I set OPENBLAS_NUM_THREADS explicitly, then I get similar performance across all three. I suppose without setting this environmental variable then Julia defaults to 1.

Thanks!