# Ccall performance

Hi,
I am doing a matrix-matrix multiplication `A(12870*11440)*B(11440*11440)` using MKL’s `dgemm`. In C, the whole calculation takes about 20 seconds, but if I call `dgemm` through Julia’s `ccall` it takes 42.274859 seconds. The built-in `mul!(c,a,b)` does the same calculation in 21.586031 seconds. I can’t understand where these extra 20 seconds come from or how to get rid of them. I am attaching part of my code.

``````
julia> @time mul!(c,a,b);
21.586031 seconds
julia> @time ccall(("dgemm", libmkl_rt), Cvoid,
(Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt},
Ref{BlasInt}, Ref{Float64}, Ptr{Float64}, Ref{BlasInt},
Ptr{Float64}, Ref{BlasInt}, Ref{Float64}, Ptr{Float64},
Ref{BlasInt}),
'N', 'N', 12870, 11440, 11440, 1.0, a, 12870, b, 11440, 0.0, c, 12870)
42.274859 seconds (3 allocations: 48 bytes)
``````

As a general remark, including the full code needed to reproduce an issue greatly increases the chances that someone will be willing to help you.

In this case, I can’t reproduce your issue:

``````
julia> using MKL_jll, LinearAlgebra

julia> import LinearAlgebra: BlasInt

julia> A = rand(12870, 11440);

julia> B = rand(11440, 11440);

julia> C = Matrix{Float64}(undef, 12870, 11440);

julia> @time mul!(C, A, B);
33.708089 seconds (2.70 M allocations: 129.942 MiB)

julia> @time ccall(("dgemm", libmkl_rt), Cvoid,
(Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt},
Ref{BlasInt}, Ref{Float64}, Ptr{Float64}, Ref{BlasInt},
Ptr{Float64}, Ref{BlasInt}, Ref{Float64}, Ptr{Float64},
Ref{BlasInt}),
'N', 'N', 12870, 11440, 11440, 1.0, A, 12870, B, 11440, 0.0, C, 12870)
29.913322 seconds (67 allocations: 4.359 KiB)
``````

You didn’t provide the C code you used for your benchmark, and I’m not going to spend time writing it myself.


OK, I am attaching the full code:

``````
#include <stdio.h>
#include <stdlib.h>  /* calloc, free */
#include <math.h>
#include <mkl.h>
#include <omp.h>

int main()
{
    double *a, *b, *c;
    MKL_INT i, k, m, n;
    double alpha, beta;
    double wtime;

    alpha = 1.0;
    beta = 0.0;
    a = calloc(12870*11440, sizeof(double));
    b = calloc(11440*11440, sizeof(double));
    c = calloc(12870*11440, sizeof(double));
    m = 12870;
    n = 11440;
    k = 11440;
    /* fill A (m x k) and B (k x n) */
    for (i = 0; i < m*k; i++)
    {
        a[i] = 0.005*log(i+1);
    }
    for (i = 0; i < k*n; i++)
    {
        b[i] = sqrt(i)/sqrt(i+1);
    }
    /* time only the dgemm call: C = alpha*A*B + beta*C */
    wtime = omp_get_wtime();
    dgemm("N", "N", &m, &n, &k, &alpha, a, &m, b, &k, &beta, c, &m);
    wtime = omp_get_wtime() - wtime;
    printf("matrix mult took omp time=%lf\n", wtime);
    free(a);
    free(b);
    free(c);
    return 0;
}
``````

My output is

`matrix mult took omp time=13.502873`

If I define `omp_num_threads` as 32, it is

`matrix mult took omp time=18.259101`

And in Julia:

``````
julia> using LinearAlgebra,Libdl

julia> import LinearAlgebra: BlasInt
julia> const global libmkl_rt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"]);
julia> a = fill(0.0,(12870, 11440));
julia> b = fill(0.0,(11440, 11440));
julia> c = fill(0.0,(12870, 11440));
julia> for i=1:12870*11440
a[i]=0.005*log(i)
end
julia> for i=1:11440*11440
b[i]=sqrt(i-1)/sqrt(i)
end
julia> @time mul!(c,a,b);
21.586031 seconds
julia>  @time ccall(("dgemm", libmkl_rt), Cvoid,
(Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt},
Ref{BlasInt}, Ref{Float64}, Ptr{Float64}, Ref{BlasInt},
Ptr{Float64}, Ref{BlasInt}, Ref{Float64}, Ptr{Float64},
Ref{BlasInt}),
'N', 'N', 12870, 11440, 11440, 1.0, a, 12870, b, 11440, 0.0, c, 12870)
42.274859 seconds (3 allocations: 48 bytes)
``````

I don’t think this affects your `ccall` to MKL?

No.
In Julia, OpenBLAS is the default BLAS library. Calling `BLAS.set_num_threads(32)` changes only the number of OpenBLAS threads; it does not change the number of MKL threads used by the `ccall`.
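
A minimal sketch of how the MKL thread count could be set from Julia when the library is opened by hand, as in this thread. It assumes `libmkl_rt` sits under `/opt/intel/mkl/lib` (the path used above) and exports the `MKL_Set_Num_Threads` and `MKL_Get_Max_Threads` service functions; `set_mkl_threads` and `mkl_path` are hypothetical names. Setting the `MKL_NUM_THREADS` environment variable before starting Julia is another option.

```julia
using Libdl

# Assumption: MKL is installed under this directory, as earlier in the thread.
# find_library returns an empty string when nothing is found.
mkl_path = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"])

# Hypothetical helper: ask MKL (not OpenBLAS) to use `n` threads.
# Returns the thread count MKL reports afterwards, or `nothing` if MKL is missing.
function set_mkl_threads(lib::AbstractString, n::Integer)
    isempty(lib) && return nothing
    handle = Libdl.dlopen(lib)
    setter = Libdl.dlsym(handle, :MKL_Set_Num_Threads)
    getter = Libdl.dlsym(handle, :MKL_Get_Max_Threads)
    ccall(setter, Cvoid, (Cint,), n)
    return Int(ccall(getter, Cint, ()))
end

nthreads = set_mkl_threads(mkl_path, 32)
println(nthreads === nothing ? "libmkl_rt not found" : "MKL threads: $nthreads")
```

MKL may report fewer threads than requested, so printing the value it returns is a quick way to see what the `ccall` to `dgemm` will actually use.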

`mkl.c`:

``````
#include <stdio.h>
#include <stdlib.h>  /* calloc, free */
#include <math.h>
#include <mkl.h>
#include <omp.h>

int main()
{
    double *a, *b, *c;
    MKL_INT i, k, m, n;
    double alpha, beta;
    double wtime;

    alpha = 1.0;
    beta = 0.0;
    a = calloc(12870*11440, sizeof(double));
    b = calloc(11440*11440, sizeof(double));
    c = calloc(12870*11440, sizeof(double));
    m = 12870;
    n = 11440;
    k = 11440;
    /* fill A (m x k) and B (k x n) */
    for (i = 0; i < m*k; i++)
    {
        a[i] = 0.005*log(i+1);
    }
    for (i = 0; i < k*n; i++)
    {
        b[i] = sqrt(i)/sqrt(i+1);
    }
    /* time only the dgemm call: C = alpha*A*B + beta*C */
    wtime = omp_get_wtime();
    dgemm("N", "N", &m, &n, &k, &alpha, a, &m, b, &k, &beta, c, &m);
    wtime = omp_get_wtime() - wtime;
    printf("matrix mult took omp time=%lf\n", wtime);
    free(a);
    free(b);
    free(c);
    return 0;
}
``````

Running it:

``````
% gcc -O3 -I/opt/intel/mkl/include/ -L/opt/intel/mkl/lib -L/opt/intel/mkl/lib/intel64 foo.c -o mkl -lmkl_rt -liomp5 -lm
matrix mult took omp time=26.759417
``````

`mkl.jl`:

``````
using MKL_jll, LinearAlgebra
import LinearAlgebra: BlasInt

a = fill(0.0,(12870, 11440));
b = fill(0.0,(11440, 11440));
c = fill(0.0,(12870, 11440));
for i=1:12870*11440
a[i]=0.005*log(i)
end
for i=1:11440*11440
b[i]=sqrt(i-1)/sqrt(i)
end

@time mul!(c,a,b);

@time ccall(("dgemm", libmkl_rt), Cvoid,
(Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt},
Ref{BlasInt}, Ref{Float64}, Ptr{Float64}, Ref{BlasInt},
Ptr{Float64}, Ref{BlasInt}, Ref{Float64}, Ptr{Float64},
Ref{BlasInt}),
'N', 'N', 12870, 11440, 11440, 1.0, a, 12870, b, 11440, 0.0, c, 12870)
``````

Running it:

``````
% julia --project=. --startup-file=no -O2 mkl.jl
29.943452 seconds (2.70 M allocations: 130.031 MiB, 0.09% gc time)
25.751009 seconds (67 allocations: 4.359 KiB)
``````

Looks pretty much the same
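
When comparing numbers like these, keep in mind that the first timed call in a fresh session also includes compilation, which is likely what the 2.70 M allocations reported for `mul!` reflect. A small sketch of timing after a warm-up call, using deliberately tiny stand-in matrices rather than the sizes from this thread:

```julia
using LinearAlgebra

# Small stand-in sizes so the sketch runs quickly;
# the thread uses 12870x11440 and 11440x11440 matrices.
A = rand(200, 150)
B = rand(150, 150)
C = zeros(200, 150)

mul!(C, A, B)                # warm-up call so compilation is not part of the timing
t = @elapsed mul!(C, A, B)   # timed run
println("mul! took $t seconds")
```

The same pattern applies to the `ccall`: call it once, then time the second call.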


I am unable to add the MKL_jll and IntelOpenMP_jll packages to the workstation (as a user) so I use the Libdl option to find the path of the mkl library.

``````
using Libdl
const global libmkl_rt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"])
``````

In that case there is no improvement in the timing. On my desktop I am able to use `MKL_jll`, and it improves the performance. I think `const global libmkl_rt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"])` is causing the issue. How can I solve it?
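
One thing worth ruling out, as a sketch: `Libdl.find_library` returns an empty string when it cannot find the library, so checking the result and printing the resolved path shows which `libmkl_rt` is actually loaded. The directory is the one from this thread; `found` is just an illustrative name.

```julia
using Libdl

# Search the given directory first, then the default library search paths.
found = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"])

if isempty(found)
    println("libmkl_rt was not found in the given directory or the default search paths")
else
    # dlpath reports the full path of the shared library that gets loaded
    println("using MKL from: ", Libdl.dlpath(found))
end
```

If the printed path points at a different MKL installation than the one the C program links against, the two benchmarks are not comparing the same library.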

Why?

It shows this error:

``````
ERROR: Unable to automatically install 'IntelOpenMP' from '/home/j_tanu/.julia/packages/IntelOpenMP_jll/hsAKN/Artifacts.toml'
Stacktrace:
[1] error(::String) at ./error.jl:33
[6] add(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}, ::Array{Base.UUID,1}; preserve::Pkg.Types.PreserveLevel, platform::Pkg.BinaryPlatforms.Linux) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:1141
[7] add(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; preserve::Pkg.Types.PreserveLevel, platform::Pkg.BinaryPlatforms.Linux, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/API.jl:189
[11] do_cmd!(::Pkg.REPLMode.Command, ::REPL.LineEditREPL) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/REPLMode/REPLMode.jl:404
[12] do_cmd(::REPL.LineEditREPL, ::String; do_rethrow::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/REPLMode/REPLMode.jl:382
[13] do_cmd at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/REPLMode/REPLMode.jl:377 [inlined]
[14] (::Pkg.REPLMode.var"#24#27"{REPL.LineEditREPL,REPL.LineEdit.Prompt})(::REPL.LineEdit.MIState, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/REPLMode/REPLMode.jl:546
[15] #invokelatest#1 at ./essentials.jl:710 [inlined]
[16] invokelatest at ./essentials.jl:709 [inlined]
[17] run_interface(::REPL.Terminals.TextTerminal, ::REPL.LineEdit.ModalInterface, ::REPL.LineEdit.MIState) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/LineEdit.jl:2355
[18] run_frontend(::REPL.LineEditREPL, ::REPL.REPLBackendRef) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:1144

``````

It is a very old workstation and also I don’t have the sudo permission.

Even with these lines instead of `MKL_jll`, I can’t reproduce your slowdown. I don’t know why you have `global` there; it looks useless, but it doesn’t affect performance for me.

Are you behind a proxy/firewall? https://github.com/giordano/DebugArtifacts.jl might help in finding out why installation is failing

That isn’t needed at all


I think the problem is related to `curl`. I am getting this kind of error:

``````
julia> debug_artifact("IntelOpenMP_jll")
[ Info: Platform: Linux(:x86_64, libc=:glibc, compiler_abi=CompilerABI(libgfortran_version=v"4.0.0", cxxstring_abi=:cxx11))
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Opteron(TM) Processor 6274
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, bdver1)

ProcessFailedException(Base.Process[Process(`curl -C - '-#' -f -o /tmp/jl_DkQCSd/Artifacts.toml -L https://raw.githubusercontent.com/JuliaBinaryWrappers/IntelOpenMP_jll_jll.jl/master/Artifacts.toml`, ProcessExited(22))])
Stacktrace:
[1] error(::String) at ./error.jl:33
[3] (::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String})(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:57
[4] mktempdir(::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String}, ::String; prefix::String) at ./file.jl:682
[5] mktempdir at ./file.jl:680 [inlined] (repeats 2 times)
[6] debug_artifact(::String, ::Pkg.BinaryPlatforms.Linux) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:53
[7] debug_artifact(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:43
[8] top-level scope at REPL[3]:1
caused by [exception 1]
failed process: Process(`curl -C - '-#' -f -o /tmp/jl_DkQCSd/Artifacts.toml -L https://raw.githubusercontent.com/JuliaBinaryWrappers/IntelOpenMP_jll_jll.jl/master/Artifacts.toml`, ProcessExited(22)) [22]

Stacktrace:
[1] pipeline_error at ./process.jl:525 [inlined]
[2] run(::Cmd, ::Tuple{Base.DevNull,Base.TTY,Base.TTY}; wait::Bool) at ./process.jl:440
[3] run(::Cmd, ::Tuple{Base.DevNull,Base.TTY,Base.TTY}) at ./process.jl:438
[5] (::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String})(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:57
[6] mktempdir(::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String}, ::String; prefix::String) at ./file.jl:682
[7] mktempdir at ./file.jl:680 [inlined] (repeats 2 times)
[8] debug_artifact(::String, ::Pkg.BinaryPlatforms.Linux) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:53
[9] debug_artifact(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:43
[10] top-level scope at REPL[3]:1
``````

I have seen `global` used in some places, so I used it in `const global libmkl_rt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"])`.
The `MKL_jll` source code also uses `global` in some places.

There is an extra `_jll` in the name; remove it.

`global` is needed when the code is not in the global scope. In the global scope, like in your case, everything is already… global.
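
A small sketch of the difference, with made-up names (`counter`, `bump!`): at top level an assignment already creates a global, while inside a function the `global` keyword is needed to assign to it.

```julia
counter = 0          # top level: this is already a global, no keyword needed

function bump!()
    global counter   # inside a function, `global` is required to assign to it
    counter += 1
    return counter
end

bump!()
bump!()
println(counter)
```

So `const global libmkl_rt = ...` at top level behaves the same as `const libmkl_rt = ...`.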

Is it `IntelOpenMP`?
Then I get this error:

``````
julia> using DebugArtifacts

julia> debug_artifact("IntelOpenMP")
[ Info: Platform: Linux(:x86_64, libc=:glibc, compiler_abi=CompilerABI(libgfortran_version=v"4.0.0", cxxstring_abi=:cxx11))
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Opteron(TM) Processor 6274
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, bdver1)

[ Info: Probing curl as a possibility...
[ Info:   Probe successful for curl
[ Info: Probing for compression engine...
[ Info: Probing tar as a possibility...
[ Info:   Probe successful for tar
[ Info: Found compression engine tar
######################################################################## 100.0%
[ Info: Extracting artifact info for platform x86_64-linux-gnu-libgfortran4-cxx11...

curl: (35) SSL connect error
Stacktrace:
[1] error(::String) at ./error.jl:33
[5] (::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String})(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:65
[6] mktempdir(::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String}, ::String; prefix::String) at ./file.jl:682
[7] mktempdir at ./file.jl:680 [inlined] (repeats 2 times)
[8] debug_artifact(::String, ::Pkg.BinaryPlatforms.Linux) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:53
[9] debug_artifact(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:43
[10] top-level scope at REPL[2]:1
caused by [exception 1]

Stacktrace:
[1] pipeline_error at ./process.jl:525 [inlined]
[2] run(::Cmd, ::Tuple{Base.DevNull,Base.TTY,Base.TTY}; wait::Bool) at ./process.jl:440
[3] run(::Cmd, ::Tuple{Base.DevNull,Base.TTY,Base.TTY}) at ./process.jl:438
[7] (::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String})(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:65
[8] mktempdir(::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String}, ::String; prefix::String) at ./file.jl:682
[9] mktempdir at ./file.jl:680 [inlined] (repeats 2 times)
[10] debug_artifact(::String, ::Pkg.BinaryPlatforms.Linux) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:53
[11] debug_artifact(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:43
[12] top-level scope at REPL[2]:1

``````

What version of `curl` is that? This answer on StackOverflow suggests that this error can occur with very old versions of `curl`


As you mentioned earlier, you are getting better performance without using `MKL_jll`. I am also very confused about the slowdown of the MKL `dgemm` call.

Did I?

Edit: do you refer to the fact I get the same performance when using `MKL_jll` and manually opening the library, which is also the same as the C program? I may have misunderstood the “better performance”

I am referring to the fact that you are getting the same performance when using `MKL_jll` and when manually opening the library. In your case, `MKL_jll` shows better performance than the OpenBLAS `mul!(C,A,B)`.
