Ccall performance

Hi,
I am multiplying two matrices, A (12870×11440) and B (11440×11440), using MKL's dgemm. In C, the whole calculation takes about 20 s, but when I call dgemm from Julia via ccall it takes 42.274859 s. The built-in mul!(c,a,b) does the same calculation in 21.586031 s. I can’t work out where the extra 20 s come from or how to get rid of them. I am attaching part of my code.

julia> @time mul!(c,a,b);
21.586031 seconds
julia> @time ccall(("dgemm", libmkl_rt), Cvoid,
                       (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt},
                        Ref{BlasInt}, Ref{Float64}, Ptr{Float64}, Ref{BlasInt},
                        Ptr{Float64}, Ref{BlasInt}, Ref{Float64}, Ptr{Float64},
                        Ref{BlasInt}),
                        'N', 'N', 12870, 11440, 11440, 1.0, a, 12870, b, 11440, 0.0, c, 12870)
42.274859 seconds (3 allocations: 48 bytes)
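
For scale, a dense (m×k)·(k×n) product costs roughly 2·m·n·k floating-point operations, so the timings above can be turned into throughput figures that are easier to compare across machines. A minimal sketch; the helper names gemm_flops and gemm_gflops are made up for illustration and are not part of any library:

```julia
# Hypothetical helpers: standard 2*m*n*k operation count for a dense GEMM,
# and the resulting throughput for a given wall time in seconds.
gemm_flops(m, n, k) = 2 * m * n * k
gemm_gflops(m, n, k, seconds) = gemm_flops(m, n, k) / seconds / 1e9

m, n, k = 12870, 11440, 11440

# Wall times quoted in this post
for (label, t) in (("mul!", 21.586031), ("ccall dgemm", 42.274859))
    println(rpad(label, 12), round(gemm_gflops(m, n, k, t), digits=1), " GFLOP/s")
end
```

A factor of two in throughput between two BLAS calls on the same data usually points at configuration (threads, library variant) rather than at ccall overhead, which is tiny for a single call.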

As a general remark, adding the full code needed to reproduce an issue greatly increases the chances that someone will be willing to help you :wink:

In this case, I can’t reproduce your issue:

julia> using MKL_jll, LinearAlgebra

julia> import LinearAlgebra: BlasInt

julia> A = rand(12870, 11440);

julia> B = rand(11440, 11440);

julia> C = Matrix{Float64}(undef, 12870, 11440);

julia> @time mul!(C, A, B);
 33.708089 seconds (2.70 M allocations: 129.942 MiB)

julia> @time ccall(("dgemm", libmkl_rt), Cvoid,
                       (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt},
                        Ref{BlasInt}, Ref{Float64}, Ptr{Float64}, Ref{BlasInt},
                        Ptr{Float64}, Ref{BlasInt}, Ref{Float64}, Ptr{Float64},
                        Ref{BlasInt}),
                        'N', 'N', 12870, 11440, 11440, 1.0, A, 12870, B, 11440, 0.0, C, 12870)
 29.913322 seconds (67 allocations: 4.359 KiB)

You didn’t provide the C code you used for your benchmark, and I’m not going to spend time writing it on my own :slightly_smiling_face:

OK, I am attaching the full code.

#include <stdio.h>
#include <stdlib.h>  /* calloc, free */
#include <math.h>
#include <mkl.h>
#include <omp.h>

int main(void)
{
    double *a, *b, *c;
    MKL_INT i, k, m, n;
    double alpha, beta;
    double wtime;

    alpha = 1.0;
    beta = 0.0;
    a = calloc(12870 * 11440, sizeof(double));
    b = calloc(11440 * 11440, sizeof(double));
    c = calloc(12870 * 11440, sizeof(double));
    m = 12870;
    n = 11440;
    k = 11440;

    /* Fill A (m x k) and B (k x n) with deterministic values */
    for (i = 0; i < m * k; i++) {
        a[i] = 0.005 * log(i + 1);
    }
    for (i = 0; i < k * n; i++) {
        b[i] = sqrt(i) / sqrt(i + 1);
    }

    /* Time only the dgemm call: C = alpha*A*B + beta*C */
    wtime = omp_get_wtime();
    dgemm("N", "N", &m, &n, &k, &alpha, a, &m, b, &k, &beta, c, &m);
    wtime = omp_get_wtime() - wtime;
    printf("matrix mult took omp time=%lf\n", wtime);

    free(a);
    free(b);
    free(c);
    return 0;
}

My output is:
matrix mult took omp time=13.502873
If I set OMP_NUM_THREADS to 32:
export OMP_NUM_THREADS=32
matrix mult took omp time=18.259101
And in Julia:

julia> using LinearAlgebra,Libdl

julia> import LinearAlgebra: BlasInt
julia> const global libmkl_rt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"]);
julia> a = fill(0.0,(12870, 11440));
julia> b = fill(0.0,(11440, 11440));
julia> c = fill(0.0,(12870, 11440));
julia> for i=1:12870*11440
       a[i]=0.005*log(i)
       end
julia> for i=1:11440*11440
       b[i]=sqrt(i-1)/sqrt(i)
       end
julia> BLAS.set_num_threads(32)
julia> @time mul!(c,a,b);
21.586031 seconds
julia>  @time ccall(("dgemm", libmkl_rt), Cvoid,
                       (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt},
                        Ref{BlasInt}, Ref{Float64}, Ptr{Float64}, Ref{BlasInt},
                        Ptr{Float64}, Ref{BlasInt}, Ref{Float64}, Ptr{Float64},
                        Ref{BlasInt}),
                        'N', 'N', 12870, 11440, 11440, 1.0, a, 12870, b, 11440, 0.0, c, 12870)
42.274859 seconds (3 allocations: 48 bytes)               

I don’t think BLAS.set_num_threads(32) affects your ccall to MKL?

No.
In Julia, OpenBLAS is the default BLAS library, so BLAS.set_num_threads(32) changes only the number of OpenBLAS threads. It does not change the number of MKL threads used by the ccall.
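
Since the thread count for a directly loaded MKL has to be set on MKL itself, one option is to call MKL's own thread-control entry point. The sketch below is an illustration under assumptions, not a tested fix for this workstation: the directories are taken from the gcc command in this thread, and set_mkl_threads is a made-up helper name. MKL_Set_Num_Threads is the C symbol that Julia's own BLAS wrapper used when built against MKL.

```julia
using Libdl

# Assumption: MKL lives in one of these directories (taken from the gcc
# command in this thread). Adjust for your installation.
mkl_dirs = ["/opt/intel/mkl/lib", "/opt/intel/mkl/lib/intel64"]
mkl_path = Libdl.find_library(["libmkl_rt"], mkl_dirs)   # "" when not found

# Hypothetical helper: ask the MKL library at `path` to use `n` threads.
# Returns false when no library path is given.
function set_mkl_threads(n::Integer, path::AbstractString)
    isempty(path) && return false
    handle = Libdl.dlopen(path)
    # MKL_Set_Num_Threads is MKL's C entry point; it takes the count by value.
    fptr = Libdl.dlsym(handle, :MKL_Set_Num_Threads)
    ccall(fptr, Cvoid, (Cint,), n)
    return true
end

if set_mkl_threads(32, mkl_path)
    println("asked MKL for 32 threads")
else
    println("libmkl_rt not found in ", mkl_dirs)
end
```

The other common route is to export OMP_NUM_THREADS (or MKL_NUM_THREADS) in the shell before starting Julia, as was done for the C program. Whether changing ENV from inside a running session still takes effect depends on when the threading runtime reads it, so setting it before launch is the safer choice.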

mkl.c:

#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<mkl.h>
#include<omp.h>

int main()
{
    double *a, *b,*c;
    MKL_INT i,k,m,n;
    double alpha,beta;
    double wtime;
    alpha=1.0;
    beta=0.0;
    a=calloc(12870*11440,sizeof(double));
    b=calloc(11440*11440,sizeof(double));
    c=calloc(12870*11440,sizeof(double));
    m=12870;
    n=11440;
    k=11440;
    for (i=0;i<m*k;i++)
        {
            a[i]=0.005*log(i+1);
        }
    for(i=0;i<k*n;i++)
        {
            b[i]=sqrt(i)/sqrt(i+1);
        }
    wtime=omp_get_wtime();
    dgemm("N","N",&m,&n,&k,&alpha,a,&m,b,&k,&beta,c,&m);
    wtime=omp_get_wtime() - wtime;
    printf("matrix mult took omp time=%lf\n",wtime);
    free(a);
    free(b);
    free(c);
    return 0;
}

Running it:

% gcc -O3 -I/opt/intel/mkl/include/ -L/opt/intel/mkl/lib -L/opt/intel/mkl/lib/intel64 mkl.c -o mkl -lmkl_rt -liomp5 -lm
% OMP_NUM_THREADS=32 ./mkl
matrix mult took omp time=26.759417

mkl.jl:

using MKL_jll, LinearAlgebra
import LinearAlgebra: BlasInt

a = fill(0.0,(12870, 11440));
b = fill(0.0,(11440, 11440));
c = fill(0.0,(12870, 11440));
for i=1:12870*11440
    a[i]=0.005*log(i)
end
for i=1:11440*11440
    b[i]=sqrt(i-1)/sqrt(i)
end

BLAS.set_num_threads(32)
@time mul!(c,a,b);

ENV["OMP_NUM_THREADS"] = "32"
@time ccall(("dgemm", libmkl_rt), Cvoid,
            (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt},
             Ref{BlasInt}, Ref{Float64}, Ptr{Float64}, Ref{BlasInt},
             Ptr{Float64}, Ref{BlasInt}, Ref{Float64}, Ptr{Float64},
             Ref{BlasInt}),
            'N', 'N', 12870, 11440, 11440, 1.0, a, 12870, b, 11440, 0.0, c, 12870)

Running it:

% julia --project=. --startup-file=no -O2 mkl.jl
 29.943452 seconds (2.70 M allocations: 130.031 MiB, 0.09% gc time)
 25.751009 seconds (67 allocations: 4.359 KiB)

Looks pretty much the same
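
One caveat when reading @time numbers like these: the first call of a function in a session includes JIT compilation, which is probably where the 2.70 M allocations reported for mul! come from. On a 20 to 30 s multiply that overhead is likely small, but it is easy to exclude. A minimal sketch with small matrices (the sizes are arbitrary, chosen only so it runs quickly):

```julia
using LinearAlgebra

A = rand(200, 150)
B = rand(150, 150)
C = Matrix{Float64}(undef, 200, 150)

mul!(C, A, B)                 # warm-up call: triggers compilation
t = @elapsed mul!(C, A, B)    # timed call: measures the multiply itself
println("warm mul! took ", t, " s")

# Sanity check against the allocating product
println(C ≈ A * B)
```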

I am unable to add the MKL_jll and IntelOpenMP_jll packages on the workstation (as a user), so I use Libdl to find the path of the MKL library.

using Libdl
const global libmkl_rt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"])

In that case there is no improvement in the timing. On my desktop I am able to use MKL_jll, and it improves the performance. I think const global libmkl_rt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"]) is causing the issue. How can I solve it?
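
Before blaming find_library, it may help to check what it actually resolved to: it returns an empty string when nothing is found, and otherwise the library may come from a different directory than expected. A minimal sketch, assuming the two directories from the gcc command in this thread; describe_library is a made-up helper name:

```julia
using Libdl

# Hypothetical helper: report which file a library name resolves to and
# whether it exports `symbol`. Returns `nothing` if the library is not found.
function describe_library(names, dirs, symbol::Symbol)
    name = Libdl.find_library(names, dirs)   # "" when nothing matches
    isempty(name) && return nothing
    handle = Libdl.dlopen(name)
    found = Libdl.dlsym(handle, symbol; throw_error=false) !== nothing
    return (path = Libdl.dlpath(handle), has_symbol = found)
end

# Assumption: both directories from the gcc command in this thread.
dirs = ["/opt/intel/mkl/lib", "/opt/intel/mkl/lib/intel64"]
info = describe_library(["libmkl_rt"], dirs, :dgemm)
println(info === nothing ? "libmkl_rt not found in $dirs" : info)
```

If the reported path is the same libmkl_rt that the C program links against, the library lookup itself is unlikely to explain a timing difference.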

Why?

It shows this error:

ERROR: Unable to automatically install 'IntelOpenMP' from '/home/j_tanu/.julia/packages/IntelOpenMP_jll/hsAKN/Artifacts.toml'
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] ensure_artifact_installed(::String, ::Dict{String,Any}, ::String; platform::Pkg.BinaryPlatforms.Platform, verbose::Bool, quiet_download::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/Artifacts.jl:898
[3] ensure_all_artifacts_installed(::String; platform::Pkg.BinaryPlatforms.Platform, pkg_uuid::Nothing, include_lazy::Bool, verbose::Bool, quiet_download::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/Artifacts.jl:962
[4] download_artifacts(::Pkg.Types.Context, ::Array{String,1}; platform::Pkg.BinaryPlatforms.Linux, verbose::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:663
[5] download_artifacts(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; platform::Pkg.BinaryPlatforms.Linux, verbose::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:642
[6] add(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}, ::Array{Base.UUID,1}; preserve::Pkg.Types.PreserveLevel, platform::Pkg.BinaryPlatforms.Linux) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:1141
[7] add(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; preserve::Pkg.Types.PreserveLevel, platform::Pkg.BinaryPlatforms.Linux, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/API.jl:189
[8] add(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/API.jl:140
[9] #add#21 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/API.jl:67 [inlined]
[10] add(::Array{Pkg.Types.PackageSpec,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/API.jl:67
[11] do_cmd!(::Pkg.REPLMode.Command, ::REPL.LineEditREPL) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/REPLMode/REPLMode.jl:404
[12] do_cmd(::REPL.LineEditREPL, ::String; do_rethrow::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/REPLMode/REPLMode.jl:382
[13] do_cmd at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/REPLMode/REPLMode.jl:377 [inlined]
[14] (::Pkg.REPLMode.var"#24#27"{REPL.LineEditREPL,REPL.LineEdit.Prompt})(::REPL.LineEdit.MIState, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/REPLMode/REPLMode.jl:546
[15] #invokelatest#1 at ./essentials.jl:710 [inlined]
[16] invokelatest at ./essentials.jl:709 [inlined]
[17] run_interface(::REPL.Terminals.TextTerminal, ::REPL.LineEdit.ModalInterface, ::REPL.LineEdit.MIState) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/LineEdit.jl:2355
[18] run_frontend(::REPL.LineEditREPL, ::REPL.REPLBackendRef) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:1144
[19] (::REPL.var"#38#42"{REPL.LineEditREPL,REPL.REPLBackendRef})() at ./task.jl:356

It is a very old workstation, and I don’t have sudo permission.

Even with the Libdl.find_library lines instead of using MKL_jll, I can’t reproduce your slowdown. I don’t know why you have global there; it looks redundant, but it doesn’t affect performance for me.

Are you behind a proxy/firewall? https://github.com/giordano/DebugArtifacts.jl might help in finding out why installation is failing

Sudo permission isn’t needed at all.

I think the problem is related to curl. I am getting this kind of error:

julia> debug_artifact("IntelOpenMP_jll")
[ Info: Platform: Linux(:x86_64, libc=:glibc, compiler_abi=CompilerABI(libgfortran_version=v"4.0.0", cxxstring_abi=:cxx11))
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Opteron(TM) Processor 6274                 
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, bdver1)

[ Info: Downloading Artifacts.toml to /tmp/jl_DkQCSd/Artifacts.toml...

curl: (22) The requested URL returned error: 404 Not Found
ERROR: Could not download https://raw.githubusercontent.com/JuliaBinaryWrappers/IntelOpenMP_jll_jll.jl/master/Artifacts.toml to /tmp/jl_DkQCSd/Artifacts.toml:
ProcessFailedException(Base.Process[Process(`curl -C - '-#' -f -o /tmp/jl_DkQCSd/Artifacts.toml -L https://raw.githubusercontent.com/JuliaBinaryWrappers/IntelOpenMP_jll_jll.jl/master/Artifacts.toml`, ProcessExited(22))])
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] download(::String, ::String; verbose::Bool, auth_header::Nothing) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/PlatformEngines.jl:822
 [3] (::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String})(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:57
 [4] mktempdir(::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String}, ::String; prefix::String) at ./file.jl:682
 [5] mktempdir at ./file.jl:680 [inlined] (repeats 2 times)
 [6] debug_artifact(::String, ::Pkg.BinaryPlatforms.Linux) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:53
 [7] debug_artifact(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:43
 [8] top-level scope at REPL[3]:1
caused by [exception 1]
failed process: Process(`curl -C - '-#' -f -o /tmp/jl_DkQCSd/Artifacts.toml -L https://raw.githubusercontent.com/JuliaBinaryWrappers/IntelOpenMP_jll_jll.jl/master/Artifacts.toml`, ProcessExited(22)) [22]

Stacktrace:
 [1] pipeline_error at ./process.jl:525 [inlined]
 [2] run(::Cmd, ::Tuple{Base.DevNull,Base.TTY,Base.TTY}; wait::Bool) at ./process.jl:440
 [3] run(::Cmd, ::Tuple{Base.DevNull,Base.TTY,Base.TTY}) at ./process.jl:438
 [4] download(::String, ::String; verbose::Bool, auth_header::Nothing) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/PlatformEngines.jl:817
 [5] (::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String})(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:57
 [6] mktempdir(::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String}, ::String; prefix::String) at ./file.jl:682
 [7] mktempdir at ./file.jl:680 [inlined] (repeats 2 times)
 [8] debug_artifact(::String, ::Pkg.BinaryPlatforms.Linux) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:53
 [9] debug_artifact(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:43
 [10] top-level scope at REPL[3]:1

I have seen global used in some places, so I wrote const global libmkl_rt = Libdl.find_library(["libmkl_rt"], ["/opt/intel/mkl/lib"]).
The MKL_jll source code also uses global in some places.

There is an extra _jll in the name (the URL ends up as IntelOpenMP_jll_jll); remove it.

global is needed when the code is not in the global scope, for example inside a function. In the global scope, as in your case, everything is already… global.
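
A minimal sketch of the difference; all names here (lib_name_a, lib_name_b, last_path, remember_path!) are made up for illustration:

```julia
# At top level, `const` alone already defines a global constant.
const lib_name_a = "libmkl_rt"

# Equivalent at top level: the extra `global` changes nothing here.
const global lib_name_b = "libmkl_rt"

# Inside a function, assignment creates a local unless marked `global`.
function remember_path!(p)
    global last_path = p    # without `global`, `last_path` would be local
    return nothing
end

remember_path!("/opt/intel/mkl/lib")
println(lib_name_a == lib_name_b)
println(last_path)
```

This is presumably why generated wrapper packages write global inside their initialization functions: there the assignment happens in a local scope.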

Is it IntelOpenMP?
With that I am getting this error:

julia> using DebugArtifacts

julia> debug_artifact("IntelOpenMP")
[ Info: Platform: Linux(:x86_64, libc=:glibc, compiler_abi=CompilerABI(libgfortran_version=v"4.0.0", cxxstring_abi=:cxx11))
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Opteron(TM) Processor 6274                 
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, bdver1)

[ Info: Probing for download engine...
[ Info: Probing curl as a possibility...
[ Info:   Probe successful for curl
[ Info: Found download engine curl
[ Info: Probing for compression engine...
[ Info: Probing tar as a possibility...
[ Info:   Probe successful for tar
[ Info: Found compression engine tar
[ Info: Downloading Artifacts.toml to /tmp/jl_5Q6tZf/Artifacts.toml...
######################################################################## 100.0%
[ Info: Extracting artifact info for platform x86_64-linux-gnu-libgfortran4-cxx11...
[ Info: Found meta object with git-tree-sha1 b1a56c9d3370406815cbf264d45bb4d2c3f00047, attempting download...

curl: (35) SSL connect error
ERROR: Could not download https://github.com/JuliaBinaryWrappers/IntelOpenMP_jll.jl/releases/download/IntelOpenMP-v2018.0.3+0/IntelOpenMP.v2018.0.3.x86_64-linux-gnu.tar.gz to /tmp/jl_wrjKiL-download.gz:
ProcessFailedException(Base.Process[Process(`curl -C - '-#' -f -o /tmp/jl_wrjKiL-download.gz -L https://github.com/JuliaBinaryWrappers/IntelOpenMP_jll.jl/releases/download/IntelOpenMP-v2018.0.3+0/IntelOpenMP.v2018.0.3.x86_64-linux-gnu.tar.gz`, ProcessExited(35))])
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] download(::String, ::String; verbose::Bool, auth_header::Nothing) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/PlatformEngines.jl:822
 [3] download_verify(::String, ::String, ::String; verbose::Bool, force::Bool, quiet_download::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/PlatformEngines.jl:883
 [4] download_verify_unpack(::String, ::String, ::String; tarball_path::Nothing, ignore_existence::Bool, force::Bool, verbose::Bool, quiet_download::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/PlatformEngines.jl:1121
 [5] (::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String})(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:65
 [6] mktempdir(::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String}, ::String; prefix::String) at ./file.jl:682
 [7] mktempdir at ./file.jl:680 [inlined] (repeats 2 times)
 [8] debug_artifact(::String, ::Pkg.BinaryPlatforms.Linux) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:53
 [9] debug_artifact(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:43
 [10] top-level scope at REPL[2]:1
caused by [exception 1]
failed process: Process(`curl -C - '-#' -f -o /tmp/jl_wrjKiL-download.gz -L https://github.com/JuliaBinaryWrappers/IntelOpenMP_jll.jl/releases/download/IntelOpenMP-v2018.0.3+0/IntelOpenMP.v2018.0.3.x86_64-linux-gnu.tar.gz`, ProcessExited(35)) [35]

Stacktrace:
 [1] pipeline_error at ./process.jl:525 [inlined]
 [2] run(::Cmd, ::Tuple{Base.DevNull,Base.TTY,Base.TTY}; wait::Bool) at ./process.jl:440
 [3] run(::Cmd, ::Tuple{Base.DevNull,Base.TTY,Base.TTY}) at ./process.jl:438
 [4] download(::String, ::String; verbose::Bool, auth_header::Nothing) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/PlatformEngines.jl:817
 [5] download_verify(::String, ::String, ::String; verbose::Bool, force::Bool, quiet_download::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/PlatformEngines.jl:883
 [6] download_verify_unpack(::String, ::String, ::String; tarball_path::Nothing, ignore_existence::Bool, force::Bool, verbose::Bool, quiet_download::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/PlatformEngines.jl:1121
 [7] (::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String})(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:65
 [8] mktempdir(::DebugArtifacts.var"#3#4"{String,Pkg.BinaryPlatforms.Linux,String}, ::String; prefix::String) at ./file.jl:682
 [9] mktempdir at ./file.jl:680 [inlined] (repeats 2 times)
 [10] debug_artifact(::String, ::Pkg.BinaryPlatforms.Linux) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:53
 [11] debug_artifact(::String) at /home/j_tanu/.julia/packages/DebugArtifacts/VcVh2/src/DebugArtifacts.jl:43
 [12] top-level scope at REPL[2]:1

What version of curl is that? This answer on Stack Overflow suggests that this error can occur with very old versions of curl.
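
The stack trace shows Pkg shelling out to the system curl, so it is the system's curl version that matters here. A minimal sketch for checking it from within Julia, assuming Sys.which finds the same executable that Pkg invokes:

```julia
# Locate the curl executable that a Julia session would find on PATH.
curl = Sys.which("curl")

if curl === nothing
    println("curl not found on PATH")
else
    # The first line of `curl --version` carries the version number.
    println(first(split(read(`$curl --version`, String), '\n')))
end
```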

As you mentioned earlier, you are getting better performance without using MKL_jll. I am also very confused about the slowdown of MKL's dgemm.

Did I? :flushed:

Edit: do you refer to the fact I get the same performance when using MKL_jll and manually opening the library, which is also the same as the C program? I may have misunderstood the “better performance”

:grinning: :grinning:
I am referring to the fact that you get the same performance when using MKL_jll and when opening the library manually. In your case, MKL_jll shows better performance than the OpenBLAS mul!(C,A,B).
