It has been found to be a logic problem, and the timing in the MWE below is due to incorrect measurement. Some multi-threading issue may exist, but that is a completely different story and should be discussed in another thread.
An MWE:
julia> a = zeros(5);
julia> function _check_ccall(a)
           ccall(
               (:sum, "libexample"), Cdouble,
               (Ptr{Cdouble},),
               a
           )
       end
_check_ccall (generic function with 1 method)
julia> @code_warntype _check_ccall(a)
MethodInstance for _check_ccall(::Vector{Float64})
from _check_ccall(a) @ Main REPL[2]:1
Arguments
#self#::Core.Const(_check_ccall)
a::Vector{Float64}
Body::Float64
1 ─ %1 = Core.apply_type(Main.Ptr, Main.Cdouble)::Core.Const(Ptr{Float64})
│ %2 = Base.cconvert(%1, a)::Vector{Float64}
│ %3 = Core.apply_type(Main.Ptr, Main.Cdouble)::Core.Const(Ptr{Float64})
│ %4 = Base.unsafe_convert(%3, %2)::Ptr{Float64}
│ %5 = $(Expr(:foreigncall, :(Core.tuple(:sum, "libexample")), Float64, svec(Ptr{Float64}), 0, :(:ccall), :(%4), :(%2)))::Float64
└── return %5
We notice the unsafe_convert call; let's examine its cost:
julia> using BenchmarkTools
julia> @btime Base.unsafe_convert(Core.apply_type(Ptr, Cdouble), a)
107.665 ns (1 allocation: 16 bytes)
Ptr{Float64} @0x00007f7de9ae8160
By contrast, the Base.cconvert call costs only a few nanoseconds and can be ignored.
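(As noted in the edit at the top, this measurement is misleading: `a` is an untyped global, and `@btime` without `$`-interpolation includes dynamic-dispatch and boxing overhead in the reported time. A sketch of the corrected measurement; the exact numbers will vary by machine, but with interpolation the call should drop to a few nanoseconds with no allocation:

julia> using BenchmarkTools

julia> a = zeros(5);

julia> @btime Base.unsafe_convert(Ptr{Cdouble}, $a);  # `$a` interpolates the global into the benchmark

)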
In my actual use case, I need to perform a ccall about 40 million times. My code takes about 130 s, while the corresponding Python code takes about 110 s. The difference is about 500 ns per ccall, and since my ccall has six unsafe_convert calls, such a difference would be well explained.
The problem is: how can I get rid of such overheads? The arrays I use will not change, so at first I thought I could perform the conversion manually. However, this has no effect:
julia> function _check_ccall2(a)
           b = Base.unsafe_convert(Core.apply_type(Ptr, Cdouble), a)
           ccall(
               (:sum, "libexample"), Cdouble,
               (Ptr{Cdouble},),
               b
           )
       end
end
_check_ccall2 (generic function with 1 method)
julia> @code_warntype _check_ccall2(a)
MethodInstance for _check_ccall2(::Vector{Float64})
from _check_ccall2(a) @ Main REPL[13]:1
Arguments
#self#::Core.Const(_check_ccall2)
a::Vector{Float64}
Locals
b::Ptr{Float64}
Body::Float64
1 ─ %1 = Base.unsafe_convert::Core.Const(Base.unsafe_convert)
│ %2 = Core.apply_type::Core.Const(Core.apply_type)
│ %3 = Main.Ptr::Core.Const(Ptr)
│ %4 = (%2)(%3, Main.Cdouble)::Core.Const(Ptr{Float64})
│ (b = (%1)(%4, a))
│ %6 = Core.apply_type(Main.Ptr, Main.Cdouble)::Core.Const(Ptr{Float64})
│ %7 = Base.cconvert(%6, b)::Ptr{Float64}
│ %8 = Core.apply_type(Main.Ptr, Main.Cdouble)::Core.Const(Ptr{Float64})
│ %9 = Base.unsafe_convert(%8, %7)::Ptr{Float64}
│ %10 = $(Expr(:foreigncall, :(Core.tuple(:sum, "libexample")), Float64, svec(Ptr{Float64}), 0, :(:ccall), :(%9), :(%7)))::Float64
└── return %10
Is there any way to get rid of such overheads? Without doing so, it seems impossible to be as efficient as the Python code: these ccalls cost only a few microseconds each, but they are called so many times that they contribute about 80% of the total run time.
A quick and dirty fix would be to write my own C function that wraps the loop making these calls; then my Julia code would only need to ccall the wrapper a few times. From my point of view, however, such a solution is far from satisfactory: if I have to write C code after all, why bother using Julia?
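(One pure-Julia alternative worth noting, as a sketch rather than a tested answer: hoist the conversion out of the hot loop yourself. When the value passed to ccall is already a `Ptr{Cdouble}`, both cconvert and unsafe_convert are identity operations, so obtaining the pointer once with `pointer` and keeping the array rooted with `GC.@preserve` avoids any per-call conversion work. The loop structure and call count here are illustrative, not from the original code:

julia> function sum_many(a::Vector{Float64}, n::Integer)
           total = 0.0
           GC.@preserve a begin            # keep `a` alive while we hold its raw pointer
               p = pointer(a)              # Ptr{Float64} == Ptr{Cdouble}; computed once
               for _ in 1:n
                   # cconvert/unsafe_convert are no-ops on a Ptr argument
                   total += ccall((:sum, "libexample"), Cdouble, (Ptr{Cdouble},), p)
               end
           end
           return total
       end

The `GC.@preserve` block is essential: a raw `Ptr` does not root the array, so without it the garbage collector could free or move `a` while the loop is still using `p`.)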