(misinterpreted and does not exist) Overhead in ccall due to unsafe_convert of pointers

This has turned out to be a logic problem, and the MWE result was due to incorrect measurement. Some multi-threading issue may exist, but that is a completely different story and should be discussed in another thread.

A MWE:

julia> a = zeros(5);

julia> function _check_ccall(a)
           ccall(
               (:sum, "libexample"), Cdouble,
               (Ptr{Cdouble},),
               a
           )
       end
_check_ccall (generic function with 1 method)

julia> @code_warntype _check_ccall(a)
MethodInstance for _check_ccall(::Vector{Float64})
  from _check_ccall(a) @ Main REPL[2]:1
Arguments
  #self#::Core.Const(_check_ccall)
  a::Vector{Float64}
Body::Float64
1 ─ %1 = Core.apply_type(Main.Ptr, Main.Cdouble)::Core.Const(Ptr{Float64})
│   %2 = Base.cconvert(%1, a)::Vector{Float64}
│   %3 = Core.apply_type(Main.Ptr, Main.Cdouble)::Core.Const(Ptr{Float64})
│   %4 = Base.unsafe_convert(%3, %2)::Ptr{Float64}
│   %5 = $(Expr(:foreigncall, :(Core.tuple(:sum, "libexample")), Float64, svec(Ptr{Float64}), 0, :(:ccall), :(%4), :(%2)))::Float64
└──      return %5
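(For context: ccall always lowers to this cconvert/unsafe_convert pair. cconvert returns an object that keeps the array rooted for the GC across the foreign call, and unsafe_convert extracts the raw pointer from it. A minimal sketch of the same mechanics, using cos from the C math library so it runs without libexample:)

```julia
a = [1.0, 2.0, 3.0]

# What `ccall` does behind the scenes for a Ptr{Cdouble} argument:
guard = Base.cconvert(Ptr{Cdouble}, a)             # object kept alive across the call
ptr   = Base.unsafe_convert(Ptr{Cdouble}, guard)   # raw pointer into the array
ptr == pointer(a)                                  # true

# A runnable ccall against a real C function available in the process:
ccall(:cos, Cdouble, (Cdouble,), 0.0)              # returns 1.0
```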

Notice the unsafe_convert call; let's examine its cost:

julia> using BenchmarkTools

julia> @btime Base.unsafe_convert(Core.apply_type(Ptr, Cdouble), a)
  107.665 ns (1 allocation: 16 bytes)
Ptr{Float64} @0x00007f7de9ae8160

By contrast, the Base.cconvert call only costs a few nanoseconds and can be ignored.

In my actual use case, I need to perform a ccall about 40 million times. My code takes about 130 s, while the corresponding Python code takes about 110 s. The difference is about 500 ns per ccall, and since my ccall has 6 unsafe_converts, that would explain the gap.

The problem is: how can I get rid of this overhead? The arrays I use never change, so at first I thought I could perform the conversion manually myself. However, this has no effect:

julia> function _check_ccall2(a)
           b = Base.unsafe_convert(Core.apply_type(Ptr, Cdouble), a)
           ccall(
               (:sum, "libexample"), Cdouble,
               (Ptr{Cdouble},),
               b
           )
       end
_check_ccall2 (generic function with 1 method)

julia> @code_warntype _check_ccall2(a)
MethodInstance for _check_ccall2(::Vector{Float64})
  from _check_ccall2(a) @ Main REPL[13]:1
Arguments
  #self#::Core.Const(_check_ccall2)
  a::Vector{Float64}
Locals
  b::Ptr{Float64}
Body::Float64
1 ─ %1  = Base.unsafe_convert::Core.Const(Base.unsafe_convert)
│   %2  = Core.apply_type::Core.Const(Core.apply_type)
│   %3  = Main.Ptr::Core.Const(Ptr)
│   %4  = (%2)(%3, Main.Cdouble)::Core.Const(Ptr{Float64})
│         (b = (%1)(%4, a))
│   %6  = Core.apply_type(Main.Ptr, Main.Cdouble)::Core.Const(Ptr{Float64})
│   %7  = Base.cconvert(%6, b)::Ptr{Float64}
│   %8  = Core.apply_type(Main.Ptr, Main.Cdouble)::Core.Const(Ptr{Float64})
│   %9  = Base.unsafe_convert(%8, %7)::Ptr{Float64}
│   %10 = $(Expr(:foreigncall, :(Core.tuple(:sum, "libexample")), Float64, svec(Ptr{Float64}), 0, :(:ccall), :(%9), :(%7)))::Float64
└──       return %10

Is there any way to get rid of this overhead? Otherwise it seems impossible to match the efficiency of the Python code, since these ccalls, which cost only a few microseconds each but are called so many times, contribute about 80% of the total runtime.
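(Note for later readers: cconvert and unsafe_convert are identity no-ops when the argument is already a Ptr, so the calls that remain in _check_ccall2's IR cost nothing. If the arrays truly never change, the pointer can also be hoisted out of the hot loop explicitly, provided the array is kept alive with GC.@preserve. A hedged sketch, using memset from libc as a stand-in for libexample's sum:)

```julia
a = ones(5)

GC.@preserve a begin        # keep `a` rooted while we hold a raw pointer
    p = pointer(a)          # converted once, outside the hot loop
    for _ in 1:1_000
        # Stand-in for the real library call; zeroes the buffer.
        ccall(:memset, Ptr{Cvoid}, (Ptr{Cvoid}, Cint, Csize_t), p, 0, sizeof(a))
    end
end

all(iszero, a)              # true: memset wrote zero bytes
```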


A quick and dirty way may be to write my own C wrapper around the for-loop that performs these calls. Then, in my Julia code, I would only need to ccall my wrapper function a few times. However, in my view, such a solution is far from satisfactory. If I have to write C code after all, why bother using Julia?

I strongly suspect that you’re not measuring what you think you’re measuring. For example, the “allocation” you’re seeing in your benchmark for unsafe_convert is very likely due to a being an untyped, non-const global variable, and not due to unsafe_convert. In particular, consider the first two performance tips from the docs.
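(Concretely, the fix is to interpolate the global with $ so BenchmarkTools measures the call itself rather than the dynamic global lookup; with that, unsafe_convert is just a pointer read. A sketch:)

```julia
# With BenchmarkTools, interpolate the global with `$`:
#   @btime Base.unsafe_convert(Ptr{Cdouble}, $a)
# which typically reports ~1 ns and 0 allocations, versus ~100 ns for the
# un-interpolated version that also measures the untyped global access.

a = zeros(5)

# unsafe_convert on a Vector is equivalent to taking its pointer:
Base.unsafe_convert(Ptr{Cdouble}, a) == pointer(a)   # true
```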

Also, there is no need to use Core.apply_type here; you can just write Ptr{Cdouble}. See also the @ccall docs for an alternative calling syntax.
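(For example, the @ccall form of the call above would look like this; the libexample line is shown as a comment since that library is hypothetical, with cos as a runnable stand-in:)

```julia
# @ccall equivalent of the original call (libexample is hypothetical):
#   @ccall "libexample".sum(a::Ptr{Cdouble})::Cdouble

# Runnable stand-in using a C function available in the process:
y = @ccall cos(0.5::Cdouble)::Cdouble
y ≈ cos(0.5)    # true
```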

Could you show a representative MWE of your actual benchmark for the 130s that we can replicate on our end?


Yep, I forgot the dollar sign. The actual case is somewhat complex and not easy to reduce. I will close this thread for now and debug further myself. If I find something more useful, I'll reopen it.


I've found that the performance difference may have nothing to do with ccall. However, I suspect I still have to resort to C for better performance, which is another story and completely off-topic.

In case you are interested: the Python code is a thin wrapper around a complex C library with many layers of function calls. I found the most important part and translated it into Julia, but I missed a prescreening step that returns quickly for some terms. This is probably the main reason for the performance difference.

Off-topic part


But I still have difficulty translating it into Julia. The biggest problem is that I run into trouble with Julia multithreading, as shown in my previous post. (In fact, my use case in this post is exactly the piece of code in that post, apart from some symmetry-based optimizations I have made recently, and the fact that I am now calculating larger molecules than the simple water molecule in that post.) I finally found a solution in that post, but that solution requires splitting the workload across threads BEFORE actually doing the work.

Since I now need to use the prescreening function to improve performance, there is no way to split the workload well beforehand. I therefore suspect that the load balance of my Julia code will be much worse than that of the original C code, and I can find no solution to this.
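(One possible direction: Threads.@spawn uses a dynamic scheduler, so tasks are handed to idle threads as they finish and no up-front partitioning is needed even when prescreening makes task costs unpredictable. A hedged sketch, where prescreen and compute are hypothetical stand-ins for the real check and kernel:)

```julia
# Hypothetical stand-ins for the real prescreening check and expensive kernel:
prescreen(x) = iseven(x)       # cheap test that lets some terms return early
compute(x)   = Float64(x)^2    # the expensive part

function process_all(items)
    # Tasks are scheduled dynamically across threads as workers become free,
    # so uneven per-item costs balance themselves out.
    tasks = [Threads.@spawn (prescreen(x) ? compute(x) : 0.0) for x in items]
    return sum(fetch.(tasks))
end

process_all(1:4)    # 2^2 + 4^2 = 20.0
```

In practice one would spawn one task per chunk of items rather than per item, to amortize the per-task overhead.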