FunctionWrappers showing ~25x overhead over a native function call

#1

I am seeing a big performance difference between using FunctionWrappers and a native function call. Is this to be expected?
Please see the MWE below:

using BenchmarkTools  
using FunctionWrappers: FunctionWrapper 

struct A{T}
   x::T
end

get_power(a::A{T}, y::T,z::T) where {T} = a.x + y + z

const Fwarpper = FunctionWrapper{Float64,Tuple{A{Float64},Float64,Float64}}
const dict = Vector{Fwarpper}(1)

dict[1] = Fwarpper(get_power)
a = A(1.5)

@btime @inbounds get_power($a,5.6,7.87);
  0.488 ns (0 allocations: 0 bytes)

@btime @inbounds dict[1]($a,5.6,7.87);
  12.863 ns (0 allocations: 0 bytes)

This suggests that there is a ~25x overhead in this case?
Is it normal to expect this overhead between calling a function explicitly and calling it via FunctionWrappers?
This [post](Performance of functions returned from container) by @yuyichao seemed to suggest that there is no overhead.

My Julia VersionInfo

Commit 68e911b (2017-05-18 02:31 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)

I might be doing this all wrong as I am using FunctionWrappers for the first time.

thanks!


#2

Firstly, remove the @inbounds in front of get_power. It makes the expression not return anything, so everything is optimized away. Also, why are you even using it here at all?

Secondly, why are you benchmarking the time it takes to do the dictionary lookup in the FunctionWrapper benchmark but not for the standard function?
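To make the comparison fair, both calls can go through the same array lookup. A minimal sketch (reusing the definitions from the MWE above):

```julia
using FunctionWrappers: FunctionWrapper

struct A{T}
    x::T
end

get_power(a::A{T}, y::T, z::T) where {T} = a.x + y + z

const Fwarpper = FunctionWrapper{Float64,Tuple{A{Float64},Float64,Float64}}

# Put both the wrapped and the plain function behind the same
# 1-element array, so both calls pay the same lookup cost:
wrapped = Fwarpper[Fwarpper(get_power)]
plain   = [get_power]

a = A(1.5)
r1 = wrapped[1](a, 5.6, 7.87)   # via FunctionWrapper, with lookup
r2 = plain[1](a, 5.6, 7.87)     # native call, with the same lookup
# then compare e.g. @btime $wrapped[1]($a, 5.6, 7.87) vs @btime $plain[1]($a, 5.6, 7.87)
```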


#3

Thanks for the reply @kristoffer.carlsson, and for the @inbounds tip. I still have lots to learn. My bad, dict is a bad variable name; it's actually just a 1D Array.

I removed the @inbounds and tried removing array overhead completely:

using BenchmarkTools  
using FunctionWrappers: FunctionWrapper 

struct A{T}
   x::T
end

get_power(a::A{T}, y::T,z::T) where {T} = a.x + y + z

const Fwarpper = FunctionWrapper{Float64,Tuple{A{Float64},Float64,Float64}}

f =  Fwarpper(get_power)
a = A(1.5)

@btime get_power($a,5.6,7.87);
  2.794 ns (0 allocations: 0 bytes)

@btime $f($a,5.6,7.87);
  10.557 ns (0 allocations: 0 bytes)

Now the overhead is ~3.7x, which looks better, but I don't understand why there is any.

thanks.


#4

The type of a.x is also Float64 as far as I can see.


#5

A few more things you can do: put @noinline on get_power to prevent it from inlining. Also, FunctionWrappers does a null check for precompilation support.


#6

Not sure what you wanted to tell me. I am assuming the Julia compiler can infer that.


#7

Just tried your suggestions:

@noinline get_power(a::A{T}, y::T,z::T) where {T} = a.x + y + z

@btime get_power($a,5.6,7.87);
  4.749 ns (0 allocations: 0 bytes)

So now the overhead is ~2x, probably due to the null check, which I have no idea how to disable. But this means that there is a performance penalty to FunctionWrappers from not having inlining. Am I correct?

Thanks


#8

Yes and that is the whole point of the package.


#9

Understood. Thanks


#10

Sorry, I was confused. Ignore my comment.


#11

This is correct. The main idea of the package is in the Julia issue linked in the README. For this particular property:

> In these cases, it is acceptable that the callback function is not/cannot be inlined into the caller, but it is also desired to avoid unnecessary boxing or runtime type check/dispatch.

In other words, the whole point is that you'll have a single call site that can call different functions with the same signature that you don't want to specialize on.
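A minimal sketch of that pattern (`add` and `mul` are made-up callbacks, not from the thread above):

```julia
using FunctionWrappers: FunctionWrapper

# Hypothetical callbacks sharing one (Float64, Float64) -> Float64 signature:
add(x, y) = x + y
mul(x, y) = x * y

const F = FunctionWrapper{Float64,Tuple{Float64,Float64}}

# A concretely typed container holding different functions:
callbacks = F[F(add), F(mul)]

# One call site calls them all; the return type is always known to be
# Float64, so there is no boxing or runtime dispatch, only the
# (non-inlined) indirect call.
results = [cb(2.0, 3.0) for cb in callbacks]
```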