Save dispatched function for later call

I agree with @Sukera: you are just trying to implement an interpreter (or a virtual machine) for an extremely simple arithmetic language. We could do this easily in Haskell in fewer than 100 lines of code. If you believe this counts as a computation graph, you might just be wasting your time.

It’s indeed possible to manipulate function pointers in Julia, just like FunctionWrappers.jl does, to build a more general interpreter (for arbitrary types and functions). If you know how to do this in C/C++, then you can do the same kind of low-level thing in Julia (you just need to use Julia’s internals). But this is neither easy nor particularly beneficial. We already have a task scheduler: you can directly spawn the computation and let Julia handle it. Tasks do have some overhead and may not be appropriate if you have lots of small computations. But if a computation is that small, then you shouldn’t spawn it at all (besides, a dynamic interpreter also has overhead of its own, because it has to pack and unpack parameters to execute an opcode).

Instead, you should look into Debugger.jl or Julia’s abstract interpreter framework. Those are much closer to what you are doing here.

PS: If you really, really want to know how to get function pointers: use which(m, tt) to get a Method object. The specializations field of the Method stores the MethodInstance for this call. The cache field of the MethodInstance stores a CodeInstance, which acts like a handle. The specptr field of the CodeInstance stores the function pointer of the compiled function. The invoke field also stores a pointer; the difference is that the pointer in invoke is a wrapper around the pointer in specptr. It has the signature void* f(void* tparams, void** args, int32 nargs). This signature is the same for all functions, so you can ccall it in a uniform way (there is no dispatch, since the types of the inputs are fixed).
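For completeness, a hedged sketch that avoids those internal fields entirely: `@cfunction` is the documented way to get a raw pointer with a fixed signature, which can then be invoked uniformly with `ccall` (the `call_binop` helper name is made up for illustration):

```julia
# @cfunction compiles a C-callable wrapper for + with a fixed signature
# and returns a Ptr{Cvoid}, much like a function pointer in C.
fptr = @cfunction(+, Int, (Int, Int))

# Any pointer with this signature can be invoked the same way: no dispatch,
# since the argument types are fixed.
call_binop(p::Ptr{Cvoid}, a::Int, b::Int) = ccall(p, Int, (Int, Int), a, b)

call_binop(fptr, 3, 4)   # 7
```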

Yes, sorry. These are just sample data; in development and production these will be very big arrays or, in some cases, tensors.

I understand Haskell would mean fewer than 100 lines of code for this specific case. But our company’s research codebase is in Julia, and it would be a real waste of time to port everything. Also, if it is possible and fast in Julia, why would we mix languages?

Thank you for all the points, I will try to check them!

Sadly, FunctionWrappers.jl wasn’t clear enough for me and I couldn’t apply it appropriately. :frowning:

Why not take the representation you have and “compile” it to Julia expressions (i.e. write a macro), and then let the Julia compiler compile that to machine code? If you “know” the types of the inputs, then you simply write the Julia expressions with type annotations.

a = 1
b = 2
foo = Expr(:call, :+, a::Int, b::Int)

Instead of simple expressions you could generate functions that carry out the operations of interest.
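A minimal sketch of that idea (the names `body` and `compiled` are made up): build one expression from the list of operations and let `@eval` define a normal function, which the compiler then compiles once:

```julia
ops = [:+, :*, :-]

# Build a block expression equivalent to: begin a + b; a * b; a - b end
body = Expr(:block, (Expr(:call, op, :a, :b) for op in ops)...)

# Define and compile the function once; later calls are ordinary fast calls.
@eval compiled(a::Int, b::Int) = $body

compiled(3, 4)   # returns the last expression in the block: 3 - 4 == -1
```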

The problem here is speed; that approach is extremely inefficient.
eval shouldn’t be called this many times, because it is time-consuming.

using BenchmarkTools

function f(ops)
  for op in ops
    eval(op)
  end
end

function g(ops, a, b)
  for op in ops
    +(a, b)
  end
end

a = 1
b = 2
plus = Expr(:call, :+, a::Int, b::Int)
ops = [plus, plus, plus]

@btime f($ops)
@btime g($ops, $a, $b)

For me it shows:
163.592 μs (99 allocations: 6.56 KiB)
1.828 ns (0 allocations: 0 bytes)

Maybe this is the important point to resolve? As far as I can tell, FunctionWrappers.jl is the exact right solution to your original problem. For example:

julia> using FunctionWrappers: FunctionWrapper

julia> ops = [+, *]
2-element Vector{Function}:
 + (generic function with 190 methods)
 * (generic function with 328 methods)

julia> wrapped_ops = [FunctionWrapper{Int, Tuple{Int, Int}}(op) for op in ops]
2-element Vector{FunctionWrapper{Int64, Tuple{Int64, Int64}}}:
 FunctionWrapper{Int64, Tuple{Int64, Int64}}(Ptr{Nothing} @0x00007fe24935fed0, Ptr{Nothing} @0x00007fe297964a78, Base.RefValue{typeof(+)}(+), typeof(+))
 FunctionWrapper{Int64, Tuple{Int64, Int64}}(Ptr{Nothing} @0x00007fe249360190, Ptr{Nothing} @0x00007fe297964a80, Base.RefValue{typeof(*)}(*), typeof(*))

julia> using BenchmarkTools

julia> @btime (for i in 1:2; z[i] = $ops[i](x, y); end) setup=(x = 1; y = 2; z = zeros(Int, 2))
  60.315 ns (0 allocations: 0 bytes)

julia> @btime (for i in 1:2; z[i] = $wrapped_ops[i](x, y); end) setup=(x = 1; y = 2; z = zeros(Int, 2))
  12.344 ns (0 allocations: 0 bytes)

wrapped_ops is a concretely-typed Vector of function wrappers, so we can iterate over it and call the wrapped operations much more efficiently.

If you find the FunctionWrapper syntax awkward, I wrote a macro here: Can FunctionWrappers.jl express higher order functions? - #4 by rdeits which might help:

wrapped_ops = [@fn((Int, Int) -> Int)(op) for op in ops]

or, equivalently:

wrapped_ops = @fn((Int, Int) -> Int).(ops)

I still don’t really understand how you’re planning to use this. If you’re planning to execute each input exactly once, then the compilation time will be unacceptable. If you’re planning to take some user input, and then do some intense calculations with it, reusing the operation over and over in a loop, then the compilation time will be less important, and the fact that the operations are fast will be advantageous.

@rdeits This is exactly what I was looking for indeed. I can’t believe this already exists. :smiley: At some point I really have to give back to the community, this is just crazy…

@dlakelan the function is already compiled. The key is to call the appropriate one, but I want to spare the dynamic dispatch. This is hyper-optimization… sort of. But dispatching is actually very pricey, so this ends up being something like a 100x speed-up in this simple case.


I have one more observation to note.

If I compare speeds:

using FunctionWrappers: FunctionWrapper

function l(ops, a, b)
	for op in ops
		op(a, b)
	end
end

a = 1
b = 2
ops_s = [:ADD, :ADD, :MUL, :ADD, :SUBS, :SUBS, :MUL, :MUL]
opp = [+, +, *, +, -, -, *, *]
w_op = [FunctionWrapper{Int, Tuple{Int, Int}}(op) for op in opp]

@btime $l($opp, $a, $b)
@btime $l($w_op, $a, $b)

function h(ops, a, b)
	for op in ops
		if op == :ADD
			+(a, b)
		elseif op == :SUBS
			-(a, b)
		elseif op == :MUL
			*(a, b)
		elseif op == :DIV
			/(a, b)
		end
	end
end

@btime $h($ops_s, $a, $b)

The results:

   155.907 ns (0 allocations: 0 bytes)
   40.175 ns (0 allocations: 0 bytes)
   3.876 ns (0 allocations: 0 bytes)

I notice we do at least 2-3 extra if comparisons per call in the case of h(…), and the cost scales roughly linearly with the number of operation types. But we still get notably better speed; do we still lose some speed during the FunctionWrapper calls?

h(...) is a compilable function, that is why it is really fast, I know… but shouldn’t l(...) with the wrapped version be fine too? Where do we lose speed with the FunctionWrapper version? Shouldn’t it be comparable in speed?

Yeah, some loss is unavoidable here. The problem is that a FunctionWrapper call can’t be inlined (just like a function pointer in C/C++), so it won’t be as fast as a normal function call which can be inlined. This is exactly the same problem that virtual methods have in C++.
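The inlining point can be illustrated without FunctionWrappers at all: a hedged micro-experiment where `@noinline` plays the role of the call barrier (the names `add_inlined`, `add_opaque`, and `total` are made up):

```julia
add_inlined(a, b) = a + b           # can be inlined into the caller
@noinline add_opaque(a, b) = a + b  # forced through a real call, like a function pointer

function total(f, xs)
    s = 0
    for x in xs
        s = f(s, x)   # specializes on typeof(f); only the inlinable one fuses into the loop
    end
    s
end

xs = collect(1:1000)
total(add_inlined, xs) == total(add_opaque, xs)   # same result, different speed
# To see the gap: using BenchmarkTools; @btime total(add_inlined, $xs); @btime total(add_opaque, $xs)
```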


It’s still unclear why you need this. At the beginning of this post, you said you are trying to design a fast computation graph (I guess a static one, since people use other names for dynamic ones). That’s why I said you should just compile your computation graph as a whole. Yes, you have already compiled each individual function, but you can still compile them together. For example, if you have already compiled f and you want to invoke f in a loop, then you can just compile the loop. This is the right way to avoid all the overhead, as long as you can get the full computation graph.

However, the code sample you presented before is not a (static) computation graph. It’s a virtual machine (or interpreter) over a sequence of instructions, which is a totally different thing. So we need to perform runtime dispatch (since it’s impossible to get the full code ahead of time). But that’s really expensive (compared to a compiled version), since you have to decode the opcode, set up a call frame on the stack, and so on. We can do this by using function pointers in Julia. To avoid these costs, many virtual machines have specialized opcodes for frequently used operations, like boolean and arithmetic operations. If you interpret a loop this way, you will be faster than dynamic dispatch, but still lose a lot of performance. So people sometimes add a bytecode compiler before interpreting the code: it compiles the code to a lower form (much like machine code) instead of directly walking an AST. Such code is much cheaper to interpret (but still much slower than statically compiled code).
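The specialized-opcode idea can be sketched in a few lines; `OpCode` and `run_ops` are made-up names for illustration, and the closed opcode set is what lets the interpreter branch instead of doing dynamic dispatch:

```julia
# A fixed, closed set of opcodes: the interpreter only ever branches on an enum.
@enum OpCode ADD SUB MUL

function run_ops(prog::Vector{OpCode}, a::Int, b::Int)
    acc = 0
    for op in prog
        # Each branch is a specialized, statically dispatched operation.
        if op == ADD
            acc += a + b
        elseif op == SUB
            acc += a - b
        else
            acc += a * b
        end
    end
    acc
end

run_ops([ADD, MUL, SUB], 3, 4)   # (3+4) + (3*4) + (3-4) == 18
```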

It’s possible to come up with a hybrid of static compilation and dynamic interpretation, regardless of whether the graph is static or dynamic. That’s what Julia’s REPL and other profiling JIT compilers do under the hood: interpret unimportant code and compile the “hot” code. But implementing this is not easy (it requires much more non-trivial Julia knowledge).

The core problem here is how you get the “computation graph” and how low-level it is. If you can statically get the graph and control how it is generated (which I believe is the case), you should use the compiler. For example, if someone is designing a node-editor system like Unreal’s Blueprint or Blender’s node graphs, they should translate the graph to Julia’s AST (or lowered IR) and compile it before running. If it’s fully dynamic (though I really doubt this), then you have to use an interpreter.
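The “translate the graph to an AST and compile it” route might look like this minimal sketch; the `Node` type and the `to_expr`/`run_graph` names are hypothetical:

```julia
# A hypothetical static computation graph node: an operation plus its inputs.
struct Node
    op::Symbol
    args::Vector{Any}   # each entry is a Node or an input Symbol
end

# Lower the graph to an ordinary Julia expression, recursively.
to_expr(n::Node) = Expr(:call, n.op, (a isa Node ? to_expr(a) : a for a in n.args)...)

graph = Node(:+, Any[Node(:*, Any[:x, :y]), :x])   # represents (x * y) + x

# Compile the whole graph once; afterwards it's just a normal Julia function.
@eval run_graph(x, y) = $(to_expr(graph))

run_graph(2, 3)   # 2*3 + 2 == 8
```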


To be honest, from a purely semantic point of view, what you’re asking about should be impossible (if there were no access to individual method instances via Julia’s reflection capabilities). The reason is pretty simple: Julia is a dynamic language, meaning that semantically Julia “does” a method lookup for each call. It’s just that, thanks to type inference and static compilation of inferred code (enabled by a limited eval), there really are compiled method instances to look at, even though semantically Julia “does” a lookup. If Julia were a static language, you wouldn’t have the luxury of just putting a generic + in a Vector and calling it (well, not in any sane language with properly typed functions rather than bare function pointers; that’s a disaster waiting to happen): you would have to specify which + method you want and basically implement the dynamic dispatch yourself.

All that’s just a cherry on top of the dynamic computation graph arguments being made here as well though.


@rdeits thank you! What I believed is that I just call the dispatched function directly each time, so I just save the pointer of where to go and don’t waste time on dispatch. In the case of + and - arithmetic, the dispatch can cost more than the function itself.

@ChenNingCong, yeah, you are right: this is eventually a virtual machine or interpreter that just fluidly does what the computation graph describes, which would be just pointers to functions and their inputs to call.
The way I get the computation graph is just random at the moment: I generate it from 4 operations [+, -, *, /]; in time I will control the creation of the computation graph.

@Sukera Thank you for the clarifications. Yeah, I don’t know if there is a chance to just call a function instantly, since I already have the pointer to the compiled code ready in the list. I feel like this is really low-level but… is it possible to just move the program counter to the compiled function with the inputs and call it… :smiley:

Do you guys think it is possible to call a function without any dynamic dispatch, just assembling something like fn_ptr + [inputs] and then calling it…?

That’s exactly what FunctionWrapper is doing: it’s actually a function pointer under the hood, just like in C/C++. It just turns out that function pointers have a small (few nanosecond) overhead. That’s not because function pointers themselves are slow, but simply because calling any function has a nonzero cost if that function is not inlined.


As someone working in computer security, I will now insert my horrified face:


If you’re willing to dig very deep into Julia internals and you’re not afraid of llvmcall and ccall: sure! :smiley: But at that point, why use Julia in the first place? llvmcall and ccall are very powerful, but they also mean you’re not going to be able to leverage a lot of what Julia has to offer.

Like I mentioned, if you want to avoid dynamic dispatch at all costs, Julia’s semantics kind of work against that (how the implementation achieves it is a different matter).

@rdeits FunctionWrappers.jl talks to Julia internals on the user’s behalf, right? What happens when those assumptions/usages break, given that the internals are free to change with any Julia version?


So basically FunctionWrappers uses a cfunction, which introduces a compilation barrier; for example, constants can’t be propagated into the wrapped function from constant call arguments.

What I would also consider is using opaque closures; aren’t they pretty much the new native way of calling fixed functions? :slight_smile:
I hope they are less of an optimization barrier, but @Keno probably knows how much use they would be for this problem.
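A hedged sketch of the opaque-closure route (this is an experimental API, so the syntax and behavior may change between Julia versions):

```julia
# Base.Experimental.@opaque builds a closure with a fixed call signature,
# so calls go through a stored code pointer rather than generic dispatch.
oc = Base.Experimental.@opaque (a::Int, b::Int) -> a + b

# Opaque closures with the same argument and return types share a concrete
# type, so they can live in a concretely typed Vector, much like
# FunctionWrapper{Int, Tuple{Int, Int}} values.
ops = [Base.Experimental.@opaque((a::Int, b::Int) -> a + b),
       Base.Experimental.@opaque((a::Int, b::Int) -> a * b)]

oc(3, 4)   # 7
```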

What really caught my eye is how a switch-case could be so fast in your benchmarks, ~10x faster. :open_mouth:


@rdeits I just realised what you were talking about, sorry. But does inlining then really yield a 10x speedup? :o This seems like nonsense… Something is strange to me. How can a function-pointer jump cause this big a speed penalty… Damn :o

Yeah that 10x is damn crazy.

There are other people here who are far more qualified than I am on this topic, but basically: yes. When the operation is as simple as adding two integers (a single CPU instruction), then the relative cost of a function call can be large. Inlining is important…

For me it is only interesting because, as I measured, in my case the indirection costs 0.7-0.9 ns for an array and 3.5-5.5 ns for a dict. I know a dict is something different, much more complex, but then how does this become 7 ns (in the case of a length-1 array in the example Save dispatched function for later call - #22 by Marcell_Havlik)… it is just interesting (the relative difference is around 30x when longer arrays are used).

Reason #1 for this is that inlining allows other optimizations. For example, if your function is +, inlining might allow loop re-ordering, which can let the compiler replace scalar addition with vectorized addition instructions, which is a few times faster.