Why is usage of llvmcall so restricted?

Hi all. I would otherwise post in Internals & Design, but I’m not allowed to. I’ve looked around Discourse and elsewhere, but I’d still like to better understand why llvmcall works the way it does, and whether there are other, better escape hatches somewhere.

My current bottleneck is compile-time overhead, which is why I’m exploring intervening at a lower level.

The following is verboten in Julia:

#llvmtest.jl
using Base: llvmcall

get_ir(x, y) = """
%1 = add i32 $x, $y
ret i32 %1
"""

function llvm_add(x, y)
    ir = get_ir(x, y)
    llvmcall(ir, Int32, Tuple{})
end

@show llvm_add(2, 3)
#julia llvmtest.jl
ERROR: LoadError: error statically evaluating llvm IR argument
Stacktrace:
 [1] llvm_add(x::Int64, y::Int64)
   @ Main ~/projects/julia-learning-internals/llvmtest.jl:19
 [2] macro expansion
   @ show.jl:1181 [inlined]
 [3] top-level scope
   @ ~/projects/julia-learning-internals/llvmtest.jl:22

I can fix this toy example with @generated, but that’s a class of solutions I’m trying to avoid because it ends up exploding the method table.

@generated function llvm_add(::Val{x}, ::Val{y}) where {x,y}
    ir = get_ir(x, y)
    return quote
        llvmcall($ir, Int32, Tuple{})
    end
end

@show llvm_add(Val(2), Val(3))

1) Why does this limitation exist?

I get that I might be breaking some internal guarantees about what methods exist and so on, but in the end I just want to create a function by mashing together separate bits of LLVM IR I have lying around into a “meta-function” and execute that many times, quickly. The hope is to avoid (some of) the compile overhead of the “meta-function”, since I already have the IR available for the individual functions I compose into it.
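
For reference, the constant-string form works fine; this sketch is adapted from the manual’s example, where the arguments show up as %0 and %1 inside the IR:

const_add(x, y) = Base.llvmcall("""
    %3 = add i32 %1, %0
    ret i32 %3
    """, Int32, Tuple{Int32, Int32}, x, y)

@show const_add(Int32(2), Int32(3)) # 5

What I’m after is the same thing, but with the IR string assembled at runtime.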

2) Given my aims, are there other escape hatches than llvmcall that could be useful?

1 Like

I can’t really comment on llvmcall, but I feel we have an XY-problem type of situation here.

Before jumping to any conclusions, why don’t you tell us about your problem first? Then we will see how to remedy it together :slight_smile:

A small comment on your experiment: even if your llvmtest.jl worked, it would be horribly inefficient! Just think about what needs to happen on each call to llvm_add: a string is constructed, which then needs to be parsed and translated to LLVM instructions, which then need to be assembled and finally run. That’s effectively equivalent to reading/parsing/compiling this function on every call! I don’t think you can ever improve compile times this way.

4 Likes

Evaluation of millions of different mathematical expressions that are generated at runtime. There are various approaches; one constraint I care about is writing the output of every node in the expression to a vector. In Polish (prefix) notation, the output of the expression [*, squared, x, y] could be [16, 4, 2, 4], for instance. Here you would use for loops, which is nice and fast, but you lose some type information (and therefore speed) by needing the abstract type Function. Things can be made fully static using approaches like [ANN] New package: CallableExpressions v1.0.0, where the whole expression goes into the type parameters of your structs. Super fast to evaluate, but very slow to compile, ~100 ms for a medium-sized one, which makes millions of expressions untenable. Not to mention exploding method tables: these math expressions are unlikely to be used again, so I’d love to know how to avoid them being stored in the method table.
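
Something like this is roughly what I mean by the for-loop version (the node layout here is just an illustrative sketch; the Vector{Function} is exactly the abstractly-typed part that costs speed):

struct Node
    op::Int            # index into ops, or 0 for a leaf
    args::Vector{Int}  # indices of child nodes
    leaf::Float64      # value if this node is a leaf
end

function evaluate!(out::Vector{Float64}, nodes::Vector{Node}, ops::Vector{Function})
    for i in length(nodes):-1:1   # in prefix order, children come after their parent
        n = nodes[i]
        out[i] = n.op == 0 ? n.leaf : ops[n.op](out[n.args]...)
    end
    return out
end

# [*, squared, x, y] with x = 2, y = 4:
ops = Function[*, x -> x^2]
nodes = [Node(1, [2, 4], 0.0), Node(2, [3], 0.0), Node(0, Int[], 2.0), Node(0, Int[], 4.0)]
evaluate!(zeros(4), nodes, ops)   # [16.0, 4.0, 2.0, 4.0]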

So, the LLVM IR for the individual functions is easy. And I thought I could just squish those pieces together to build the different expressions, thereby getting the good performance but hopefully avoiding some of the compile time. I got it to work, but hit the problem in my original post. Runtime-generated functions avoid the dynamic problem, but seem to add more compile overhead.
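
(By runtime-generated functions I mean roughly this, using RuntimeGeneratedFunctions.jl — untested sketch; it avoids world-age issues, but each expression still goes through the full compilation pipeline:)

using RuntimeGeneratedFunctions
RuntimeGeneratedFunctions.init(@__MODULE__)

ex = :(function f(x, y)
    x^2 * y
end)
rgf = @RuntimeGeneratedFunction(ex)
rgf(2.0, 4.0)  # 16.0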

That just means llvmcall is the wrong approach, but the general gist is that in the chain julia code -> lowered -> typed -> llvm -> native, we could maybe avoid some compile overhead by intervening later in the chain. You’re right about the string parsing; there is probably another object (CodeInfo?) that is less expensive to get LLVM IR out of.
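
(Grabbing the IR for a single operation is the easy part, e.g.:)

using InteractiveUtils
code_llvm(stdout, +, Tuple{Int32, Int32})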

Then why compile specialized code for the expressions at all? Just interpret them.

I’m optimizing parameters in each expression using gradient descent, so I evaluate each expression some 20k times. But after that, I’m unlikely to call that evaluation again, hence a desire to unclutter the method table. But Base.delete_method won’t work, since that operates on type families, and the clutter happens in the parametrized types in this case.

1 Like

You seem to be describing two different things:

  1. You have some runtime-generated functions that you need to call somewhat often. You want both performant execution and no compile time.
  2. You want to clean up the method table. Why? Is that really an issue? (Honest curiosity)

The first problem is a classic amortized-cost question: if you want the absolute minimal runtime, you need some compilation. It would probably be best to make a few “typical case” measurements to determine whether the one-time compilation cost is worth the increased runtime speed. You can likely make a very fast “interpreter” solution that is not orders of magnitude slower than a fully compiled function.
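
Something like this back-of-the-envelope check, where build_specialized and interpret are hypothetical placeholders for the two strategies:

compiled_f = build_specialized(expr)       # hypothetical: compiles a method for `expr`
t_compile  = @elapsed compiled_f(x0)       # first call pays the compilation
t_fast     = @elapsed compiled_f(x0)       # steady-state cost per call
t_interp   = @elapsed interpret(expr, x0)  # hypothetical interpreter
# specializing pays off roughly when:
t_compile < 20_000 * (t_interp - t_fast)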

I don’t see how you could end up with a fully optimized, native function without calling into LLVM, which is likely the slowest part of the compilation (someone with more knowledge should comment on this). So if compilation time dominates, it’s probably not worth trying to make compilation faster, and a more fruitful approach is to make a fast interpreter. And if compilation does not dominate, then just compile :slight_smile:
That’s my view on this, but as I said: I am no expert on compiler things.

2 Likes

One thing I’d like to emphasize is that “where all of the expression goes into the type parameters of your structs” is not necessarily true, either in general or for CallableExpressions.jl. You might be making a “false dilemma” logic error here, comparing a solution with worst-case compilation time against a solution with worst-case run time, instead of also looking at the middle ground.

Possibly @abraemer’s suggestions should be revisited:

This is important information:

I guess that 20k is a large enough number to make it worthwhile to compile specialized methods for each expression, without moving all the data into the type domain. Perhaps it’d be good if you could create a hopefully representative, smallish example that’s easy to benchmark, so we could try to optimize it (“nerd snipe”). If you do create such a nerd-snipeable example, I suggest posting a new topic/thread; this one already veers far from its title.

Kinda speculative, but Julia actually allows a lot of freedom around making custom method tables, compilation, and the “abstract interpretation” stuff (whatever it is). Note, though, that I’m not sure how much of this is supported or guaranteed to stay compatible across Julia versions. If you’re really interested in exploring custom method tables, perhaps look into the GPUCompiler.jl implementation. Despite the “GPU” in the name, the package is also used by, e.g., StaticCompiler.jl; it isn’t specific to GPU programming.

FWIW, I think calling into LLVM becomes much cheaper when you lower the optimization level.
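
For example (sketch; the module name is made up), either globally on the command line or per module:

# start Julia with a lower optimization level:
#   julia -O0 script.jl
# or lower it just for the module where the generated methods live:
module ExprCompilation
    Base.Experimental.@optlevel 0
    # methods defined here get cheaper LLVM optimization passes
end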

2 Likes

Have you seen GitHub - SymbolicML/DynamicExpressions.jl: Ridiculously fast symbolic expressions btw?

4 Likes

I think I would try the recommended DynamicExpressions.jl first, as that looks pretty powerful and ergonomic.

But if the above solutions don’t work for you, you can also consider FunctionWrappers.jl.

julia> using FunctionWrappers: FunctionWrapper

julia> plus_f64 = FunctionWrapper{Float64, Tuple{Float64, Float64}}(+);

julia> times_f64 = FunctionWrapper{Float64, Tuple{Float64, Float64}}(*);

julia> typeof(+) == typeof(*)
false

julia> typeof(plus_f64) == typeof(times_f64)
true

julia> plus_f64(3.0, 4.0), times_f64(3.0, 4.0)
(7.0, 12.0)

FunctionWrappers lets you give functions a type based entirely on their (concrete) input and output types, rather than on the function itself. This way, you can store and call many different functions in a type-stable way. With that, you should be able to write code that assembles and evaluates runtime expressions while avoiding eval or even dynamic dispatch.
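
For instance, here is a sketch of evaluating an expression node by node through wrappers (the flattened node layout is made up; leaf values are pre-filled in the output vector):

using FunctionWrappers: FunctionWrapper

const BinOp = FunctionWrapper{Float64, Tuple{Float64, Float64}}

function eval_nodes!(out::Vector{Float64}, op::Vector{Int},
                     left::Vector{Int}, right::Vector{Int}, ops::Vector{BinOp})
    for i in length(op):-1:1
        op[i] == 0 && continue                             # leaf: value already in out[i]
        out[i] = ops[op[i]](out[left[i]], out[right[i]])   # concrete call, no dynamic dispatch
    end
    return out
end

# x * x * y with x = 2.0, y = 4.0; nodes in prefix order: [*, *, x, x, y]
ops = BinOp[BinOp(*)]
out = [0.0, 0.0, 2.0, 2.0, 4.0]
eval_nodes!(out, [1, 1, 0, 0, 0], [2, 3, 0, 0, 0], [5, 4, 0, 0, 0], ops)  # out[1] == 16.0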

2 Likes

A little dynamic dispatch seems like it’d be worthwhile given the relatively high number of required evaluations.

2 Likes

Thanks for all the helpful input!

I see how this looks like the XY problem, so I’d like this thread to be about the Y part: understanding why llvmcall works the way it does. According to an admittedly old lecture by Jeff Bezanson, compile time is roughly a 50/50 split between lowering in Julia and native code generation. So it stands to reason that if you already have snippets of IR available, gluing them together and shipping them to LLVM should compile faster than starting from Julia code, especially with -O0.

About the actual problem, X: many people mention DynamicExpressions.jl. That is likely the best middle-ground solution. I can’t shove my problems through it directly, but the type-stable operator-enum trick is the clever idea I’ll make use of. I’ll also check out FunctionWrappers.jl, which I was unaware of; it looks like understanding how it works will teach me a lot about Julia internals.
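
(As I understand it, the operator-enum trick boils down to something like this sketch: a small integer code per operator that you branch on explicitly, so nodes never need to store a Function:)

@inline function apply_op(code::UInt8, a::Float64, b::Float64)
    code == 0x01 && return a + b
    code == 0x02 && return a - b
    code == 0x03 && return a * b
    return a / b
end

apply_op(0x03, 2.0, 4.0)  # 8.0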

@abraemer answered this quite well in the first response here:

Perhaps another way to think about this is that a hypothetical llvmcall with support for runtime strings is effectively an eval∘parse of a string with code. Regardless of the language, it’s a complete black box to the compiler ahead of time. This leads to many pessimizations around that black box in addition to the overheads of the string construction and parsing.
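
In Julia-level terms, it’s the moral equivalent of this (sketch):

src = "(x, y) -> x + y"        # runtime string: a black box to the compiler
f = eval(Meta.parse(src))      # parsed, lowered, and compiled at runtime
Base.invokelatest(f, 2, 3)     # and callers can’t specialize on it (world-age hop)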

Finally, even if this would work as you had hoped and reduce your overhead by that theoretical 50%, is a 2x improvement really going to save your bacon here? It’s highly likely that half of an intractable problem is still intractable. So that’s why folks are looking to solve X. :slight_smile:

4 Likes

Ah sorry! I now worry I’ve been confusing people with a mistake in the example… my bad, very sorry. Of course parsing and compiling on every call is bad; instead, return a function of no arguments:

function llvm_add(x, y)
    ir = get_ir(x, y)
    () -> llvmcall(ir, Int32, Tuple{})
end

f = llvm_add(2, 3)
f()

This is still not allowed in Julia, but if it were, I would hope for it to compile faster (maybe not for this toy example, but for more complicated IR).

FunctionWrappers.jl is my solution to X for now :slight_smile: and I’ve marked this thread as solved. The remaining question really is just about understanding why the above is not allowed.