Directly modify compiled method without recompilation

Question – how can I evaluate an expression into a method, compile it, and then directly manipulate the compiled version (something lower-level than the AST) to create another, slightly modified method faster than recompiling from scratch?

I need to squeeze out performance here, as slightly modified methods will run as part of an evolutionary algorithm.


Let’s say I have this function (simplified for the sake of the example):

function calc(myinput::Float32)::Float32
    myplus = Float32(1.0)
    mymul = Float32(2.5)
    return (myinput + myplus) * mymul
end

I want to generate different versions of it for different values of myplus and mymul.
Note: I want different methods, not one method that takes them as parameters.

The common way is to build an Expr:

function get_calc(myplus::Float32, mymul::Float32)::Expr
    return :(
        function calc(myinput::Float32)::Float32
            return (myinput + $myplus) * $mymul
        end
    )
end

It’s even possible to use RuntimeGeneratedFunctions.jl to make callable methods from an Expr at runtime without world-age problems.
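A minimal sketch of that usage (assuming get_calc from above; the single-argument @RuntimeGeneratedFunction form is the one from the package README):

using RuntimeGeneratedFunctions
RuntimeGeneratedFunctions.init(@__MODULE__)

calc_rgf = @RuntimeGeneratedFunction(get_calc(1.0f0, 2.5f0))  # build the Expr and make it callable
calc_rgf(3.0f0)  # (3.0f0 + 1.0f0) * 2.5f0, no world-age issues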

However, when profiling a run of this 10M times, the profiler says that 99% of the time is spent on compilation, which makes sense.

Is there any way to reduce compilation and edit the compiled code directly?


Working end-to-end example which I want to run faster:

using RuntimeGeneratedFunctions

RuntimeGeneratedFunctions.init(@__MODULE__)

function get_calc(myplus::Float32, plusop::Expr, mymul::Float32, mulop::Expr)::Expr
    return :(
        function calc(myinput::Float32)::Float32
            v1 = myinput
            v2 = $myplus
            v3 = $mymul
            $plusop # v4 = v1+v2
            $mulop  # v5 = v4*v3
            res = v5
            return res
        end
    )
end

get_calc_func(myplus, plusop, mymul, mulop) =
    @RuntimeGeneratedFunction(@__MODULE__, get_calc(myplus, plusop, mymul, mulop), opaque_closures = false)

function find_calc()
    bestres = typemin(Float32)

    for myplusop in [:(v4 = v1 + v2), :(v4 = v1 - v2), :(v4 = v2 - v1)]
        for mymulop in [:(v5 = v4 * v3), :(v5 = v4 / v3), :(v5 = v3 / v3)]
            for myplus in range(-100f0, 100f0, length=100)
                for mymul in range(-5f0, 5f0, length=30)
                    mycalc_expr = get_calc_func(myplus, myplusop, mymul, mymulop)
                    for myinput in range(-10, 10, step=50)
                        res = mycalc_expr(myinput)
                        if res > bestres
                            bestres = res
                        end
                    end
                end
            end
        end
    end
    return bestres
end

@time find_calc()

#   71.370488 seconds (242.33 M allocations: 13.223 GiB, 5.51% gc time, 98.26% compilation time)

Note that plusop and mulop are generic Exprs working on any local variables inside calc(), not limited to specific operations, so they can’t be represented by an operation tree etc.

You could perhaps try JuliaInterpreter.@interpret to avoid compilation?
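Something like this, for reference (a minimal sketch; square here is just a hypothetical stand-in for one of the generated calc methods):

using JuliaInterpreter

square(x::Float32) = x * x   # hypothetical stand-in for a generated method
@interpret square(3.0f0)     # runs the lowered code in the interpreter instead of compiling it to native code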

This might be useful: https://discourse.julialang.org/t/meta-programming-an-if-else-statement-of-user-defined-length/ (also worth seeing what the package linked in the top post there ended up doing; not sure if it’s the same as the solution to that post)

Thanks!

I’m trying it, and it does reduce compilation time and runtime, at the expense of memory allocation. Will dig deeper into it and update.

Full link for reference: https://juliadebug.github.io/JuliaInterpreter.jl/

Some great references there, will look into it, thanks!

Anything specific that caught your eye that might directly resolve my issue?

Just to clarify – I don’t want to avoid compilation entirely; I want it to happen, just not for every little change.
Instead, I want it to happen once, and then to be able to modify the low-level code, i.e. the optimized and compiled version.

You might try to figure out what happens inside precompile (and @code_native), and see if you can dig out an address to the compiled code and modify it with some unsafe_ function (if it’s only a matter of changing some constants). Not very automatic, and highly dependent on internals and versions of LLVM, inlining and such, but probably doable with some effort.

Are you sure a closure won’t work for you? It seems like that’s what you were trying to do with your get_calc attempt.

function calc(myinput::Float32)::Float32
    myplus = Float32(1.0)
    mymul = Float32(2.5)
    return (myinput + myplus) * mymul
end

function get_calc(myplus::Float32, mymul::Float32)
    return function calc(myinput::Float32)::Float32
        return (myinput + myplus) * mymul
    end
end

using BenchmarkTools
@btime calc($3f0) # hard-coded values, 2.3ns
@btime get_calc($1f0,$2.5f0) # build a closure, 2.3ns
calcclosure = get_calc(1f0, 2.5f0)
@btime $calcclosure($3f0) # call a closure, 2.3ns
@btime get_calc($1f0,$2.5f0)($3f0) # build *and* call closure, 2.3ns

The last line shows that building the closure is essentially free.

Or perhaps your example was oversimplified and there’s some reason a closure won’t work?

mikmoore’s closure approach does just that. Closures are implemented as hidden functors, the explicit form of which may be more enlightening. There is a struct type defined to contain the myplus and mymul values, and a method can be defined for the type to allow the instances to be called. Only that method needs to be compiled, not every instance/version. Here’s what it looks like:

# closures would hide this type from you; they do
# not expose an accessible global type name like this.
struct Calc
    myplus::Float32
    mymul::Float32
end

# calc is not a global function name,
# it is a local variable in the method. 
function (calc::Calc)(myinput::Float32)::Float32
    return (myinput + calc.myplus) * calc.mymul
end

calc1 = Calc(1.0f0, 2.5f0) # how a closure is implemented
@time calc1(1.0f0)  # (calc::Calc) method compiles here

calc2 = Calc(2.0f0, 2.0f0)
@time calc2(1.0f0) # reuses compiled (calc::Calc) method

Of course, this only works if your variations are as simple as different values that can be contained as concretely annotated fields of a single struct type. If there are abstractly annotated fields or if the type is parametric (thus multiple), you do open yourself up to more compilation, though it likely won’t compile as often as the OP approach of generating new versions of a method’s expression. Functors aren’t as drastic as changing native instructions.
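For example, here is a rough sketch (Calc2 is hypothetical, not from the posts above) of a parametric functor that also captures the operations as fields; each new combination of operation types compiles the call method once, while new constant values reuse it:

struct Calc2{P,M}
    myplus::Float32
    mymul::Float32
    plusop::P
    mulop::M
end

# one call method; it gets specialized per combination of operation types
(c::Calc2)(myinput::Float32)::Float32 = c.mulop(c.plusop(myinput, c.myplus), c.mymul)

calc_a = Calc2(1.0f0, 2.5f0, +, *)  # compiles the call method for Calc2{typeof(+),typeof(*)}
calc_b = Calc2(3.0f0, 0.5f0, +, *)  # reuses that compiled method
calc_c = Calc2(1.0f0, 2.5f0, -, /)  # new type parameters, so this combination compiles once more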

I was looking for something between native/LLVM code and high-level code.

Probably @code_lowered, which gives a CodeInfo (tied to a Method/MethodInstance), but I didn’t find how to directly modify these and make them runnable again. Do you know?
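To show how far I got (just a sketch; turning a modified CodeInfo back into something callable is exactly the part I’m missing):

using InteractiveUtils

function calc(myinput::Float32)::Float32
    return (myinput + 1.0f0) * 2.5f0
end

ci = @code_lowered calc(1.0f0)  # a CodeInfo object
ci.code                         # the lowered statements (a Vector{Any}) that could be inspected or edited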

Yeah it was oversimplified, but the important part is that get_calc returns an Expr, which is dynamically built, including the calculation steps.

Exactly, but this was a simplification just for the sake of the example.
In reality, get_calc does get a structure of parameters, but uses that to construct an Expr object. Depending on the parameters, the Expr object is different.

Different enough to be more than just different values inside, but similar enough that a lowered version could be modified directly without going through compilation every time.

Sounds interesting; you should make a different Expr-generation MWE closer to your use case so that we can see how it cannot be refactored into variable-capturing closures/functors or repeated in loops.

I don’t know anything about this to make any suggestions, but it doesn’t sound that simple. A high-level language (C and above) is supposed to abstract away the lower-level and far less readable details. LLVM IR (and anything that comes after it in the compilation process) has been through a lot of optimization and will not resemble your source code/generated Expr very much. Another complication: unlike source code, compiled code isn’t one thing in one place; inlining and optimizations can result in a method showing up as several different compiled codes in several places. How do you know which to copy and edit to make a variant method?
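As a rough illustration of how little survives (f here is just a throwaway example, not your generated code):

using InteractiveUtils

f(x::Float32) = (x + 1.0f0) * 2.5f0
# prints the optimized LLVM IR: typically just a couple of float instructions,
# with the constants folded/rearranged as the optimizer sees fit
@code_llvm debuginfo=:none f(1.0f0)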

Now added a working end-to-end example of my current implementation, at the beginning of the original post. Thanks!

Perhaps these could be relevant, but not sure how to use them:

This line in your MWE is wrong and breaks things: for myinput in range(-10, 10, step=50)

I’ll assume you meant range(-10f0, 10f0, length=50) and offer my suggested revision:

function get_calc_closure(myplus::Float32, plusop, mymul::Float32, mulop)
    return function calc(myinput::Float32)::Float32
        v1 = myinput
        v2 = myplus
        v3 = mymul
        v4 = plusop(v1,v2,v3) # (v1,v2,v3) -> v1+v2
        v5 = mulop(v1,v2,v3,v4) # (v1,v2,v3,v4) -> v4*v3
        res = v5
        return res
    end
end

function find_calc(plusops,mulops)
    bestres = typemin(Float32)

    for myplusop in plusops
        for mymulop in mulops
            for myplus in range(-100f0, 100f0, length=100)
                for mymul in range(-5f0, 5f0, length=30)
                    mycalc_expr = get_calc_closure(myplus, myplusop, mymul, mymulop)
                    for myinput in range(-10f0, 10f0, length=50)
                        res = mycalc_expr(myinput)
                        if res > bestres
                            bestres = res
                        end
                    end
                end
            end
        end
    end
    return bestres
end

plusops = [(v1,v2,v3) -> v1 + v2, (v1,v2,v3) -> v1 - v2, (v1,v2,v3) -> v2 - v1]
mulops = [(v1,v2,v3,v4) -> v4 * v3, (v1,v2,v3,v4) ->  v4 / v3, (v1,v2,v3,v4) -> v3 / v3]

@time find_calc(plusops,mulops) # compile everything
  0.262965 seconds (3.35 M allocations: 72.091 MiB, 12.21% gc time, 75.68% compilation time)
@time find_calc(plusops,mulops) # everything compiled
  0.050861 seconds (2.78 M allocations: 42.435 MiB, 12.41% gc time)
@time find_calc([(v1,v2,v3) -> v1 + v2, (v1,v2,v3) -> v1 - v2, (v1,v2,v3) -> v2 - v1],[(v1,v2,v3,v4) -> v4 * v3, (v1,v2,v3,v4) ->  v4 / v3, (v1,v2,v3,v4) -> v3 / v3]) # the subfunctions are not compiled yet
  0.086823 seconds (2.85 M allocations: 46.028 MiB, 3.88% gc time, 45.66% compilation time)
638.0f0

Note that I modified your function to take in the operations as vectors of arbitrary functions (which will crash if they are incompatible with the call site, of course), just to be sure I’m not cheating somehow or hiding the cost of defining these functions somewhat dynamically. Further, I re-make this vector (and the functions) in the last @time call, so you can ignore the first @time call (which has the extra cost of compiling find_calc) and look only at the second call (which only compiles the functions in the vectors and the closure).

Passing the type-unstable closure (unstable because of the various function combinations) into a higher-order function (in this case maximum) creates a “function barrier” and offers some improvement:

function find_calc2(plusops,mulops)
    bestres = typemin(Float32)

    for myplusop in plusops
        for mymulop in mulops
            for myplus in range(-100f0, 100f0, length=100)
                for mymul in range(-5f0, 5f0, length=30)
                    mycalc_expr = get_calc_closure(myplus, myplusop, mymul, mymulop)
                    bestres = max(bestres, maximum(mycalc_expr, range(-10f0, 10f0, length=50)))
                end
            end
        end
    end
    return bestres
end

@time find_calc2(plusops,mulops) # compile everything
  0.342165 seconds (1.17 M allocations: 54.445 MiB, 1.22% gc time, 97.18% compilation time)
638.0f0
@time find_calc2(plusops,mulops) # everything compiled
  0.008671 seconds (135.00 k allocations: 2.884 MiB)
638.0f0
@time find_calc2([(v1,v2,v3) -> v1 + v2, (v1,v2,v3) -> v1 - v2, (v1,v2,v3) -> v2 - v1],[(v1,v2,v3,v4) -> v4 * v3, (v1,v2,v3,v4) ->  v4 / v3, (v1,v2,v3,v4) -> v3 / v3]) # the subfunctions are not compiled yet
  0.308171 seconds (895.48 k allocations: 40.554 MiB, 1.65% gc time, 97.09% compilation time)
638.0f0

To be honest I didn’t expect the last call to find_calc2 to be slower than the matching find_calc, but I’m not going to dig into that right now.

You can look into the FunctionWrappers.jl package to get slightly better performance for these arrays-of-functions (which otherwise require dynamic dispatch), but I’ll leave that to you and warn you that it’s somewhat advanced.
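A rough sketch of that idea, in case it helps (FunctionWrapper takes the return type and a tuple of argument types; the op lists mirror the ones above):

using FunctionWrappers: FunctionWrapper

# concretely typed callables with a fixed signature, so the vectors are homogeneous
const PlusOp = FunctionWrapper{Float32, Tuple{Float32,Float32,Float32}}
const MulOp  = FunctionWrapper{Float32, Tuple{Float32,Float32,Float32,Float32}}

plusops = PlusOp[PlusOp((v1,v2,v3) -> v1 + v2), PlusOp((v1,v2,v3) -> v1 - v2)]
mulops  = MulOp[MulOp((v1,v2,v3,v4) -> v4 * v3), MulOp((v1,v2,v3,v4) -> v4 / v3)]

plusops[1](1f0, 2f0, 3f0)  # calls through the fixed signature instead of dynamic dispatch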

I’m pushing you away from Expr because I know of no way you’ll get good performance with them. Also, evaluating Exprs like this is a good way to make mistakes that yield strange and exciting (and sometimes destructive) errors.

How would you generate arbitrary functions at run time?

When you wrote (v1,v2,v3) -> v1 + v2, that function was defined at compile time. It works because it’s constant.

The most important distinction in my case is generating many possible arbitrary functions like these at run time, which is why I got “98.26% compilation time” in my run.

Do you have suggestions on how to improve that?

Generating truly arbitrary functions, not functors? Then evaluating an Expr and compiling each one really is the way to do it. If you think about it, you’re not really asking to avoid repeated compilation; you’re asking to skip steps of the compilation. If you somehow copied and edited the LLVM IR, it would still have to be compiled to native code. It’s just not that feasible because of how obscured the code gets after all the compilation steps you’re trying to skip.

Let’s say there’s a particular section of the method you’re interested in editing. Maybe it really is as simple as changing a value x = f(0) to x = f(1). Thing is, that value 0 may not even exist by the time you reach the LLVM IR; the compiler is allowed to identify and evaluate some constants, say changing x = f(0) to x = 3+5im. To edit that compiled code, you’d have to evaluate f yourself separately and figure out what edit to make. And this is just a simple case; the compiler does a LOT of other things. Skipping those compilation steps just means you have to do them yourself to figure out what edits to make.

Here’s a longer but hopefully more representative example:

using RuntimeGeneratedFunctions

RuntimeGeneratedFunctions.init(@__MODULE__)

function get_calc(mysetup::Expr, myalgo::Expr)::Expr
    return :(
        function calc(myinput::Float32)::Float32
            # input always in v1
            v1 = myinput
            
            # init variables
            $mysetup

            # run calculation
            $myalgo

            # result always in v5
            res = v5
            return res
        end
    )
end

get_calc_func(calc_body) =
    @RuntimeGeneratedFunction(@__MODULE__, calc_body, opaque_closures = false)

function find_calc()
    bestres = typemin(Float32)
    bestalgo::Expr = :()

    mysetup = Vector{Expr}()
    for _ in 1:10
        expr_list = Vector{Expr}()
        # without v1, which is input
        for myvar in [:v2, :v3, :v4]
            initval = rand(Float32)
            ex = :($myvar = $initval)
            push!(expr_list, ex)
        end
        # v5 is output
        push!(expr_list, :(v5 = 0.0))
        initvar_expr = quote end
        initvar_expr.args = expr_list
        push!(mysetup, initvar_expr)
    end

    myalgo = Vector{Expr}()
    for _ in 1:100
        algo_inst_list = Vector{Expr}()

        algo_len = rand(3:10)

        for _ in 1:algo_len
            x = rand([:v1, :v2, :v3, :v4, :v5])
            y = rand([:v1, :v2, :v3, :v4, :v5])
            z = rand([:v1, :v2, :v3, :v4, :v5])
            op = rand([:-, :+, :*, :/])
            push!(algo_inst_list, :($x = $op($y, $z)))
        end
        algo_expr = quote end
        algo_expr.args = algo_inst_list
        push!(myalgo, algo_expr)
    end

    for setup_expr in mysetup
        for algo_expr in myalgo
            calc_body = get_calc(setup_expr, algo_expr)
            mycalc_expr = get_calc_func(calc_body)
            for myinput in range(-10f0, 10f0, length=50)
                res = mycalc_expr(myinput)
                if res > bestres
                    bestres = res
                    bestalgo = calc_body
                end
            end
        end
    end

    return bestalgo, bestres
end

@time @info find_calc()

How can I copy and edit the LLVM IR (or some other intermediate stage, like CodeInfo), but still compile to native code every time?
That would save some of the total compilation time, wouldn’t it?