Meta-programming an if-else statement of user-defined length

I am trying to optimize the way my package SymbolicRegression.jl (the pure-Julia version of PySR) calls user-passed operators.

The user passes a list of operators like so:

SymbolicRegression.Options(
    binary_operators=[plus, mult])

Inside this package, an equation is stored as a binary tree. Each node in the tree has an index which corresponds to an operator. For example, a node in the tree with index=1 implies the plus operator, and index=2 implies the mult operator (in this case).
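
Schematically, a node might look something like this (a simplified, hypothetical sketch, not the actual SymbolicRegression.jl type):

# Hypothetical node type: `op` indexes into the user's operator list
# (op == 1 means plus, op == 2 means mult in this example); leaf nodes
# would store data instead of children.
struct Node
    op::Int
    left::Union{Node, Nothing}
    right::Union{Node, Nothing}
end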

Without going into too many details, I evaluate operators on some data by calling a function and passing the index of the operator.

This function looks like this:

@inline function BINOP!(x::Array{Float32, 1}, y::Array{Float32, 1}, i::Int, clen::Int, options::Options)
    op = options.binops[i]
    @inbounds @simd for j=1:clen
        x[j] = op(x[j], y[j])
    end
end

So, it selects the corresponding operator from the user-defined list, and then calls it.

However, this way of selecting operators seems to be inefficient compared to having a pre-defined function like so:

@inline function BINOP!(x::Array{Float32, 1}, y::Array{Float32, 1}, i::Int, clen::Int)
    if i == 1
        @inbounds @simd for j=1:clen
            x[j] = plus(x[j], y[j])
        end
    else
        @inbounds @simd for j=1:clen
            x[j] = mult(x[j], y[j])
        end
    end
end

Something about having the operators fixed inside the function, which branches over them in an if statement, seems to make the compiler happy, and this way of doing it gets a speedup. (Note: the Options argument is an immutable struct, so that argument should be fixed for the whole computation.)

The Python front-end for my package meta-programs this if-statement by literally printing a string to a file, which is why this Python frontend/Julia backend seems to be faster than the pure-Julia backend.

So, I’ve tried to do the same thing in Julia but have not had luck. I am trying to meta-program this if-statement from inside Julia at runtime, since the user is free to pass any list of functions they desire.

Here is what I have so far:

function constructBinaryOpEvaluator(binary_operators)
    if length(binary_operators) == 0
        return """(x::Array{Float32, 1}, y::Array{Float32, 1}, i::Int, clen::Int) -> begin
    nothing
end"""
    end

    branch_operators = ""  # initialize before the loop so it persists across iterations
    for i=1:length(binary_operators)
        if i == 1
            branch_operators = """
                if i == 1
                    @inbounds @simd for j=1:clen
                        x[j] = $(string(binary_operators[1]))(x[j], y[j])
                    end
            """
        elseif i < length(binary_operators)
            branch_operators *= """
                elseif i == $i
                    @inbounds @simd for j=1:clen
                        x[j] = $(string(binary_operators[i]))(x[j], y[j])
                    end
            """
        else
            branch_operators *= """
                else
                    @inbounds @simd for j=1:clen
                        x[j] = $(string(binary_operators[i]))(x[j], y[j])
                    end
            """
        end
    end

    branch_operators *= """
        end
    """

    return """(x::Array{Float32, 1}, y::Array{Float32, 1}, i::Int, clen::Int) -> begin
$branch_operators
end"""
end

This returns the correct function string that I am looking for (I tried to do this with quote...end but I couldn’t seem to define incomplete expressions to be concatenated). However, when I try to evaluate it as a closure function with the following code:

    plus(x, y) = x + y
    mult(x, y) = x * y
    binary_operators = [plus, mult]
    func_string = constructBinaryOpEvaluator(binary_operators)
    func = eval(Meta.parse(func_string))
    x = randn(Float32, 5)
    println(x)
    func(x, x, 1, 5)
    println(x)

I get this as an error:

Float32[0.53156346, 0.023917956, 0.15273008, -0.33916265, 0.7506737]
ERROR: LoadError: MethodError: no method matching (::var"#1#2")(::Array{Float32,1}, ::Array{Float32,1}, ::Int64, ::Int64)
The applicable method may be too new: running in world age 27803, while current world is 27804.
Closest candidates are:
  #1(::Array{Float32,1}, ::Array{Float32,1}, ::Int64, ::Int64) at none:1 (method too new to be called from this world context.)

So I’m not sure how to move forward; I don’t see a reason why this shouldn’t work, since I’m running eval on code that I know would have worked if I had just typed it out literally.
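
For reference, this is Julia's "world age" restriction: a method defined by eval during a call is newer than the world in which the calling code was compiled, so it cannot be reached from that code. A minimal workaround, assuming the setup above, is to call the freshly built function through Base.invokelatest:

func = eval(Meta.parse(func_string))
# invokelatest forces dispatch in the newest world, so the method
# just created by eval is callable from here:
Base.invokelatest(func, x, x, 1, 5)

And regarding the quote...end difficulty mentioned above: one sketch (assuming a non-empty operator list) builds the same if/elseif chain as nested Expr(:if, condition, then, else) nodes, working from the last branch inward, instead of concatenating strings:

function constructBinaryOpExpr(binary_operators)
    # Innermost piece: the final `else` loop over the last operator.
    ex = :(@inbounds @simd for j = 1:clen
        x[j] = $(binary_operators[end])(x[j], y[j])
    end)
    # Wrap each earlier operator as `if i == k ... else <ex> end`;
    # nested else-ifs print back as an elseif chain.
    for k = length(binary_operators)-1:-1:1
        body = :(@inbounds @simd for j = 1:clen
            x[j] = $(binary_operators[k])(x[j], y[j])
        end)
        ex = Expr(:if, :(i == $k), body, ex)
    end
    return :((x, y, i, clen) -> $ex)
end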

Any idea what to do?
Thanks,
Miles

In your initial example, op is an unknown function (since it depends on i), and as such its return type can’t be inferred. So Julia has to box everything, and performance degrades. I think a function barrier should help here:

function BINOP!(x::Array{Float32, 1}, y::Array{Float32, 1}, i::Int, clen::Int, options::Options)
    op = options.binops[i]
    BINOP!(op, x, y, clen, options)
end

function BINOP!(op::F, x::Array{Float32, 1}, y::Array{Float32, 1}, clen::Int, options::Options) where F
    @inbounds @simd for j=1:clen
        x[j] = op(x[j], y[j])
    end
end

I am not sure whether where F is required here; I suppose it can be avoided.

Anyway, it’s worth verifying the code with @code_warntype.
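
For example, a sketch (assuming the arrays, clen, and options from the question are in scope):

using InteractiveUtils  # exports @code_warntype outside the REPL

# If `op` or the return value is reported as `Any` (printed in red in
# the REPL), the call is type-unstable and will dispatch dynamically.
@code_warntype BINOP!(x, y, 1, clen, options)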

Thanks for the quick response! This is great, it brings the runtime down quite a bit. However, it is still 10x slower than the explicit branching for some reason. Any other ideas?

Here are the current numbers:

Using a list: 4884 ns
Your function barrier: 103 ns
if-else statement: 10 ns

The function barrier is definitely the right approach. I’m not sure of the exact benchmark you’re running, but here’s an MWE that may be equivalent to what you need and shows that it works:

binops = [+, -, *]

function BINOP!(x, y, i, binops)
    op = binops[i]
    broadcast!(op, x, x, y)
end

clen = 2^20
x = rand(clen)
y = rand(clen)

@btime BINOP!($x, $y, 1, $binops)   # 1.191 ms (0 allocations: 0 bytes)
@btime $x .= $x .+ $y   # 1.208 ms (0 allocations: 0 bytes)

In this case, I’m just using broadcast as the function barrier, which, FYI, already does @simd and @inbounds for you, so you may not need to write out the loop by hand. If you need to do it on only a specific part of the array, I think @views broadcast!(op, x[1:clen], x[1:clen], y[1:clen]) should work and not lose any performance (I haven’t checked explicitly, though).

Can you provide an MWE?

10 ns is a very small value; it looks like the arrays are very small and the overhead from the barrier shadows the actual calculations. Is that always the case? For large enough arrays the difference should be negligible, so further action depends on the use case.
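
For instance, a hypothetical check of how the overhead amortizes with size (assuming the barrier BINOP! and an options value from the earlier post):

using BenchmarkTools

# As clen grows, the fixed dispatch cost is amortized, and the barrier
# version should approach the hard-coded if-else version.
for clen in (8, 64, 512, 4096)
    x = randn(Float32, clen)
    y = randn(Float32, clen)
    @btime BINOP!($x, $y, 1, $clen, $options)
end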

Thanks, this is quite helpful! The broadcast seems to help as well; the list version now runs just as fast. I’ll use a slightly longer array and put my benchmark below.

I note that for this sort of algorithm the arrays are very small; even ~100 elements is typical. The evaluation is sequential, so unfortunately you can’t really batch over all of them at once.

Benchmark:

using BenchmarkTools

function barrierBINOP!(x::Array{Float32, 1}, y::Array{Float32, 1}, i::Int, clen::Int, binops::Array{Function, 1})
    op = binops[i]
    barrierBINOP!(op, x, y, clen, binops)
end

function barrierBINOP!(op::F, x::Array{Float32, 1}, y::Array{Float32, 1}, clen::Int, binops::Array{Function, 1}) where F
    broadcast!(op, x, x, y)
end

function listBINOP!(x::Array{Float32, 1}, y::Array{Float32, 1}, i::Int, clen::Int, binops::Array{Function, 1})
    op = binops[i]
    broadcast!(op, x, x, y)
end

function ifBINOP!(x::Array{Float32, 1}, y::Array{Float32, 1}, i::Int, clen::Int, binops::Array{Function, 1})
    if i == 1
        broadcast!(plus, x, x, y)
    else
        broadcast!(mult, x, x, y)
    end
end

function plus(x::Float32, y::Float32)::Float32
    x + y
end
function mult(x::Float32, y::Float32)::Float32
    x * y
end

binary_operators=[plus, mult]
x = randn(Float32, 50)

println("Barrier:")
println(@benchmark barrierBINOP!(x, x, 1, 50, binary_operators))
println("List:")
println(@benchmark listBINOP!(x, x, 1, 50, binary_operators))
println("If statement:")
println(@benchmark ifBINOP!(x, x, 1, 50, binary_operators))

Output:

Barrier:
Trial(53.581 ns)
List:
Trial(51.089 ns)
If statement:
Trial(33.527 ns)

So it looks like this is pretty good! Still not quite at the if-else statement speed though.

Is there any way I can get the meta-programmed version to work or is that too difficult?

When I tried to reproduce the result, I think I found another possible source of slowdown: the Options type.

With the overall setup

function BINOP!(x::Array{Float32, 1}, y::Array{Float32, 1}, i::Int, clen::Int, options::Options)
    op = options.binops[i]
    BINOP!(op, x, y, clen, options)
end

function BINOP!(op::F, x::Array{Float32, 1}, y::Array{Float32, 1}, clen::Int, options::Options) where F
    @inbounds @simd for j=1:clen
        x[j] = op(x[j], y[j])
    end
end

function BINOP2!(x, y, i, clen, options)
    if i == 1
        @inbounds @simd for j=1:clen
            x[j] = (+)(x[j], y[j])
        end
    end
end

x = rand(Float32, 100)
y = rand(Float32, 100)
clen = length(x)

julia> @btime BINOP2!($x, $y, 1, $clen, $options)
  7.923 ns (0 allocations: 0 bytes)

I have tried three versions:

1. Untyped field:

struct Options
    binops
end

options = Options([+, -, *])

julia> @btime BINOP!($x, $y, 1, $clen, $options)
  44.011 ns (1 allocation: 16 bytes)

2. Parametric struct with a Vector of functions:

struct Options{T}
    binops::T
end

options = Options([+, -, *])

julia> @btime BINOP!($x, $y, 1, $clen, $options)
  31.602 ns (1 allocation: 16 bytes)

3. Parametric struct with a Tuple of functions:

struct Options{T}
    binops::T
end

options = Options((+, -, *))

julia> @btime BINOP!($x, $y, 1, $clen, $options)
  23.229 ns (0 allocations: 0 bytes)

So, of all these versions, the fully typed tuple version of the operators is the fastest. I think the difference, 23 - 8 = 15 ns, is the price one has to pay to figure out op dynamically, and it can’t be improved further.

If this performance is not enough, then the closest fit for this issue is union splitting: write a macro along the lines of ManualDispatch.jl or this snippet. But for that to work, all operators should be defined at compile time.
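
For reference, here is a minimal sketch of the manual union-splitting pattern (not the actual ManualDispatch.jl macro, and the hard-coded operators are just examples):

# Branch on the identity of `op` with ===, so the compiler sees a
# concrete, compile-time-known function in each branch; anything not
# in the hard-coded list falls back to one dynamic dispatch.
function splitBINOP!(x, y, op)
    if op === (+)
        broadcast!(+, x, x, y)
    elseif op === (*)
        broadcast!(*, x, x, y)
    else
        broadcast!(op, x, x, y)  # dynamic fallback
    end
    return x
end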

Okay, thanks for your help! This was really useful. I will also take a look at that package you mentioned; indeed, the operators are often defined at compile time (essentially what the Python meta-programmed version does).

One more thing, maybe it’ll help. It looks like the central (and most computationally heavy) function of your package is RunSR, which in turn has an niterations option. If it is possible to rewrite your algorithm in such a way that the BINOP! function is called multiple times for the same i, then the issue of dynamic dispatch will be completely gone. I mean, if it is possible to do something like

function RunSR(x, y, niterations, options)
# ... some preliminary calculations
  BINOP!(x, y, niterations, i, options)
# ... some further calculations
end

function BINOP!(x::Array, y, niterations, i, options)
  op = options.binops[i]
  BINOP!(op, x, y, niterations)
end

function BINOP!(op, x, y, niterations)
   for k = 1:niterations
        broadcast!(op, x, x, y)
        # ... do some other calculations
   end
end

It’s just an idea, but I hope it is clear: push dynamic dispatch as early in your calculations as possible, and proceed to work with well-defined types. This way you can avoid metaprogramming completely.

If this level of overhead on an inner loop isn’t good enough, fundamentally what you need to do is push the information about the operators into the type system on a more-outer loop, where you don’t care about the overhead. E.g. here’s a version where the selected operator is fully known to the compiler, because binops is a Tuple{typeof(+),typeof(-),typeof(*)} rather than a Vector{Function}, and because the index is now inside a Val:

binops = (+, -, *)

function listBINOP!(x, y, ::Val{i}, clen, binops) where {i}
    op = binops[i]
    broadcast!(op, x, x, y)
end

@benchmark listBINOP!($x, $x, Val(1), 50, $binary_operators)

This is exactly as fast as the if-version on my machine; I suspect all of the op-selection machinery gets compiled away entirely. More or less, you’ll need to do something like this on a more-outer loop. For this problem, it sounds like anything you could do with metaprogramming you could do more cleanly by just giving the Julia compiler the type information it needs, so I’d definitely recommend that.
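
Note that constructing Val(i) from a runtime i costs one dynamic dispatch at that call site, so the lifting has to happen at an outer level where that one-time cost is acceptable. A hypothetical sketch, reusing listBINOP! and the three-operator binops from above:

# Outer-level dispatcher: each branch passes a constant Val, so each
# kernel call below is fully specialized; the branch runs once per
# node rather than inside the hot loop.
function dispatchBINOP!(x, y, i, clen, binops)
    if i == 1
        listBINOP!(x, y, Val(1), clen, binops)
    elseif i == 2
        listBINOP!(x, y, Val(2), clen, binops)
    else
        listBINOP!(x, y, Val(3), clen, binops)
    end
end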

Unfortunately the loop is a genetic algorithm, and everything needs to be run sequentially. New equations are generated very quickly, and it would be hard to know which operators an equation contains until actually traversing it inside the iteration loop. But maybe there are more ways to batch the arrays together…

I also tried using an LRUCache on these, since similar parts of an equation are often computed multiple times, but it was difficult to get working.

The other bottleneck has to do with allocating the temporary arrays during evaluation; I think there is a better traversal order of the binary tree that would give more memory efficiency, but I haven’t gotten it working yet.

Thanks, this looks like a really nice way to do it!! This is my first time seeing ::Val{...} - that looks like a perfect solution to the branching issue.

Thanks again, both of you, for the amazing help.
Cheers,
Miles

Strange. Theoretically you can parallelize all the crossovers and fitness evaluations in the same generation of a genetic algorithm.

Misspoke, sorry. Did not mean literally everything.

The main internal parallelization is over populations. Each process has its own population of equations, which evolves independently. These are mutated with regularized evolution, and every so many generations, the processes asynchronously communicate their best population members to a server, which then “migrates” the fittest equations between populations.

I also have an option to turn on multi-threading within each population of equations, but this doesn’t add much extra speed, and produces a slightly different algorithm since equations are no longer randomly sampled.

Theoretically, turning on multi-threading within each population should not change the algorithm, unless you mean that you would need multiple randomness streams (to avoid threads locking the global RNG all the time); but even so, the results should be reproducible if the code is correctly written (a seed is passed, and its stream is used to initialize any other streams needed).
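
For example, a minimal sketch of seeded per-thread streams (the seed and chunk count here are hypothetical placeholders):

using Random

seed = 1234                       # hypothetical: passed in by the user
nchunks = Threads.nthreads()      # hypothetical chunking

# One RNG per thread, all derived from the single seed, so runs are
# reproducible and threads never contend for the global RNG.
rngs = [MersenneTwister(hash((seed, t))) for t in 1:Threads.nthreads()]

Threads.@threads for c in 1:nchunks
    rng = rngs[Threads.threadid()]
    # ... sample individuals with rand(rng, ...) instead of rand(...)
end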

The only reason I can see for it not adding much extra speed is if you already have a process for each core (i.e., each population inhabits a core). Otherwise, the crossovers and fitness evaluations should all be independent, and therefore the problem should be embarrassingly parallel.

I will have to get into some details to explain why this changes the algorithm.

The way regularized evolution works is as follows (a code sketch follows the list):

  1. You have m independent populations of n individuals each. Consider one population at a time:
  2. At every new generation, you sample ns << n individuals (e.g., ns=10; n=100).
  3. You select the best individual (highest fitness score) of those ns.
  4. You duplicate that individual, and mutate its duplicated copy.
  5. Take the mutant individual, and replace the oldest individual of the entire n-size population.
  6. Repeat steps 2-5 for some number of generations.
  7. Repeat steps 1-6 for m different populations, completely independently.
  8. Collect the best individuals from each population into a group.
  9. Randomly sample from this group and “migrate” those individuals into each of the m populations randomly, replacing random individuals.
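
A minimal sketch of one iteration of steps 2-5 (with hypothetical fitness and mutate functions, and the population stored oldest-first; not the actual implementation):

# One regularized-evolution step: sample ns individuals with
# replacement, copy and mutate the fittest of them, and replace the
# oldest member of the population with the mutant.
function evolve_step!(pop::Vector, ns::Int, fitness::Function, mutate::Function)
    idx = rand(1:length(pop), ns)                       # step 2
    best = idx[argmax([fitness(pop[i]) for i in idx])]  # step 3
    mutant = mutate(deepcopy(pop[best]))                # step 4
    popfirst!(pop)                                      # step 5: drop the oldest
    push!(pop, mutant)                                  # ...and append the mutant
    return pop
end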

Now, step 7 is very easy to parallelize. I do this with multi-processing so that I can split it over clusters.

I also have parallelism for steps 2-5 with multithreading within each process. However, parallelizing over the equations means (I think) that you change the algorithm.

The way I parallelize is by shuffling the individuals, splitting them into n/ns chunks, and evaluating each chunk in parallel with multithreading. Then I take the mutants and replace the oldest n/ns equations.

But notice this is a different algorithm. It seems to work, but the sampling method has changed: you go from sampling with replacement to sampling without replacement.

Does this make sense?

Yes, I do understand, and yes, if you are going to follow a very strict variant of some genetic algorithm, maybe parallelism is not easy to implement, or even possible.

However, I believe it is possible to parallelize steps 2-5 in a way that is faithful to the algorithm and maybe faster (depending on how many cores you have).

One way is to be lazy. Sample the first group from a list of numbers between 1 and 100 (the indexes of the population in age order). Start a thread to do steps 3-5; before it returns, start a new thread to do the same as the old thread, with one difference: if it samples number 1 (the newest member), it needs to wait for the first thread. Then start a third thread, but if it needs number 2, it has to wait for the first thread, and if it needs number 1, it has to wait for the second thread, and so on.

The other way is to be eager. If you would do steps 2-5 for x generations, then do the x samples immediately (maybe in parallel), but over indexes, as pointed out above. Then solve a small makespan problem: if one sample needs a previous sample to be done (because it needs to sample a very fresh member of the population), then the sample it depends on must be done as soon as possible to allow the other to run; and if that one depends on a yet earlier one, put that one first, recursively. Basically, assemble the dependence graph between the samples and, at each moment, run the largest number of samples that you can (i.e., all that do not depend on a previous one, prioritizing the ones that let you run more samples afterwards).

Ah, I see what you mean. That is a really nice idea!! I didn’t think about this way of doing things at all; I like that approach.

For reference, here is where steps 2-5 occur in the code: SymbolicRegression.jl/RegularizedEvolution.jl at ea790b019ea46022a9cdfd64b06ee653bd19be5e · MilesCranmer/SymbolicRegression.jl · GitHub.

options.fast_cycle uses the simple multithreaded version, splitting up a population into chunks. E.g., shuffle!(pop.members) shuffles the members of the population before splitting them into independent groups, which then get mutated.
