Executing array of functions much slower than executing functions in series

I have a very performance sensitive application that needs to activate a neural network on a per row basis for a large number of rows.

Consider a network with 4 input values and 1 output value, with a few layers in between:

[figure: network diagram with 4 input values, a few hidden layers, and 1 output value]

I can define this as 5 independent calculations to be done in order, each one optimised for a specific number of inputs (only optimising for 2 to 4 here):

function compileLayer(inputCols, weights, activationFunction, outputCol)
    if length(inputCols) == 2
        (array, row) -> array[outputCol, row] = activationFunction(array[inputCols[1], row]*weights[1] + array[inputCols[2], row]*weights[2])
    elseif length(inputCols) == 3
        (array, row) -> array[outputCol, row] = activationFunction(array[inputCols[1], row]*weights[1] + array[inputCols[2], row]*weights[2] + array[inputCols[3], row]*weights[3])
    elseif length(inputCols) == 4
        (array, row) -> array[outputCol, row] = activationFunction(array[inputCols[1], row]*weights[1] + array[inputCols[2], row]*weights[2] + array[inputCols[3], row]*weights[3] + array[inputCols[4], row]*weights[4])
    end
end

l1 = compileLayer([1,2,3], [1.0,2.0,3.0], sin, 5)
l2 = compileLayer([1,2,3,4], [1.0,2.0,3.0,0.5], sin, 6)
l3 = compileLayer([5,6], [1.0,2.0], sin, 7)
l4 = compileLayer([5,6,7], [1.0,2.0,1.5], identity, 8)
l5 = compileLayer([7,8], [1.0,2.0], sin, 9)

And here’s my input matrix (100,000x4) and the preallocated output matrix (9x100,000) which contains a slot for each input value, each intermediate value, and the final output:

const inputs = rand(Float64, (100000, 4))
const result = zeros(Float64, (9, 100000))
result[1:4,:] = inputs'

Now I have two ways I can execute the layers for the input. I can provide the layers as a list to my executor function:

function runita!(array, layers)
    @inbounds for row in eachindex(array[1,:])
        for layer in layers
            layer(array, row)
        end
    end
end

Or I can specialise for a specific number of layers:

function runit!(array, l1, l2, l3, l4, l5)
    @inbounds for row in eachindex(array[1,:])
        l1(array, row)
        l2(array, row)
        l3(array, row)
        l4(array, row)
        l5(array, row)
    end
end

The benchmark results are surprisingly slow and allocation intensive for the first (and more flexible) array method:

@benchmark runita!(result, [l1, l2, l3, l4, l5])

  memory estimate:  15.98 MiB                       <-- high memory
  allocs estimate:  997458                          <-- high allocations
  --------------
  minimum time:     22.993 ms (0.00% GC)            <-- 3-4x slower
  median time:      28.370 ms (0.00% GC)
  mean time:        29.014 ms (3.92% GC)
  maximum time:     52.257 ms (13.16% GC)
  --------------
  samples:          173
  evals/sample:     1

@benchmark runit!(result, l1, l2, l3, l4, l5)

  memory estimate:  781.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     7.111 ms (0.00% GC)
  median time:      7.598 ms (0.00% GC)
  mean time:        7.743 ms (0.55% GC)
  maximum time:     14.210 ms (35.91% GC)
  --------------
  samples:          645
  evals/sample:     1

I’ve tried profiling, but being new to Julia it hasn’t really helped me work out where the bottleneck is. I suspect it’s something to do with the functions being boxed in the array, but even using exactly the same function in all layers didn’t help it specialise.

Any support or ideas would be much appreciated!

you might want to try to benchmark in a more standard way:

ls =  [l1, l2, l3, l4, l5]
@benchmark runita!($result, $ls)

@benchmark runit!($result, $l1, $l2, $l3, $l4, $l5)

you might want to look at @code_llvm to see whether the loop gets unrolled or not.

Thanks, I’ve just tried that. Made no difference to results.

The source of the type instability is the fact that you can enter into any of those if branches and come out with a different function (and each function is a different type). You can make inputCols a StaticArray and dispatch on length instead of having a single function with branches.
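A Base-only sketch of that suggestion, using an NTuple in place of a StaticArray so that the length is part of the type (the method names mirror the original compileLayer; only the 2- and 3-input methods are shown):

```julia
# Dispatch on the statically-known length instead of branching at run
# time: each method is its own, fully-inferable return type.
function compileLayer(inputCols::NTuple{2,Int}, weights, f, outputCol)
    (a, r) -> a[outputCol, r] = f(a[inputCols[1], r]*weights[1] +
                                  a[inputCols[2], r]*weights[2])
end

function compileLayer(inputCols::NTuple{3,Int}, weights, f, outputCol)
    (a, r) -> a[outputCol, r] = f(a[inputCols[1], r]*weights[1] +
                                  a[inputCols[2], r]*weights[2] +
                                  a[inputCols[3], r]*weights[3])
end

l = compileLayer((1, 2), (1.0, 2.0), sin, 3)
a = zeros(3, 1)
a[1, 1] = 0.5; a[2, 1] = 0.25
l(a, 1)   # writes sin(0.5*1.0 + 0.25*2.0) into a[3, 1]
```

With StaticArrays.jl you would instead annotate `inputCols::SVector{2}` etc.; the idea is the same.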

The network should be a Tuple rather than a Vector. Each of the lᵢ is inherently a different type, so storing them in a Vector widens their element type to the abstract Function, which is not very specific. Then it shouldn’t matter if you use a for loop.
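The type difference is easy to see in the REPL:

```julia
# A Vector of distinct functions widens its element type to the abstract
# `Function`, while a Tuple keeps each element's concrete type.
fs_vec = [sin, cos]
fs_tup = (sin, cos)

typeof(fs_vec)   # Vector{Function}
typeof(fs_tup)   # Tuple{typeof(sin), typeof(cos)}
```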

edit: my previous benchmark was wrong, as my function wasn’t doing the same thing. However, you can unroll the loop using Base.@nexprs to get your faster benchmark with more flexible code

julia> @generated function runitq!(array, layers...)
           N = length(layers)
           quote
               for r in axes(array, 2)
                   Base.@nexprs $N i -> (layers[i])(array, r)
               end
           end
       end
runitq! (generic function with 1 method)

julia> @btime runitq!($result, $l1, $l2, $l3, $l4, $l5)
  8.480 ms (0 allocations: 0 bytes)

Won’t that only be true up to the union-splitting limit?

Definitely worth trying the tuple approach on its own, but if that’s not sufficient here’s the old-school approach:

function runita!(array, layers)
    @inbounds for row in axes(array, 2)
        mapeach(layers, array, row)
    end
end

@inline mapeach(fs::Fs, args...) where Fs = (fs[1](args...); mapeach(Base.tail(fs), args...))
mapeach(::Tuple{}, args...) = nothing

It works by ensuring a separate call site for each entry in layers, which allows Julia to perform the method lookup at compile time rather than run time: it peels off the first function and calls it, then recursively calls itself on the tuple of remaining layers.

The seemingly-unnecessary Fs type parameter forces Julia to specialize mapeach even when its heuristics would otherwise tell it not to. Don’t use this with very long tuples, though.
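A minimal sketch of that trick in isolation (apply_spec and apply_plain are invented names for illustration):

```julia
# `where F` introduces a type parameter for the function argument, which
# forces Julia to compile one specialization per concrete function type.
# A plain `::Function` annotation leaves that choice to the compiler's
# heuristics, which may decide not to specialize.
apply_spec(f::F, x) where F = f(x)
apply_plain(f::Function, x) = f(x)

apply_spec(sin, 1.0)   # a compiled method just for typeof(sin)
```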

This is the same as using map, no? At least I’m seeing the same timing (although maybe mapeach ensures specialization for larger tuples?). Notably, foreach doesn’t seem to specialize the way map does on tuples.

function runita!(array, layers)
    @inbounds for row in axes(array, 2)
        mapeach(layers, array, row)
    end
end

function runita2!(array, layers)
    @inbounds for row in axes(array, 2)
        map(l->l(array,row), layers)
    end
end

function runita3!(array, layers)
    @inbounds for row in axes(array, 2)
        foreach(l->l(array,row), layers)
    end
end
julia> begin 
           @btime runita!($result, $layers)    # mapeach
           @btime runita2!($result, $layers)   # map
           @btime runita3!($result, $layers)   # foreach
       end
  8.593 ms (0 allocations: 0 bytes)
  8.867 ms (0 allocations: 0 bytes)
  40.899 ms (1300000 allocations: 73.24 MiB)

Yes, this example is the same as map. I still think it’s worthwhile knowing how you implement these things “from scratch” in case you need to change behavior. For example, if you wanted to use the output of the previous call as an input to the next.
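For example, a hypothetical folding variant (chaincalls is an invented name) that passes each call’s result into the next:

```julia
# Same recursive peel-off as mapeach, but the return value of each
# function becomes the input to the next, and the final value is returned.
@inline chaincalls(fs::Fs, x) where Fs = chaincalls(Base.tail(fs), fs[1](x))
chaincalls(::Tuple{}, x) = x

chaincalls((x -> x + 1, x -> 2x), 3)   # (3 + 1) * 2 == 8
```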

Recursive methods on tuples are so useful for type stability in julia. But people seem to reach for vectors and broadcast instead. Maybe this needs some more awareness.

It is a very useful, but somewhat advanced technique. I was thinking about making a PR to the manual with some examples of the Lispy recursion techniques, but then my concerns are that

  1. people who don’t really need it would think that this is something to use on a regular basis (similarly to metaprogramming),

  2. it is very easy to overwhelm the compiler by overdoing this.

Thank you all, this was tremendously helpful. Using mapeach has brought the speed in line with the specialised version, as you’ve already demonstrated. What a fantastic community!

It does lead me to a quick follow-up so I can understand this a little better. I think one of the challenges for someone new to Julia is recognising what can be known at compile time vs run time, given the JIT nature of compilation. Since this is a huge source of performance gains (and pains), ways to build intuition around it are useful.

So here’s my followup.

In my use case, the layers themselves are built up by a stochastic process that chooses different sizes, connections and activation functions. Here’s a runnable example (I’m sure plenty of this isn’t idiomatic; it’s literally day 2 on Julia for me):

using Random

sigmoid(x) = 1 / (1 + exp(-x))  # not in Base, so defined here
relu(x) = max(x, zero(x))       # likewise

funcs = [sin, tanh, cos, identity, sigmoid, relu]

function generate_layer(potential_funcs, potential_inputs)
    size = rand([2,3,4])
    inputs = rand(potential_inputs, size)
    weights = rand(Float64,length(inputs))
    func = rand(potential_funcs)
    inputs, weights, func
end

function generate_layers(potential_funcs, starting_inputs, numoflayers)
    inputs = starting_inputs
    layers = []
    for i in Base.OneTo(numoflayers)
        generated = generate_layer(potential_funcs, inputs)
        output = length(inputs) + 1
        push!(inputs, output)
        push!(layers, compileLayer(generated[1], generated[2], generated[3], output))
    end
    return tuple(layers...)
end

Generating layers itself is fast:

julia> @btime generate_layers($funcs, [1,2,3,4], rand(Base.OneTo(5)))
  899.900 ns (9 allocations: 747 bytes)

But executing them produces interesting results:

julia> @benchmark begin
           layer = generate_layers($funcs, [1,2,3,4], rand(Base.OneTo(5)))
           runita!($result, layer)
       end

BenchmarkTools.Trial: 
  memory estimate:  688 bytes                   <-- doesn't seem to be dynamically dispatching anymore
  allocs estimate:  9
  --------------
  minimum time:     667.399 μs (0.00% GC)       <-- very fast minimum time (small layer?)
  median time:      30.591 ms (0.00% GC)        <-- why so high?
  mean time:        25.487 ms (0.64% GC)
  maximum time:     73.768 ms (0.00% GC)        <-- explained by JIT compilation?
  --------------
  samples:          197
  evals/sample:     1

My thinking is that the small number of total samples means compilation of each new combination of tuple length and chosen function in a given position ‘dominates’ the run. Is this the correct way to interpret it? That Julia is essentially JIT compiling a new specialisation for each distinct size and ordering of the one ‘unknown’ type (the activation function) in the call chain? If so, am I at risk of creating too large a dispatch table if I have too many layers and too many potential functions?

FWIW, if I increase samples and evals per sample I get lower numbers and more stability, but even after 2 minutes of running it’s still quite a bit slower than the previous versions:

  memory estimate:  1014 bytes
  allocs estimate:  12
  --------------
  minimum time:     3.199 ms (0.00% GC)
  median time:      19.485 ms (0.00% GC)
  mean time:        19.903 ms (0.72% GC)
  maximum time:     38.494 ms (0.00% GC)
  --------------
  samples:          603
  evals/sample:     10