I have a very performance-sensitive application that needs to activate a neural network on a per-row basis for a large number of rows.

Consider a network with 4 input values and 1 output value, with a few layers in between.

I can define this as 5 independent calculations to be done in order, each one optimised for a specific number of inputs (only optimising for 2 to 4 here):

```
# Returns a closure that reads its inputs from `array`, applies the
# weighted sum and activation, and stores the result in `array[outputCol, row]`.
function compileLayer(inputCols, weights, activationFunction, outputCol)
    if length(inputCols) == 2
        (array, row) -> array[outputCol, row] = activationFunction(array[inputCols[1], row]*weights[1] + array[inputCols[2], row]*weights[2])
    elseif length(inputCols) == 3
        (array, row) -> array[outputCol, row] = activationFunction(array[inputCols[1], row]*weights[1] + array[inputCols[2], row]*weights[2] + array[inputCols[3], row]*weights[3])
    elseif length(inputCols) == 4
        (array, row) -> array[outputCol, row] = activationFunction(array[inputCols[1], row]*weights[1] + array[inputCols[2], row]*weights[2] + array[inputCols[3], row]*weights[3] + array[inputCols[4], row]*weights[4])
    end
end
l1 = compileLayer([1,2,3], [1.0,2.0,3.0], sin, 5)
l2 = compileLayer([1,2,3,4], [1.0,2.0,3.0,0.5], sin, 6)
l3 = compileLayer([5,6], [1.0,2.0], sin, 7)
l4 = compileLayer([5,6,7], [1.0,2.0,1.5], identity, 8)
l5 = compileLayer([7,8], [1.0,2.0], sin, 9)
```
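As a quick sanity check that a compiled layer does what I expect, here's the 3-input branch applied to a tiny matrix (I've inlined that branch as `compile3` just so the snippet runs on its own; it's not part of the real code):

```
# Inlined copy of the 3-input branch of compileLayer, so this runs standalone
compile3(ic, w, f, oc) =
    (a, r) -> a[oc, r] = f(a[ic[1], r]*w[1] + a[ic[2], r]*w[2] + a[ic[3], r]*w[3])

a = zeros(9, 1)
a[1:3, 1] .= [0.1, 0.2, 0.3]
layer = compile3([1, 2, 3], [1.0, 2.0, 3.0], sin, 5)
layer(a, 1)
a[5, 1] ≈ sin(0.1*1.0 + 0.2*2.0 + 0.3*3.0)   # true
```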

And here's my input matrix (100,000×4) and the preallocated result matrix (9×100,000), which has a slot for each input value, each intermediate value, and the final output:

```
const inputs = rand(Float64, (100000, 4))
const result = zeros(Float64, (9, 100000))
result[1:4, :] = inputs'   # rows 1-4: inputs, rows 5-8: intermediates, row 9: output
```

Now I have two ways to execute the layers over the input. I can provide the layers as a vector to my executor function:

```
function runita!(array, layers)
    @inbounds for row in eachindex(array[1,:])
        for layer in layers
            layer(array, row)
        end
    end
end
```

Or I can specialise for a specific number of layers:

```
function runit!(array, l1, l2, l3, l4, l5)
    @inbounds for row in eachindex(array[1,:])
        l1(array, row)
        l2(array, row)
        l3(array, row)
        l4(array, row)
        l5(array, row)
    end
end
```

The benchmark results for the first (and more flexible) vector method are surprisingly slow and allocation-intensive:

```
@benchmark runita!(result, [l1, l2, l3, l4, l5])
memory estimate: 15.98 MiB <-- high memory
allocs estimate: 997458 <-- high allocations
--------------
minimum time: 22.993 ms (0.00% GC) <-- 3-4x slower
median time: 28.370 ms (0.00% GC)
mean time: 29.014 ms (3.92% GC)
maximum time: 52.257 ms (13.16% GC)
--------------
samples: 173
evals/sample: 1

@benchmark runit!(result, l1, l2, l3, l4, l5)
memory estimate: 781.33 KiB
allocs estimate: 2
--------------
minimum time: 7.111 ms (0.00% GC)
median time: 7.598 ms (0.00% GC)
mean time: 7.743 ms (0.55% GC)
maximum time: 14.210 ms (35.91% GC)
--------------
samples: 645
evals/sample: 1
```

I've tried profiling, but as I'm new to Julia it hasn't really helped me work out where the bottleneck is. I suspect it's something to do with the functions being boxed in the vector, but even using exactly the same function for all layers didn't help it specialise.
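To illustrate what I mean by the boxing suspicion, here's a minimal sketch (generic `f`/`g` closures standing in for my actual layers) showing that a `Vector` of distinct closures is only typed with the abstract element type `Function`, whereas a `Tuple` keeps each concrete closure type:

```
# Two closures defined separately get two distinct concrete types
f = (a, r) -> a[r] + 1.0
g = (a, r) -> a[r] * 2.0

# Collecting them in a Vector promotes the element type to the abstract
# type Function, so each layer(array, row) call is a dynamic dispatch
layers = [f, g]
eltype(layers)                     # Function (abstract)

# A Tuple remembers each element's concrete type
tlayers = (f, g)
isconcretetype(typeof(tlayers))    # true
```

Is this abstract element type what's causing the allocations, and is there a way to keep the flexible list-of-layers interface without it?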

Any support or ideas would be much appreciated!