StaticTools

Hi,

First: I am very much looking forward to StaticCompiler.jl on Windows, I will be putting it to good use!

I understand it’s early days, but I wanted to try out StaticTools.jl to get a feel for the “static dialect”. The code

using StaticTools, BenchmarkTools, StaticArrays

function foo(v)
    a = StackArray(ntuple(i->v,6), 6)
    return sum(a) 
end
function bar(v)
    a = SVector{5,Float64}(v,v,v,v,v)
    return a[1]+a[2]+a[3]+a[4]+a[5] 
end

@btime foo(2)
@btime bar(2)

yields

  10.210 ns (1 allocation: 64 bytes)
  0.800 ns (0 allocations: 0 bytes)

and I feel cheated: creating a StackArray allocates? On the heap?

What did I get wrong?

Your benchmark looks a bit odd: the arrays have different lengths (6 vs. 5), and you sum them in different ways (sum vs. adding the entries by index).

But let’s start with the StaticArrays.jl variant, bar: the sub-nanosecond runtime looks like constant folding, and indeed it is. See here:

julia> function bar(v)
    a = SVector{5,Float64}(v,v,v,v,v)
    return sum(a)
end
julia> @btime bar(2)
  0.978 ns (0 allocations: 0 bytes)
10.0

julia> @btime bar($2) # interpolation avoids constant-folding
  3.352 ns (0 allocations: 0 bytes)
10.0

Whether you call sum or add the entries by index makes no difference here. I am sure StaticArrays.jl makes certain that this call is fully inlined into code that simply adds up the array entries.
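To make the constant-folding point concrete, here is a minimal sketch. bar_tuple is a hypothetical tuple-based stand-in for bar, so that only BenchmarkTools is needed; the $(Ref(x))[] form is the idiom described in the BenchmarkTools manual for cases where even an interpolated literal might still be folded. I have left out timings on purpose, since they depend on the machine and the Julia version.

```julia
using BenchmarkTools

# Hypothetical stand-in for `bar`: five copies of `v` as Float64, then summed.
bar_tuple(v) = sum(ntuple(_ -> Float64(v), 5))

@btime bar_tuple(2)             # literal argument: the compiler may fold the whole call
@btime bar_tuple($2)            # interpolated: the value is passed in as an argument
@btime bar_tuple($(Ref(2))[])   # Ref plus dereference hides the value from the compiler
```

If the first timing is well below the other two, that is a strong hint that you measured a precomputed constant rather than the function itself.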

OK, let’s now look at StackArray (which I have no prior experience with).
There are no allocations when I do the summation directly:

julia> function foo_direct(v)
    a = StackArray(ntuple(i->v,5), 5) # changed size to 5 to match bar
    return a[1]+a[2]+a[3]+a[4]+a[5]
end
julia> @btime foo_direct($2) # interpolation does not make a difference here
  3.282 ns (0 allocations: 0 bytes)
10

However, when using sum, the allocation appears:

julia> function foo_sum(v)
           a = StackArray(ntuple(i->v,5), 5)
           return sum(a)
       end
foo_sum (generic function with 1 method)

julia> @btime foo_sum($2)
  7.970 ns (1 allocation: 48 bytes)
10

Inspecting the @code_llvm output, it seems that the code allocates an array, copies the values from the StackArray into it, and then calls a mapreduce function on that array. To me it looks like a specialized method for efficiently summing a StackArray is missing.
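As a rough illustration of "copy, then reduce" versus a direct loop, here is a Base-only sketch on plain tuples. It is not StaticTools code and not what sum on a StackArray literally does; sum_via_copy and sum_direct are hypothetical names chosen for this example.

```julia
# Copies the tuple into a temporary Vector before reducing
# (mimics the pattern seen in the @code_llvm output).
function sum_via_copy(t::NTuple{N,T}) where {N,T}
    buf = collect(t)          # temporary Vector{T}
    return sum(buf)
end

# Adds the elements in place, without any temporary array.
function sum_direct(t::NTuple{N,T}) where {N,T}
    s = zero(T)
    for i in 1:N
        s += t[i]
    end
    return s
end

t = ntuple(_ -> 2, 5)
sum_via_copy(t); sum_direct(t)        # run once so compilation is not measured

println(@allocated sum_via_copy(t))   # bytes allocated by the copying variant
println(@allocated sum_direct(t))     # bytes allocated by the direct loop
```

I would expect the copying variant to report a nonzero number and the direct loop to report zero, but the exact figures depend on the Julia version, so check on your own setup. A similar hand-written loop over a StackArray should avoid the temporary in the same way, as foo_direct already suggests.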

Oh, thanks, I never thought of that. It is sum that allocates, which actually makes sense.