Static array over 1024 elements is slow to compile?

I am trying to improve the compile times I see when using the StaticArrays package. I have made a fork of the package and changed its broadcasting so that the generated functions emit rolled-up loops instead of the previous fully unrolled expressions:

# Old generation
result[1] = a[1][1, 1] + a[2][1, 1]
result[2] = a[1][2, 1] + a[2][2, 1]
result[3] = a[1][3, 1] + a[2][3, 1]
result[4] = a[1][1, 2] + a[2][1, 2]
result[5] = a[1][2, 2] + a[2][2, 2]
result[6] = a[1][3, 2] + a[2][3, 2]
# New generation
for idx2 = 1:2
  for idx1 = 1:3
    result[idx1, idx2] = a[1][idx1, idx2] + a[2][idx1, idx2]
  end
end

I am then using this code to test the compile times:

function time_array(dims::Type)
    # e.g. dims = Tuple{32,32} gives SArray{Tuple{32,32}, Float32, 2, 1024}
    ArrayType = SArray{dims, Float32, length(dims.parameters), reduce(*, dims.parameters)}

    fn = function()
        similar(ArrayType) .* similar(ArrayType)
    end

    println(dims)
    # time an eval'd call so the measurement includes compiling the broadcast
    @time eval(quote ($fn)() end)
end

time_array(Tuple{1024})

With this initial code I am getting the timing improvements I was hoping for: this first set of dimensions results in a compile time of about 0.5 seconds, as opposed to roughly six seconds before my changes. However, this stops being the case once the array has more than 1024 elements:

Tuple{1023}
  0.381930 seconds (125.93 k allocations: 7.079 MiB)

Tuple{1024}
  0.373275 seconds (125.92 k allocations: 7.078 MiB)

Tuple{1025}
  5.717773 seconds (125.94 k allocations: 7.080 MiB, 0.09% gc time)

Tuple{32,32}
  0.554469 seconds (508.30 k allocations: 26.966 MiB)

Tuple{32,33}
  5.741758 seconds (186.25 k allocations: 10.165 MiB, 0.09% gc time)

Tuple{2,3,4,5}
  0.269990 seconds (639.58 k allocations: 34.249 MiB)

What might be causing this large discontinuity?

For reference, this is the function I modified in the StaticArrays package:

@generated function _broadcast(f, ::Size{newsize}, s::Tuple{Vararg{Size}}, a...) where newsize
    first_staticarray = a[findfirst(ai -> ai <: Union{StaticArray,Transpose{<:Any,<:StaticArray},Adjoint{<:Any,<:StaticArray}}, a)]

    if prod(newsize) == 0
        # Use inference to get eltype in empty case (see also comments in _map)
        eltys = [:(eltype(a[$i])) for i ∈ 1:length(a)]
        return quote
            @_inline_meta
            T = Core.Compiler.return_type(f, Tuple{$(eltys...)})
            @inbounds return similar_type($first_staticarray, T, Size(newsize))()
        end
    end

    sizes = [sz.parameters[1] for sz ∈ s.parameters]
    # One loop-index symbol per output dimension (idx1, idx2, ...)
    index_symbols = [Symbol(:idx, dimidx) for dimidx = 1:length(newsize)]

    # Apply f to the first element of each argument; this is only used
    # below to compute the output element type via typeof
    first_exprs_vals = [
        begin
            if !(a[i] <: AbstractArray || a[i] <: Tuple)
                :(scalar_getindex(a[$i]))
            else
                :(a[$i][1])
            end
        end
        for i = 1:length(sizes)
    ]
    first_expr = :(f($(first_exprs_vals...)))

    # Index expression for each argument: input dimensions of extent 1 are
    # pinned to index 1 so they broadcast against the full output extent
    exprs_vals = [
        begin
            if !(a[i] <: AbstractArray || a[i] <: Tuple)
                :(scalar_getindex(a[$i]))
            else
                index = [
                    if dim == newdim
                        index_symbols[dimidx]
                    else
                        :(1)
                    end
                    for (dimidx, (dim, newdim)) ∈ enumerate(zip(sizes[i], newsize))
                ]
                :(a[$i][$(index...)])
            end
        end
        for i = 1:length(sizes)
    ]
    compute_expr = :(elements[$(index_symbols...)] = f($(exprs_vals...)))
    # Wrap the assignment in one loop per dimension; the first
    # (fastest-varying) dimension ends up as the innermost loop
    for dimidx = 1:length(newsize)
        compute_expr = quote
            for $(index_symbols[dimidx]) = 1:$(newsize[dimidx])
                $compute_expr
            end
        end
    end

    res = quote
        @_inline_meta
        @inbounds elements = similar($first_staticarray, typeof($first_expr), Size(newsize))
        @inbounds $compute_expr
        @inbounds return similar_type($first_staticarray, eltype(elements), Size(newsize))(elements)
    end
    # println(res)
    return res
end

StaticArrays of that size are going to stress the compiler. Could you expand on the underlying motivation? What are you actually trying to calculate? It may be considerably faster to use LoopVectorization with base-Julia arrays of that size.

julia> using StaticArrays, LoopVectorization, BenchmarkTools

julia> @time static = @SVector rand(2^10);
  9.695215 seconds (4.70 M allocations: 195.433 MiB, 2.41% gc time, 99.99% compilation time)

julia> @time static .* static;
  2.996434 seconds (3.19 M allocations: 151.434 MiB, 1.88% gc time, 99.98% compilation time)

julia> @btime $static .* $static;
  980.000 ns (0 allocations: 0 bytes)

julia> array = rand(1024);

julia> @time @avx array .* array;
 11.953388 seconds (12.37 M allocations: 705.798 MiB, 2.18% gc time, 100.00% compilation time)

julia> @btime @avx $array .* $array;
  774.176 ns (1 allocation: 8.12 KiB)

julia> @btime @avxt $array .* $array;
  693.671 ns (1 allocation: 8.12 KiB)

Edit: or just plain broadcasting!

julia> @btime $array .* $array;
  740.244 ns (1 allocation: 8.12 KiB)

I’m only passingly familiar with the implementation details of StaticArrays, so take this with a grain of salt…

I believe that SVector is just an NTuple under the hood, and so I suspect that you are seeing the effect of some heuristic in Base relating to very long Tuples. That said, I can’t imagine that you will gain much performance using static arrays with vectors that long.
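
A rough way to see this (the data field is an implementation detail of StaticArrays, so don’t rely on it):

julia> using StaticArrays

julia> v = @SVector [1.0, 2.0, 3.0];

julia> v.data  # the wrapped NTuple
(1.0, 2.0, 3.0)

julia> typeof(v.data)
Tuple{Float64, Float64, Float64}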

Quite possibly you will lose performance. Realize that StaticArrays tries to completely unroll everything, but if you unroll too long a calculation then you will spill out of the instruction cache.


In this case, it seems that you’ll get the best performance by preallocating an output array and using LoopVectorization’s vmap!:

julia> out = similar(array);

julia> @btime vmap!(*, $out, $array, $array);
  140.210 ns (0 allocations: 0 bytes)

If LoopVectorization doesn’t win a benchmark, that’s grounds for filing an issue. :wink:
E.g., here or here.
Doesn’t mean I’ll be able to address it immediately, but it’d be something for me to look into.

In this case, it should be fixed in LoopVectorization 0.12.8 (just released).

One important thing to note is that this benchmark:

julia> array = randn(1024); out = similar(array);

julia> @btime $out .= $array .* $array;
  56.647 ns (0 allocations: 0 bytes)

is bottlenecked by loading from the two arrays: array and array.
Of course, array === array here, so:

julia> @btime $out .= abs2.($array);
  44.697 ns (0 allocations: 0 bytes)

Something that (now as of 0.12.8) works for LoopVectorization is wrapping them in a function.
It now inlines the broadcasting code, so even though LoopVectorization has no way of knowing array === array, LLVM knows that:

julia> using LoopVectorization

julia> @inline avxmul!(out,array) = @avx @. out = array * array
avxmul! (generic function with 1 method)

julia> @btime avxmul!($out, $array);
  44.482 ns (0 allocations: 0 bytes)

julia> @btime @avx @. $out = abs2($array);
  44.626 ns (0 allocations: 0 bytes)

But this doesn’t seem to work for base broadcasting:

julia> @inline basemul!(out, array) = @. out = array * array
basemul! (generic function with 1 method)

julia> @btime basemul!($out, $array);
  57.162 ns (0 allocations: 0 bytes)

We can confirm this by looking at the asm. Base:

L1056:
        vmovupd zmm0, zmmword ptr [rax + 8*rdi]
        vmovupd zmm1, zmmword ptr [rax + 8*rdi + 64]
        vmovupd zmm2, zmmword ptr [rax + 8*rdi + 128]
        vmovupd zmm3, zmmword ptr [rax + 8*rdi + 192]
        vmulpd  zmm0, zmm0, zmmword ptr [rcx + 8*rdi]
        vmulpd  zmm1, zmm1, zmmword ptr [rcx + 8*rdi + 64]
        vmulpd  zmm2, zmm2, zmmword ptr [rcx + 8*rdi + 128]
        vmulpd  zmm3, zmm3, zmmword ptr [rcx + 8*rdi + 192]
        vmovupd zmmword ptr [rdx + 8*rdi], zmm0
        vmovupd zmmword ptr [rdx + 8*rdi + 64], zmm1
        vmovupd zmmword ptr [rdx + 8*rdi + 128], zmm2
        vmovupd zmmword ptr [rdx + 8*rdi + 192], zmm3
        add     rdi, 32
        cmp     rsi, rdi
        jne     L1056

Note we load from one vector with vmovupd, and then vmulpd loads from a second vector while multiplying with the previously loaded vector.
@avx:

L48:
        vmovupd zmm0, zmmword ptr [rdi + 8*rsi]
        vmovupd zmm1, zmmword ptr [rdi + 8*rsi + 64]
        vmovupd zmm2, zmmword ptr [rdi + 8*rsi + 128]
        vmovupd zmm3, zmmword ptr [rdi + 8*rsi + 192]
        vmulpd  zmm0, zmm0, zmm0
        vmulpd  zmm1, zmm1, zmm1
        vmulpd  zmm2, zmm2, zmm2
        vmulpd  zmm3, zmm3, zmm3
        vmovupd zmmword ptr [rdx + 8*rsi], zmm0
        vmovupd zmmword ptr [rdx + 8*rsi + 64], zmm1
        vmovupd zmmword ptr [rdx + 8*rsi + 128], zmm2
        vmovupd zmmword ptr [rdx + 8*rsi + 192], zmm3
        add     rsi, 32
        cmp     rsi, rcx
        jle     L48

Which is the same code abs2 produces.

Anyway, my favorite benchmark:

@inline basemulabs2!(out, array) = @. out = abs2(array)

using Random
Ns = shuffle(axes(array,1));
foreachn(f::F, out, array, Ns) where {F} = foreach(n -> f(view(out, Base.OneTo(n)), view(array, Base.OneTo(n))), Ns)

@btime foreachn(avxmul!, $out, $array, $Ns);
@btime foreachn(basemulabs2!, $out, $array, $Ns);

Yielding:

julia> @btime foreachn(avxmul!, $out, $array, $Ns);
  26.862 μs (0 allocations: 0 bytes)

julia> @btime foreachn(basemulabs2!, $out, $array, $Ns);
  36.207 μs (0 allocations: 0 bytes)

But this performance advantage is going to be much smaller without AVX512.

Anyway, yes, don’t use StaticArrays at these sizes.
They’re very unfriendly to the compiler, and using NTuples as arrays, or even using LLVM’s array type as arrays, doesn’t really work.
LLVM arrays require constant indices, which means that there is no way to work with them other than to unroll everything. For example, Base.setindex:

julia> @code_typed Base.setindex((1,2,3,4), 3, 4)
CodeInfo(
1 ─       goto #8 if not $(Expr(:boundscheck))
2 ─ %2  = Base.sle_int(1, i)::Bool
└──       goto #4 if not %2
3 ─ %4  = Base.sle_int(i, 4)::Bool
└──       goto #5
4 ─       nothing::Nothing
5 ┄ %7  = φ (#3 => %4, #4 => false)::Bool
└──       goto #7 if not %7
6 ─       goto #8
7 ─ %10 = Base.BoundsError(x, i)::Any
│         Base.throw(%10)::Union{}
└──       unreachable
8 ┄ %13 = Core.getfield(x, 1)::Int64
│   %14 = Core.getfield(x, 2)::Int64
│   %15 = Core.getfield(x, 3)::Int64
│   %16 = Core.getfield(x, 4)::Int64
│   %17 = (i === 1)::Bool
│   %18 = Base.ifelse(%17, v, %13)::Int64
│   %19 = Base.sub_int(i, 1)::Int64
│   %20 = (%19 === 1)::Bool
│   %21 = Base.ifelse(%20, v, %14)::Int64
│   %22 = Base.sub_int(%19, 1)::Int64
│   %23 = (%22 === 1)::Bool
│   %24 = Base.ifelse(%23, v, %15)::Int64
│   %25 = Base.sub_int(%22, 1)::Int64
│   %26 = (%25 === 1)::Bool
│   %27 = Base.ifelse(%26, v, %16)::Int64
│   %28 = Core.tuple(%18, %21, %24, %27)::NTuple{4, Int64}
└──       return %28
) => NTuple{4, Int64}

It works by creating a new tuple, using ifelse to select which element to replace.
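
In hand-written form, that generated code is roughly equivalent to this sketch (not the actual Base definition):

# Build a brand-new tuple, picking v at position i and the old element
# everywhere else; ifelse keeps the code branch-free.
function setindex4(x::NTuple{4}, v, i::Integer)
    1 <= i <= 4 || throw(BoundsError(x, i))
    (ifelse(i == 1, v, x[1]),
     ifelse(i == 2, v, x[2]),
     ifelse(i == 3, v, x[3]),
     ifelse(i == 4, v, x[4]))
end

setindex4((1, 2, 3, 4), 3, 4)  # (1, 2, 3, 3)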

Julia’s codegen can sometimes instead gain access to pointers to NTuples, but users (and Base functions like Base.setindex) cannot, AFAIK.
The best we can do is wrap them in a mutable struct of some sort, which is what a StaticArrays.MVector is.
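
For illustration, here is a minimal, hypothetical version of that trick (MiniMVector is not a real StaticArrays type, and the pointer store is only valid for isbits element types):

# An NTuple inside a mutable struct: the struct has a stable address,
# so we can write into the tuple's memory through a pointer.
mutable struct MiniMVector{N,T} <: AbstractVector{T}
    data::NTuple{N,T}
end
MiniMVector(t::NTuple{N,T}) where {N,T} = MiniMVector{N,T}(t)

Base.size(::MiniMVector{N}) where {N} = (N,)
Base.getindex(v::MiniMVector, i::Int) = v.data[i]

function Base.setindex!(v::MiniMVector{N,T}, x, i::Int) where {N,T}
    @boundscheck checkbounds(v, i)
    GC.@preserve v begin
        p = convert(Ptr{T}, pointer_from_objref(v))
        unsafe_store!(p, convert(T, x), i)  # overwrite the i-th tuple slot
    end
    return v
end

m = MiniMVector((1.0, 2.0, 3.0))
m[2] = 5.0  # mutation that a bare NTuple cannot do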


I was hoping that by using StaticArrays I would avoid the costs associated with Julia’s arrays being dynamically allocated and resizable. To be clear, I had explicitly rewritten part of the library to remove the unrolling behavior from broadcasting. It sounds like the recommendation is to just use plain Julia arrays, but I am worried that this would impart a noticeable runtime cost. For context, I am writing audio processing code. As an example, I might have a stereo audio buffer (2 channels, 512 samples per channel) and want to multiply it by a stereo sample to do panning:

panned_audio = unpanned_audio .* SA_F32[1.0f0; 0.5f0]

If the arrays are both static, the broadcasting logic can be done at compile time. In my case, I’ve modified the StaticArrays package to produce a simple for loop expression instead of a fully unrolled expression. My understanding is that if I were to replace this with regular Julia arrays, then the size would only be available at runtime, necessitating the execution of complex broadcasting logic every time this expression is encountered. Is this concern justified or does this not have to happen?

The complicated parts of the broadcasting logic are mostly done at compile-time, because they only depend on the types of the containers etc. All that’s left at runtime are (1) a loop and (2) an allocation for the result. (2) can be eliminated if you pre-allocate your output buffer and re-use it with .=.

In other words, panned_audio .= unpanned_audio .* SA_F32[1.0f0; 0.5f0] (changing to a preallocated output) should be essentially equivalent to:

v = SA_F32[1.0f0; 0.5f0]
(size(panned_audio) == size(unpanned_audio) && size(panned_audio, 1) == length(v)) || throw(DimensionMismatch())
for j = 1:size(panned_audio, 2), i = 1:size(panned_audio, 1)
    panned_audio[i, j] = unpanned_audio[i, j] * v[i]
end

which isn’t bad at all.

StaticArrays are mainly useful when you have lots of tiny arrays (e.g. for 3d coordinate vectors), and preallocating everything is painful. Also, for tiny arrays the overhead of looping and size checks can be more significant, and the compiler can do nice tricks with the unrolled code (e.g. putting everything in registers). Neither of these is really an issue for large arrays.
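
An illustrative comparison for the tiny-array case (timings omitted, since they vary by machine):

using StaticArrays, BenchmarkTools

a = rand(3)
s = SVector{3}(a)

@btime $a + $a;  # plain Vector: allocates a fresh array on every call
@btime $s + $s;  # SVector: unrolled, zero allocations, no heap traffic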


You can try the HybridArrays package, whose arrays are static along some axes and dynamic along others. In this case it appears to give a nice speedup:

using HybridArrays, StaticArrays

Au = rand(Float32, 2, 512);
Ap = similar(Au);

p = SA[1.0f0, 0.5f0];

Auh = HybridMatrix{2, StaticArrays.Dynamic()}(Au);
Aph = similar(Auh);

Timing:

julia> @btime $Ap .= $Au .* $p;
  862.745 ns (0 allocations: 0 bytes)

julia> @btime $Aph .= $Auh .* $p;  # hybrid matrix
  222.333 ns (0 allocations: 0 bytes)

I’d love to try out the new LoopVectorization for comparison, but cannot install anything beyond v0.10.


I ended up making a wrapper type around Array which holds the size of the array as a type parameter, so that I retain some of the readability and benefits of that system while still storing the data in regular arrays under the hood rather than in excessively long tuples.
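
For reference, a minimal sketch of that idea (SizedBuffer is a hypothetical name; StaticArrays’ own SizedArray is a ready-made implementation of the same pattern):

# Size lives in the type; data lives in a plain Array.
struct SizedBuffer{S,T,N} <: AbstractArray{T,N}
    data::Array{T,N}
    function SizedBuffer{S}(data::Array{T,N}) where {S,T,N}
        size(data) == S || throw(DimensionMismatch("expected size $S, got $(size(data))"))
        return new{S,T,N}(data)
    end
end

Base.size(::SizedBuffer{S}) where {S} = S
Base.IndexStyle(::Type{<:SizedBuffer}) = IndexLinear()
Base.getindex(b::SizedBuffer, i::Int) = b.data[i]
Base.setindex!(b::SizedBuffer, x, i::Int) = (b.data[i] = x; b)

buf = SizedBuffer{(2, 512)}(rand(Float32, 2, 512))  # size is now in the type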


Note that this will still compile a different specialization of every function that takes the wrapper, one for each distinct size parameter that appears during the computation. There is some chance this ends up as the worst of both worlds instead of the best, unless you carefully control when to pass a function the unwrapped array and when to pass the wrapped one, or you change sizes very seldom.
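
To make that concrete, an illustrative demo using StaticArrays’ SizedVector, which works the same way as such a wrapper (the first two calls each pay their own compile time):

using StaticArrays

g(x) = sum(x)

@time g(SizedVector{1000}(rand(1000)));  # compiles a specialization for length 1000
@time g(SizedVector{1001}(rand(1001)));  # compiles again: new size, new specialization
@time g(rand(1002));                     # plain Vector: compiles once...
@time g(rand(1003));                     # ...and is reused for every other length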
