Stack-allocated arrays without excessive compilation time

I find SArray very useful because they are stack-allocated — this is important generally, but specifically for threaded code I find that minimizing allocations is a must.

However, I also fine that excessive use of SArray, especially for nested types (which can be also SArrays, or ForwardDiff.Dual, etc) really taxes the compiler.

I am looking for a general strategy of implementing a vector (or array, that follows easily) which

  1. can be stack-allocated,
  2. but does not generate specialized code based on dimensions.

Is this possible? Should I use NTuple as a data storage, or someting else?

Here is an attempt, but I think that this still specializes code:

using StaticArrays, BenchmarkTools, MethodAnalysis
import LinearAlgebra: dot

struct StackVector{L,T}
    data::NTuple{L,T}
end

function dot(a::StackVector{L}, b::StackVector{L}) where L
    A = a.data
    B = b.data
    @nospecialize L
    s = zero(first(A) * first(B))
    @inbounds @simd for i in 1:L
        s += A[i] * B[i]
    end
    s
end

v = StackVector(ntuple(_ -> rand(), Val(1000)))
s = SVector(v.data)

@btime dot(s, s)
@btime dot(v, v)
methodinstances(dot) # still get specialization for L = 1000

Eg in the above, I would prefer to have the compiler invest as much effort into the relevant method for dot as it would for dot(::Vector{T}, ::Vector{S}).

This is exploratory, if the question is unclear please just ask.

7 Likes

Well, there is @nospecialize but I think it is not recursive, and it cannot be used for function calls only function definitions. So it would only help to mitigate the problem a little.

Creating a wrapper subtyping DenseArray with an SArray abstract non-parametric field could work, but probably the boxing of the SArray will allocate (the type of the wrapper would always the same, but it would delegate all the work to a stack-allocated SArray).

I guess you could write something low-level based on the LLVM alloca intrinsic, but you’d have to be careful about safety (e.g. not allowing it to escape the calling scope) when using it.

2 Likes

The long term answer for this is that Julia could make NTuple more efficient for the compiler. Currently NTuple is basically just an alias to a Tuple, but theoretically the compiler could do special tracking that would make large SArrays more efficient.

5 Likes

And make SArrays more memory efficient by encoding the shared type once rather than N times, once for each element of the alias to a Tuple?

1 Like

I think this isn’t a memory efficiency issue since (type stable) static array code will usually specialize down to machine operations. By the time the code actually runs, the type information usually has disappeared for all of the intermediate computations.

1 Like

It doesn’t work with llvmcall, because llvmcall creates an inlined llvm function.
Despite being inlined, it preserves the lifetime information: the memory is immediately freed because the function whose body is just %ptr = alloca i64, i64 %N immediately returns when the alloca is complete.

The program owns the data, so you won’t get a segfault. But you will overwrite the memory of other allocas/stack objects.

A suggestion here would be to use MArray + UnsafeArrays. That is, GC.@preserve the memory surrounding all use to preserve the lifetime.
The array itself won’t escape when you pass the pointers around to functions, so it will be stack allocated.

As getindex and setindex! are defined in a way bypassing the tuple (unsafe_load/unsafe_store! on the pointer), compile times avoid the NTuple penalty (but this may already be much better?).

Importantly, if you define your own fat-pointer array type, you can avoid writing a bunch of methods that all manually unroll it. By not doing that, compile times should be good.
Optionally, you could make your fat-pointer dynamically sized. But, if you only have 1 size, then static size information by itself will not hurt compile times (it’s tuples and manual unrolling that are the problem).

3 Likes

A relevant julia issue

2 Likes

Just checking that I understand correctly: if I wrapped an NTuple like SArray, but avoided all that @generated code, wouldn’t the compiler still compile a method for each size? This is what I don’t know how to avoid.

Okay, if you have a lot of different sizes, that (different sizes when just avoiding @generated code) is still a problem, but the PtrArray approach still works. Hold on for an example…

@Tamas_Papp This example requires PaddedMatrices’ master branch, but I should make a new release within the next few days.
It doesn’t support Dual numbers yet, so you may also want to either request that I add support for that, or take the idea for your own implementation of this approach.

Example:

julia> using PaddedMatrices, VectorizationBase, ArrayInterface

julia> using VectorizationBase: StridedPointer

julia> @noinline function foo(A)
           A[1] = 1 / A[1]
           A[1] + A[2]
       end
foo (generic function with 1 method)

julia> function make_stride_dynamic(p::StridedPointer{T,N,C,B,R}) where {T,N,C,B,R}
           StridedPointer{T,N,C,B,R}(p.p, map(Int, p.strd), p.offsets)
       end
make_stride_dynamic (generic function with 1 method)

julia> function make_dynamic(A::PtrArray)
           PtrArray(make_stride_dynamic(stridedpointer(A)), Base.size(A), ArrayInterface.dense_dims(A))
       end
make_dynamic (generic function with 1 method)

julia> function make_dynamic(A::StrideArray)
           StrideArray(make_dynamic(PtrArray(A)), A.data)
       end
make_dynamic (generic function with 2 methods)

julia> function bar(M, N, iter = 100)
           S = @StrideArray rand(M, N)
           D = make_dynamic(S)
           s = 0.0
           for i ∈ 1:iter
               # macro currently doesn't look through nested expressions
               s += @gc_preserve foo(D)
           end
           s
       end
bar (generic function with 2 methods)

julia> @allocated bar(StaticInt(1), StaticInt(4), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(4), 100)
0

julia> @allocated bar(StaticInt(3), StaticInt(4), 100)
0

julia> @allocated bar(StaticInt(4), StaticInt(4), 100)
0

julia> @allocated bar(StaticInt(5), StaticInt(4), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(5), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(6), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(7), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(8), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(9), 100)
0

julia> using MethodAnalysis

julia> mis = methodinstances(foo)
1-element Vector{Core.MethodInstance}:
 MethodInstance for foo(::PtrArray{Tuple{Int64, Int64}, (true, true), Float64, 2, 1, 0, (1, 2), Tuple{Int64, Int64}, Tuple{StaticInt{1}, StaticInt{1}}})

A rundown of what bar is doing:

  1. Creates a random array. If the dimensions are StaticInts, this array is statically sized and will (given certain conditions) be stack allocated.
  2. It makes the array dynamic, by discarding the static size and and stride information. The static tuple information remains.
  3. Repeatedly calls foo. foo is not inlined (@noinline), and foo also mutates the array.

However, the @gc_preserve macro (which is still primitive, it doesn’t actually descend/walk expressions…) GC.@preserves the memory backing the array and sheds the last bit of static tuple size info, before passing the array to foo.

Thus, @allocated shows 0.
After trying a variety of different sizes, I call methodinstances to confirm that only a single foo method has been compiled, because the arrays are indeed dynamically sized.

3 Likes

I’m not sure if it’s the same thing happening here, but we found very similarly bad compilation times from our implementation of FixedSizeStrings (quite similar to JC’s JuliaComputing/FixedSizeStrings.jl), that turned out to be caused by a bad interaction between julia’s codegen and LLVM unrolling some operations over the static array into very, very large functions.

You should try the above experiments with -O1 and -O0 to see if the problem disappears. If it does, this is a repeat of the same problem we saw.

@jeff.bezanson has been looking into the issue in LLVM’s optimization passes themselves, but I’m not sure where that’s being tracked. Last I heard there wasn’t anything obvious sticking out, unfortunately. :confused:

We ended up doing a rewrite of our FixedSizeStrings that backs them with a bunch of variably sized primitive integers, all the way up to 2048-bit integers (:scream:!), by generating a bunch of these in a loop: @eval primitive type Str$N end.

This actually did massively improve our compilation times, because it seems to be enough to prevent LLVM from unrolling operations on them (:roll_eyes:). I’ll try to write up a small blog post explaining what we did, in case it’s helpful. Maybe StaticArrays could do the same?

3 Likes