# Stack-allocated arrays without excessive compilation time

I find `SArray` very useful because they are stack-allocated — this is important generally, but specifically for threaded code I find that minimizing allocations is a must.

However, I also fine that excessive use of `SArray`, especially for nested types (which can be also `SArray`s, or `ForwardDiff.Dual`, etc) really taxes the compiler.

I am looking for a general strategy of implementing a vector (or array, that follows easily) which

1. can be stack-allocated,
2. but does not generate specialized code based on dimensions.

Is this possible? Should I use `NTuple` as a data storage, or someting else?

Here is an attempt, but I think that this still specializes code:

``````using StaticArrays, BenchmarkTools, MethodAnalysis
import LinearAlgebra: dot

struct StackVector{L,T}
data::NTuple{L,T}
end

function dot(a::StackVector{L}, b::StackVector{L}) where L
A = a.data
B = b.data
@nospecialize L
s = zero(first(A) * first(B))
@inbounds @simd for i in 1:L
s += A[i] * B[i]
end
s
end

v = StackVector(ntuple(_ -> rand(), Val(1000)))
s = SVector(v.data)

@btime dot(s, s)
@btime dot(v, v)
methodinstances(dot) # still get specialization for L = 1000
``````

Eg in the above, I would prefer to have the compiler invest as much effort into the relevant method for `dot` as it would for `dot(::Vector{T}, ::Vector{S})`.

7 Likes

Well, there is `@nospecialize` but I think it is not recursive, and it cannot be used for function calls only function definitions. So it would only help to mitigate the problem a little.

Creating a wrapper subtyping `DenseArray` with an `SArray` `abstract` non-parametric field could work, but probably the boxing of the `SArray` will allocate (the type of the wrapper would always the same, but it would delegate all the work to a stack-allocated `SArray`).

I guess you could write something low-level based on the LLVM `alloca` intrinsic, but you’d have to be careful about safety (e.g. not allowing it to escape the calling scope) when using it.

2 Likes

The long term answer for this is that Julia could make `NTuple` more efficient for the compiler. Currently `NTuple` is basically just an alias to a `Tuple`, but theoretically the compiler could do special tracking that would make large `SArray`s more efficient.

5 Likes

And make SArrays more memory efficient by encoding the shared type once rather than N times, once for each element of the alias to a `Tuple`?

1 Like

I think this isn’t a memory efficiency issue since (type stable) static array code will usually specialize down to machine operations. By the time the code actually runs, the type information usually has disappeared for all of the intermediate computations.

1 Like

It doesn’t work with `llvmcall`, because `llvmcall` creates an inlined llvm function.
Despite being inlined, it preserves the lifetime information: the memory is immediately freed because the function whose body is just `%ptr = alloca i64, i64 %N` immediately returns when the `alloca` is complete.

The program owns the data, so you won’t get a segfault. But you will overwrite the memory of other allocas/stack objects.

A suggestion here would be to use `MArray` + `UnsafeArray`s. That is, `GC.@preserve` the memory surrounding all use to preserve the lifetime.
The array itself won’t escape when you pass the pointers around to functions, so it will be stack allocated.

As `getindex` and `setindex!` are defined in a way bypassing the tuple (`unsafe_load`/`unsafe_store!` on the pointer), compile times avoid the `NTuple` penalty (but this may already be much better?).

Importantly, if you define your own fat-pointer array type, you can avoid writing a bunch of methods that all manually unroll it. By not doing that, compile times should be good.
Optionally, you could make your fat-pointer dynamically sized. But, if you only have 1 size, then static size information by itself will not hurt compile times (it’s tuples and manual unrolling that are the problem).

3 Likes

A relevant julia issue

2 Likes

Just checking that I understand correctly: if I wrapped an `NTuple` like `SArray`, but avoided all that `@generated` code, wouldn’t the compiler still compile a method for each size? This is what I don’t know how to avoid.

Okay, if you have a lot of different sizes, that (different sizes when just avoiding `@generated` code) is still a problem, but the `PtrArray` approach still works. Hold on for an example…

@Tamas_Papp This example requires PaddedMatrices’ master branch, but I should make a new release within the next few days.
It doesn’t support `Dual` numbers yet, so you may also want to either request that I add support for that, or take the idea for your own implementation of this approach.

Example:

``````julia> using PaddedMatrices, VectorizationBase, ArrayInterface

julia> using VectorizationBase: StridedPointer

julia> @noinline function foo(A)
A[1] = 1 / A[1]
A[1] + A[2]
end
foo (generic function with 1 method)

julia> function make_stride_dynamic(p::StridedPointer{T,N,C,B,R}) where {T,N,C,B,R}
StridedPointer{T,N,C,B,R}(p.p, map(Int, p.strd), p.offsets)
end
make_stride_dynamic (generic function with 1 method)

julia> function make_dynamic(A::PtrArray)
PtrArray(make_stride_dynamic(stridedpointer(A)), Base.size(A), ArrayInterface.dense_dims(A))
end
make_dynamic (generic function with 1 method)

julia> function make_dynamic(A::StrideArray)
StrideArray(make_dynamic(PtrArray(A)), A.data)
end
make_dynamic (generic function with 2 methods)

julia> function bar(M, N, iter = 100)
S = @StrideArray rand(M, N)
D = make_dynamic(S)
s = 0.0
for i ∈ 1:iter
# macro currently doesn't look through nested expressions
s += @gc_preserve foo(D)
end
s
end
bar (generic function with 2 methods)

julia> @allocated bar(StaticInt(1), StaticInt(4), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(4), 100)
0

julia> @allocated bar(StaticInt(3), StaticInt(4), 100)
0

julia> @allocated bar(StaticInt(4), StaticInt(4), 100)
0

julia> @allocated bar(StaticInt(5), StaticInt(4), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(5), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(6), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(7), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(8), 100)
0

julia> @allocated bar(StaticInt(2), StaticInt(9), 100)
0

julia> using MethodAnalysis

julia> mis = methodinstances(foo)
1-element Vector{Core.MethodInstance}:
MethodInstance for foo(::PtrArray{Tuple{Int64, Int64}, (true, true), Float64, 2, 1, 0, (1, 2), Tuple{Int64, Int64}, Tuple{StaticInt{1}, StaticInt{1}}})
``````

A rundown of what `bar` is doing:

1. Creates a random array. If the dimensions are `StaticInt`s, this array is statically sized and will (given certain conditions) be stack allocated.
2. It makes the array dynamic, by discarding the static size and and stride information. The static tuple information remains.
3. Repeatedly calls `foo`. `foo` is not inlined (`@noinline`), and `foo` also mutates the array.

However, the `@gc_preserve` macro (which is still primitive, it doesn’t actually descend/walk expressions…) `GC.@preserve`s the memory backing the array and sheds the last bit of static tuple size info, before passing the array to `foo`.

Thus, `@allocated` shows `0`.
After trying a variety of different sizes, I call `methodinstances` to confirm that only a single `foo` method has been compiled, because the arrays are indeed dynamically sized.

3 Likes

I’m not sure if it’s the same thing happening here, but we found very similarly bad compilation times from our implementation of FixedSizeStrings (quite similar to JC’s JuliaComputing/FixedSizeStrings.jl), that turned out to be caused by a bad interaction between julia’s codegen and LLVM unrolling some operations over the static array into very, very large functions.

You should try the above experiments with `-O1` and `-O0` to see if the problem disappears. If it does, this is a repeat of the same problem we saw.

@jeff.bezanson has been looking into the issue in LLVM’s optimization passes themselves, but I’m not sure where that’s being tracked. Last I heard there wasn’t anything obvious sticking out, unfortunately.

We ended up doing a rewrite of our FixedSizeStrings that backs them with a bunch of variably sized primitive integers, all the way up to 2048-bit integers (!), by generating a bunch of these in a loop: `@eval primitive type Str\$N end`.

This actually did massively improve our compilation times, because it seems to be enough to prevent LLVM from unrolling operations on them (). I’ll try to write up a small blog post explaining what we did, in case it’s helpful. Maybe StaticArrays could do the same?

3 Likes