Can Julia optimize mutable static arrays to be allocated on the stack?

According to the documentation of StaticArrays.jl, mutable static arrays are implemented as mutable structs and are allocated on the heap even though their sizes are known at compile time. Is Julia’s compiler able to detect cases when it’s safe to allocate on the stack instead? (I suppose this must be possible theoretically, but wonder if this feature has been / will be implemented.)

Yes.

julia> using StaticArrays, BenchmarkTools

julia> if VERSION >= v"1.7.0-beta"
         @inline exp_fast(x) =  Base.Math.exp_impl_fast(x, Val(:ℯ))
       else
         exp_fast(x) = exp(x)
       end
exp_fast (generic function with 1 method)

julia> function alloctest(x)
         y = MVector(x)
         @inbounds @simd ivdep for i ∈ eachindex(y)
           y[i] = exp_fast(y[i])
         end
         s = zero(eltype(y))
         @fastmath for i ∈ eachindex(y)
           s += y[i]
         end
         s
       end
alloctest (generic function with 1 method)

julia> x = @SVector rand(32);

julia> @btime alloctest($x)
  11.995 ns (0 allocations: 0 bytes)
56.18908775961786

The assembly confirms that y is in fact stack allocated (loads and stores use rsp, the 64-bit-mode stack pointer).
Note that in many cases, the MArray will not be allocated at all, existing only in the CPU’s registers if at all.
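For anyone who wants to check this on their own machine, here is a minimal sketch of how the native code can be captured and searched. The helper `double_sum` is a made-up example, not the function benchmarked above, and the `rsp` check only makes sense on x86-64. A very small vector may never touch the stack at all and live in registers only, so a missing `rsp` does not by itself indicate a heap allocation.

```julia
using StaticArrays
using InteractiveUtils  # stdlib; provides code_native outside the REPL

# Hypothetical helper (not from the post above): copy into an MVector,
# mutate it in place, then reduce it to a scalar.
function double_sum(x)
    y = MVector(x)
    @inbounds for i in eachindex(y)
        y[i] *= 2
    end
    return sum(y)
end

x = @SVector rand(8)

# Capture the generated assembly as a String instead of printing it.
asm = sprint(code_native, double_sum, (typeof(x),))

# Assumption: x86-64 target, where stack accesses are addressed through rsp.
uses_stack_pointer = occursin("rsp", asm)
println("references rsp: ", uses_stack_pointer)
```

Printing `asm` and reading the loads and stores directly is more reliable than the string search, which is only a quick first look.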

For some reason, it still allocates when the index is not known at compile time.

julia> function wat(char)
           buf = @MVector [0,0,0]
           buf[char - 'a' + 1] = 1
           return 0
       end

julia> @time wat('a');
  0.000004 seconds (1 allocation: 32 bytes)

Huh.

julia> using StaticArrays

julia> function wat(char)
           buf = @MVector [0,0,0]
           buf[char - 'a' + 1] = 1
           return 0
       end
wat (generic function with 1 method)

julia> @time wat('a');
  0.000000 seconds

julia> @time wat('a');
  0.000000 seconds

Maybe you could try @btime?
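If BenchmarkTools is not at hand, a rough alternative is to warm the function up once and then read the byte count from `@allocated` inside a small wrapper. This is only a sketch: `measure_wat` is a hypothetical helper name, and the number it reports depends on the Julia version, as the rest of this thread shows.

```julia
using StaticArrays

function wat(char)
    buf = @MVector [0, 0, 0]
    buf[char - 'a' + 1] = 1
    return 0
end

# Hypothetical wrapper so the measurement happens inside compiled code.
measure_wat() = @allocated wat('a')

measure_wat()           # warm-up call, so compilation is not what gets measured
bytes = measure_wat()   # bytes allocated by a single call to wat('a')
println("wat('a') allocated ", bytes, " bytes")
```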

This appears to be fixed in Julia 1.9.0-alpha1.

Julia 1.8.0
julia> using StaticArrays, BenchmarkTools

julia> function wat(char)
           buf = @MVector [0,0,0]
           buf[char - 'a' + 1] = 1
           return 0
       end
wat (generic function with 1 method)

julia> @time wat('a');
  0.000005 seconds (1 allocation: 32 bytes)

julia> @time wat('a');
  0.000002 seconds (1 allocation: 32 bytes)

julia> @btime wat('a');
  12.813 ns (1 allocation: 32 bytes)

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab7 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 20 virtual cores

Julia 1.8.3
julia> using StaticArrays, BenchmarkTools

julia> function wat(char)
           buf = @MVector [0,0,0]
           buf[char - 'a' + 1] = 1
           return 0
       end
wat (generic function with 1 method)

julia> @time wat('a');
  0.000007 seconds (1 allocation: 32 bytes)

julia> @time wat('a');
  0.000002 seconds (1 allocation: 32 bytes)

julia> @btime wat('a');
  8.208 ns (1 allocation: 32 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 20 virtual cores

Julia 1.8.3 w/ -Cskylake option
julia> using StaticArrays, BenchmarkTools

julia> function wat(char)
           buf = @MVector [0,0,0]
           buf[char - 'a' + 1] = 1
           return 0
       end
wat (generic function with 1 method)

julia> @time wat('a');
  0.000006 seconds (1 allocation: 32 bytes)

julia> @time wat('a');
  0.000002 seconds (1 allocation: 32 bytes)

julia> @btime wat('a');
  7.800 ns (1 allocation: 32 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 20 virtual cores

Julia 1.9.0-alpha1
julia> using StaticArrays, BenchmarkTools

julia> function wat(char)
           buf = @MVector [0,0,0]
           buf[char - 'a' + 1] = 1
           return 0
       end
wat (generic function with 1 method)

julia> @time wat('a');
  0.000002 seconds

julia> @time wat('a');
  0.000001 seconds

julia> @btime wat('a');
  4.600 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d739 (2022-11-15 14:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i9-12900HK
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 1 on 20 virtual cores

Who hasn’t been running https://github.com/JuliaLang/julia/pull/47184 for the past couple months? :wink:

It’s been merged to master! :tada:

Is there a description somewhere of how this PR affects Julia's current behavior? I have trouble understanding what it really means, and you all seem thrilled, so I wonder :slight_smile:

The TL;DR is that the Julia compilation pipeline has a few steps:

  1. parsing/lowering
  2. type inference and julia IR optimization (e.g. inlining)
  3. LLVM optimizations and code generation.

Prior to https://github.com/JuliaLang/julia/pull/47184, precompilation only saved the results of steps 1 and 2. This PR makes it save the result of step 3 as well, which improves responsiveness (sometimes dramatically).
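To make the three steps concrete, here is a small sketch that asks Julia for the output of each stage using the standard reflection functions. The function `addone` is a made-up example, and this only inspects what the compiler produces; it does not show how precompilation stores anything.

```julia
using InteractiveUtils  # stdlib; provides code_llvm / code_native outside the REPL

# Hypothetical example function, unrelated to the PR itself.
addone(x) = x + 1

# Step 1: parsing and lowering.
ex = Meta.parse("addone(2)")            # surface syntax as an Expr
lowered = Meta.lower(@__MODULE__, ex)   # lowered form of that expression

# Step 2: type inference and Julia IR optimization (e.g. inlining).
typed = code_typed(addone, (Int,))      # optimized, type-inferred IR

# Step 3: LLVM optimizations and native code generation.
llvm_ir = sprint(code_llvm, addone, (Int,))
native = sprint(code_native, addone, (Int,))

println(typed)
println(llvm_ir)
```

Step 3 is usually the most expensive of the three, which is why being able to reuse its output matters for latency.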

Sorry for jumping in here after a couple of months, but I discovered something pertinent. The example method above doesn't allocate even on earlier versions (at least on Julia v1.8.5, which I'm running) if you add @inbounds to it:

julia> using StaticArrays, BenchmarkTools

julia> function wat(char)
           buf = @MVector [0,0,0]
           @inbounds buf[char - 'a' + 1] = 1
           return 0
       end
       end
wat (generic function with 1 method)

julia> @btime wat('a');
  3.400 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores

I discovered this because, based on this conversation, I tried the code I was working on in 1.9.0-beta4 and it still allocated. I tracked it down to a loop, added @inbounds, and it stopped allocating. Then I tried the same code back in 1.8.5, and it didn't allocate there either.

So, if your MVectors are allocating, check if some @inbounds can help.
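A word of caution, with a sketch: @inbounds with an index that is actually out of range is undefined behavior, so it is only safe when the index has been validated some other way. My guess (an assumption, not something confirmed in this thread) is that the failing branch of the bounds check builds a BoundsError that references the array, which forces the array onto the heap. The version below checks the index by hand without passing the buffer to the error. `mark_letter` is a hypothetical name, and whether it avoids the allocation on a given Julia version is something to measure rather than assume.

```julia
using StaticArrays

# Hypothetical variant of `wat`: validate the index first, then skip the
# built-in bounds check. Returns the sum of the buffer so the write is used.
function mark_letter(char)
    buf = @MVector [0, 0, 0]
    idx = char - 'a' + 1
    # Manual range check; the error carries the character, not the buffer.
    1 <= idx <= length(buf) || throw(ArgumentError("character out of range: $char"))
    @inbounds buf[idx] = 1
    return sum(buf)
end

println(mark_letter('b'))
```

Calling `mark_letter('z')` throws an ArgumentError instead of silently writing past the end of the buffer.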
