Fortran vs Julia stack allocated arrays

array
memory-allocation

#1

I ran into the following HN comment:

But to say something nice about Fortran: dynamically-sized-mutable-stack-allocated-arrays. (-fstack-arrays)

In Julia, stack-allocated objects are immutable, and (partly for that reason) also generally very small. You wouldn’t want to create a new 30x30 array every time you want to edit a single value. You also can’t get stack pointers, meaning you can’t use masked load/store operations to vectorize code when the array dimensions aren’t a multiple of SIMD-vector-width.

This means to write fast code, I normally heap allocate everything. And to avoid triggering the GC, that means keeping the same arrays alive, and avoiding any temporaries.

With Fortran, you don’t have to worry about any of that. You can always write convenient and clear code, and it’s likely to be very fast. If you hold vectorization in mind while laying out your data and computations, compilers will normally do a great job figuring it out.

Why does Julia have this drawback (requiring immutable, fixed-size static arrays for stack allocation) if Fortran doesn’t, and can/will it change?


#2

No.

And no.

Well, this is true in the useless sense that the concept of a stack pointer doesn’t exist in Julia. It’s wrong in that you can definitely get pointers to objects that are allocated on the stack.

Wrong. You can.


#3

So how does one create a stack allocated mutable array in Julia?


#4

I believe there’s a mutable array type, MArray, in StaticArrays


#5

MArray is mostly heap allocated :frowning:

I don’t even know how to get this by llvmcall (in case of statically known dimensions). I asked in https://discourse.julialang.org/t/mutate-ntuple-at-position/17682.


#6

Isn’t that heap allocated?
I do not know of any stack-allocated objects in Julia that are mutable. I thought Julia drew a neat divide between immutable structs and mutable structs, where only some of the former are stack allocated (i.e., when they’re isbits/don’t contain any reference types).
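For reference, that divide can be seen with a toy pair of types (hypothetical names; a sketch, not how StaticArrays is implemented):

```julia
# Sketch: an isbits immutable struct can be kept in registers / on the
# stack, while a freshly created mutable struct that escapes (here, by
# being returned) is generally heap allocated.
struct ImmutablePair
    x::Float64
    y::Float64
end

mutable struct MutablePair
    x::Float64
    y::Float64
end

make_imm(a) = ImmutablePair(a, 2a)
make_mut(a) = MutablePair(a, 2a)

make_imm(1.0); make_mut(1.0)  # warm up so @allocated excludes compilation
@allocated(make_imm(1.0))  # 0
@allocated(make_mut(1.0))  # nonzero on typical Julia versions
```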

They’re just a mutable struct wrapping a tuple. Create and destroy them, and you trigger the garbage collector.

julia> using StaticArrays, BenchmarkTools

julia> @benchmark @MVector randn(4)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     25.867 ns (0.00% GC)
  median time:      27.715 ns (0.00% GC)
  mean time:        37.415 ns (22.50% GC)
  maximum time:     48.765 μs (99.92% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark @SVector randn(4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.204 ns (0.00% GC)
  median time:      22.369 ns (0.00% GC)
  mean time:        22.478 ns (0.00% GC)
  maximum time:     52.456 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark @MVector zeros(4)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     4.744 ns (0.00% GC)
  median time:      8.484 ns (0.00% GC)
  mean time:        18.205 ns (47.74% GC)
  maximum time:     49.892 μs (99.97% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark @SVector zeros(4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.020 ns (0.00% GC)
  median time:      0.030 ns (0.00% GC)
  mean time:        0.029 ns (0.00% GC)
  maximum time:     26.590 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

pointer_from_objref works just fine on them, as does unsafe_load and unsafe_store!. In fact, setindex! is defined using pointer_from_objref and unsafe_store!.
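As an illustration of that pointer trick, here is a minimal sketch (the type and helper names are made up) mimicking what MArray does internally:

```julia
# A mutable struct wrapping a tuple, like MArray does.
mutable struct MiniMVec{N,T}
    data::NTuple{N,T}
end

# Mutate the (nominally immutable) tuple payload through the object
# pointer of the enclosing mutable struct.
function unsafe_setindex!(v::MiniMVec{N,T}, val, i::Int) where {N,T}
    @assert 1 <= i <= N  # manual bounds check
    GC.@preserve v begin
        p = Base.unsafe_convert(Ptr{T}, pointer_from_objref(v))
        unsafe_store!(p, convert(T, val), i)
    end
    return v
end

v = MiniMVec((1.0, 2.0, 3.0))
unsafe_setindex!(v, 42.0, 2)
v.data  # (1.0, 42.0, 3.0)
```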

I made the HN comment because neither foobar’s question above nor my question on masked loads and stores had a positive answer.
Neither of us could even achieve what we wanted via llvmcall.

So if this is possible:

meaning you can’t use masked load/store operations to vectorize code when the array dimensions aren’t a multiple of SIMD-vector-width.

I’d love to learn how!


#7

No. It’s mutable and that’s it. If you pass the object to a non-inlined function, that might force it to be heap allocated, but doing local operations on it will not make it heap allocated.

Stack allocation is a property of the object, mutability is a property of the type.

The benchmark result shows what is allocated when you return them, not when you just use them locally.


#8

here is a small counter example

julia> using StaticArrays, BenchmarkTools

julia> function f(x)
           A = @MVector zeros(typeof(x), 5)
           for i in eachindex(A)
               A[i] = x 
           end
           reduce(+, A)
       end
f (generic function with 1 method)

julia> @btime f(3)
  4.646 ns (0 allocations: 0 bytes)
15

julia> @btime f(4.0)
  5.027 ns (0 allocations: 0 bytes)
20.0

#9

For those like me wondering how on earth non-allocating mutable arrays are possible in pure Julia, it turns out that it isn’t. StaticArrays is cheating. It stores the data as a tuple, which ought to be immutable.

mutable struct MArray{S <: Tuple, T, N, L} <: StaticArray{S, T, N}
    data::NTuple{L,T}
    ...
end

julia> v = MVector(1,2,3)
3-element MArray{Tuple{3},Int64,1,3}:
 1
 2
 3

julia> dump(v)
MArray{Tuple{3},Int64,1,3}
  data: Tuple{Int64,Int64,Int64}
    1: Int64 1
    2: Int64 2
    3: Int64 3

But then mutates it unsafely:

@inline function setindex!(v::MArray, val, i::Int)
    @boundscheck checkbounds(v,i)
    T = eltype(v)

    if isbitstype(T)
        GC.@preserve v unsafe_store!(Base.unsafe_convert(Ptr{T}, pointer_from_objref(v)), convert(T, val), i)
    else

#10

Thanks @Evizero for this example. One thing where Fortran is still at an advantage is that you can have stack-allocated arrays whose size is only known at runtime. Expanding on @Evizero’s example, the best I can do is

function f(x, ::Val{n}) where {n}
   A = @MVector zeros(typeof(x), n)
   for i in eachindex(A)
      A[i] = x 
   end
   reduce(+, A)
end

f(x, n::Integer) = f(x, Val(n))

which is indeed allocation-free:

julia> @btime f(3.0)
3.036 ns (0 allocations: 0 bytes)
15.0

julia> @btime f(3.0, 5)
2.940 ns (0 allocations: 0 bytes)
15.0

julia> @btime f(3.0, 10)
4.405 ns (0 allocations: 0 bytes)
30.0

but of course this will compile a new version of f for each new value of n. I’d be very happy to see dynamically-sized stack-allocated arrays in Julia.
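The Val pattern itself can be sketched without StaticArrays (function name `h` is made up for illustration); the kernel specializes on n, while the outer method pays a dynamic dispatch when n is only known at runtime:

```julia
# Kernel: n is a compile-time constant here, so the ntuple can unroll
# and nothing is heap allocated.
h(x, ::Val{n}) where {n} = sum(ntuple(_ -> x, Val(n)))

# Barrier: Val(n) for a runtime n forces a dynamic dispatch and compiles
# a fresh specialization of the kernel for every distinct n.
h(x, n::Integer) = h(x, Val(n))

h(3.0, Val(5))  # 15.0, fully static
h(3.0, 5)       # 15.0, but through the dynamic barrier
```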


#11

No, it’s totally safe. It’s NOT mutating an immutable. It’s mutating the mutable.


#12

Yes, that’s right. And the reason Julia doesn’t have this (yet) is that we don’t have variable-size objects. With that, one additional issue with stack-allocating runtime-sized arrays is stack overflow. I’m not sure whether people would be happier if the array starts heap-allocating past a certain size or if they get a stack overflow.


#13

Right, large arrays shouldn’t go on the stack. Fortran gets away with it because the documentation of the application often says to please issue “ulimit -s unlimited” before running the code…


#14

I think that for large objects, there is not much advantage to stack-allocation. I mean, the advantages are: hot cache, processor stack engine, no alloc/gc overhead.

If the object is small, then 10 cycles for object creation can make quite a difference.

Alloc overhead for large objects is irrelevant. I think we already free objects when their lifetime is known to the compiler, instead of waiting on an expensive GC pass?

But thanks for the explanation, I’ll need to play around more in order to see how to reliably get mutable stack-allocated objects. “Use MVector and force inlining” sounds like a good alternative to SVector.


#15

I’m sure it’s safe in practice, because compiler experts like yourself have reviewed that code, but “you can safely mutate tuples that are part of mutable structs” is certainly not part of the language semantics that users are told about.

GC.@preserve v unsafe_store!(Base.unsafe_convert(Ptr{T}, pointer_from_objref(v)), convert(T, val), i)

“Don’t do this at home, or your cat may segfault”


#16

What is the point you are trying to make? All these constructs are available to everyone, even if they are on a more advanced level.

  • If you are saying that it’s not easy to define your own mutable static arrays, I am inclined to agree. But StaticArrays.jl exists and is easy to use.

  • If you are saying that it’s hard to define your own mutable types that behave similarly, then I am inclined to disagree.

julia> using BenchmarkTools

julia> mutable struct Foo
           x::Int
           y::Int
       end

julia> function f(x)
           A = Foo(0, 2x)
           A.x = 3x
           A.x + A.y
       end
f (generic function with 1 method)

julia> @btime f(3)
  1.961 ns (0 allocations: 0 bytes)
15

julia> @code_warntype f(3)
Body::Int64
2 1 ─ %1 = (Base.mul_int)(2, x)::Int64                                                                                                                 │╻ *
3 │   %2 = (Base.mul_int)(3, x)::Int64                                                                                                                 │╻ *
  └──      goto #3 if not false                                                                                                                        │ 
  2 ─      nothing                                                                                                                                     │ 
4 3 ─ %5 = (Base.add_int)(%2, %1)::Int64                                                                                                               │╻ +
  └──      return %5 

#17

No, there’s nothing that’s mutating “tuples that are part of mutable structs”. You are mutating the mutable structs. Just look at that code: it never mentions the field name even once, so it never even accesses the field. It gets a pointer to the mutable struct, so it’s just mutating that.

Of course, since this uses pointers, it’s not part of the safe subset of the language, and what you can do with that pointer is identical to what you are allowed to do with an object pointer in C; you need to consult the ABI and C API to know what you could do. It’s not in the (Julia) language semantics in the sense that I usually use it, but it’s nonetheless well defined and documented.
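A tiny demonstration of that reading (type name made up): storing through the object pointer of a mutable struct is the same as assigning its first field.

```julia
mutable struct Counter
    n::Int
end

c = Counter(0)
# The object pointer of a mutable struct points at its inline data, so
# storing an Int at offset 0 writes the first field, exactly as c.n = 7
# would.
GC.@preserve c begin
    p = Base.unsafe_convert(Ptr{Int}, pointer_from_objref(c))
    unsafe_store!(p, 7)
end
c.n  # 7
```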


#18

You’re absolutely right, I hadn’t paid enough attention. Thank you for taking the time to explain it in detail. It’s an interesting optimization.


#19

Great example! We can even return the array by turning it into a static array once we’re done mutating it.

julia> using StaticArrays, BenchmarkTools

julia> function f(x)
           A = @MVector zeros(typeof(x), 4)
           for i in eachindex(A)
               A[i] = x 
           end
           SVector(A)
       end
f (generic function with 1 method)

julia> rx = Ref(2.3);

julia> (@SVector fill(2.3, 4)) === f(rx[])
true

julia> @benchmark @SVector fill(($rx)[], 4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.274 ns (0.00% GC)
  median time:      1.282 ns (0.00% GC)
  mean time:        1.358 ns (0.00% GC)
  maximum time:     19.759 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark f(($rx)[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.274 ns (0.00% GC)
  median time:      1.281 ns (0.00% GC)
  mean time:        1.313 ns (0.00% GC)
  maximum time:     12.268 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Unfortunately, this slightly more complicated example does not work:

julia> using SIMD, StaticArrays, BenchmarkTools

julia> using SIMD: VE

julia> function mul(v::SVector{3,T}, x::T) where {T}
           mv = MVector(v)
           mask = Vec{4,Bool}((VE(true),VE(true),VE(true),VE(false)))
           ptr_mv =  Base.unsafe_convert(Ptr{T}, pointer_from_objref(mv))
           v4 = vload(Vec{4,Float64}, ptr_mv, mask)
           vstore(v4 * x, ptr_mv, mask)
           SVector(mv)
       end
mul (generic function with 1 method)

julia> v3 = @SVector randn(3);

julia> @benchmark mul($v3, 2.0)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     30.117 ns (0.00% GC)
  median time:      30.872 ns (0.00% GC)
  mean time:        40.406 ns (20.03% GC)
  maximum time:     40.394 μs (99.89% GC)
  --------------
  samples:          10000
  evals/sample:     994

EDIT: Also worth pointing out that the above test was on a Haswell computer, which doesn’t have AVX-512 masked load/store instructions. At least, the assembly looked like:

	vmovapd	(%rsi), %ymm0
	vmaskmovpd	(%rcx), %ymm0, %ymm1
	vxorpd	%xmm2, %xmm2, %xmm2
	vblendpd	$8, %ymm2, %ymm1, %ymm1 # ymm1 = ymm1[0,1,2],ymm2[3]

instead of

	vmovupd	(%rax), %ymm0 {%k1} {z}