Why is SVector faster than MVector

Volker_Weissmann · February 12, 2021, 9:12pm

Why are these nearly identical function not compiled into the same llvm ir?

using StaticArrays
using BenchmarkTools
function funcS()
    ret = SVector(0,0)
    ret = ret + SVector(123, 321)
    return ret
end
function funcM()
    ret = MVector(0,0)
    ret += MVector(123, 321) 
    return ret
end
@benchmark funcS()
@benchmark funcM()

julia> @benchmark funcS()
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.013 ns (0.00% GC)
  median time:      0.015 ns (0.00% GC)
  mean time:        0.017 ns (0.00% GC)
  maximum time:     0.034 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark funcM()
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     5.457 ns (0.00% GC)
  median time:      6.470 ns (0.00% GC)
  mean time:        8.414 ns (8.61% GC)
  maximum time:     1.427 μs (99.17% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Mason · February 12, 2021, 9:13pm

MVector is mutable and escapes the function, so it gets heap allocated which is slower than a stack allocation like you’d get with a SVector.

Volker_Weissmann · February 12, 2021, 9:15pm

Why isn’t the MVector stack allocated? If I do

int ar[2];

in C, it is also a stack allocation. What julia code do I need to write if I want the equivalent of the C code above?

Mason · February 12, 2021, 9:23pm

I don’t know C so I can’t answer that. MVectors are not stack allocated because they are mutable, and mutable data structures only get stack allocated when Julia’s optimizer can see the entire lifetime of the data, so it can’t be allowed to escape the function body (modulo inlining).

The usual trick people use is to convert a SVector to MVector, mutate it, then convert back to SVector. If the compiler is in a good mood, this should be stack allocated.

Here’s an example:

using StaticArrays
using BenchmarkTools

function guncS(v)
    SVector{2}(v[1], v[2] + 1)
end

function guncM(v)
    mv = MVector(v)
    mv[2] += 1 # Mutation on the stack!
    SVector(mv)
end


let v = Ref(SVector(1, 2))
    @btime guncS($v[])
    @btime guncM($v[])
end;

#+RESULTS:
:   1.329 ns (0 allocations: 0 bytes)
:   1.329 ns (0 allocations: 0 bytes)

Note that I used Ref here to stop the function call from being hoisted out of the benchmark loop. Your funcS never actually ran because the compiler realized it was a compile time constnat. (Your CPU can’t do anything faster than a nanosecond)

Volker_Weissmann · February 12, 2021, 9:24pm

Ok, thank you

Oscar_Smith · February 12, 2021, 9:25pm

The reason Julia has to put ret on the heap is that you are returning it. In C, if you wrote

int* bad(void){
   int ar[2];
   return &ar
}

you would get a segfault since objects on the stack are dealocated when they go out of scope.

Volker_Weissmann · February 12, 2021, 9:26pm

Yeah, makes sense.
Nitpicking: It’s UB and it is unlikely that you get a segfault here. But reading garbage values would be realistic.

Mason · February 12, 2021, 9:28pm

To expand on this: semantically julia is designed so that everything acts as if it were heap allocated. Stack allocation is an optimization, and the compiler is designed to be able to not do certain optimizations if it decides so.

There’s a lot of thought and discussion right now going into questions like “how can we get better guarantees that the compiler will do certain optimizations we want”. Currently, the answer really just is that you can’t guarantee it, and all of these things are allowed to change between versions. This can be a blessing and a curse.

In practice, I think this really isn’t much of a problem for anyone but people with real-time requirements, and even they can deal with it if they’re careful enough, but Julia is not really an ideal language for such systems (yet).

Mason · February 13, 2021, 1:32am

A post from @Elrod in another thread reminded me: StrideArrays.jl is well worth looking into for stack allocated arrays that automate this sort of process: https://chriselrod.github.io/StrideArrays.jl/dev/stack_allocation/

Elrod · February 13, 2021, 2:12am

Long term, Keno has some revolutionary work planned on that front that will apply to regular arrays.
I’m not sure what the time horizon for that is though, e.g. he’s busy revolutionizing AD at the moment.

One of the problems in Julia compared to something like Fortran that makes this harder is that our optimization units are smaller.
If you’re compiling a huge Fortran program at once, the compiler can see that sure an array escapes the function that created it, passing it to another function, which it escapes from, which takes a slice that escapes, which…but then that one doesn’t escape, so it’s safe to stack allocate everything!

To add to the StrideArrays documentation, the unit tests have another example:

julia> using StrideArrays, Test

julia> @noinline function dostuff(A; B = I)
           StrideArrays.@avx for i in eachindex(A)
               A[i] += 1
           end
           C = A * B
           s = zero(eltype(C))
           StrideArrays.@avx for i in eachindex(C)
               s += C[i]
           end
           s
       end
dostuff (generic function with 1 method)

julia> function gc_preserve_test()
           A = @StrideArray rand(8,8);
           B = @StrideArray rand(8,8);
           @gc_preserve dostuff(A, B = B)
       end
gc_preserve_test (generic function with 1 method)

julia> @test iszero(@allocated gc_preserve_test())
Test Passed

julia> gc_preserve_test()
374.2746212074998

We create two random matrices, pass them to a function that isn’t inlined, mutate one of them, multiply them, and then sum the result – all without allocations.

The key is the @gc_preserve macro (which could be made a lot smarter than it is now; currently it only applies to a single function call), which basically just replaces all the arrays in the function call with PtrArrays – pointers + some more information like size and strides.
This stops the arrays from escaping – they don’t get passed to the function – so they can be stack allocated. The macro uses GC.@preserve to make sure that the arrays are nonetheless protected in the meantime, so it’s still GC-safe.

You can think of the macro as like a promise that the arrays don’t escape. If you do something like assign them to a global or push! them into a vector, you’ll get garbage.

Oscar_Smith · February 13, 2021, 2:30am

Can we just clone Keno? It would solve a lot of our problems.

Volker_Weissmann · February 13, 2021, 3:49pm

What is Keno?

js135005 · February 13, 2021, 4:01pm

Wrong question, You should have asked “Who is Keno?”

carstenbauer · February 13, 2021, 4:07pm

Topic		Replies	Views
Can Julia optimize mutable static arrays to be allocated on the stack? General Usage	9	1709	February 17, 2023
Allocations when pushing to SVector in a loop General Usage	4	542	February 6, 2019
Fortran vs Julia stack allocated arrays Performance array , memory-allocation	18	6523	November 29, 2018
StaticArrays and allocations General Usage	27	2272	May 31, 2021
How does SVector manage memory? New to Julia	7	583	December 18, 2019

Why is SVector faster than MVector

Related topics