Julia's assignment behavior differs from Fortran?

FWIW, Fortran compilers normally don’t have a problem putting all the arrays you’re mutating on the stack. For example, see the gfortran flag -fstack-arrays.
In Julia, StaticArrays.MArrays will also generally be placed on the stack, so long as they do not escape the function creating them or get passed as an argument to a function that isn’t inlined.

A feature request I have in Julia (maybe I’ll make a PR) is support for alloca, so it’s easier to stack allocate memory that’s freed when the function returns, but that you can pass as an argument to non-inlined functions. llvmcall doesn’t work for this, because the memory is immediately freed when the llvmcall returns, not when the Julia function calling llvmcall returns.

julia> using StaticArrays, BenchmarkTools

julia> function f()
         s = 0.
         for i in 1:1000
           x = SVector{3,Float64}(i, sqrt(i), i^2)
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
f (generic function with 1 method)

julia> function g()
         s = 0.
         for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
g (generic function with 1 method)

julia> @btime f()
  2.605 μs (0 allocations: 0 bytes)
3.343550974558874e8

julia> @btime g()
  2.605 μs (0 allocations: 0 bytes)
3.343550974558874e8

However, SVectors are easier for the compiler; using the MVector can prevent some optimizations; see what happens when we add @fastmath:

julia> @fastmath function f()
         s = 0.
         for i in 1:1000
           x = SVector{3,Float64}(i, sqrt(i), i^2)
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
f (generic function with 1 method)

julia> @fastmath function g()
         s = 0.
         for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
g (generic function with 1 method)

julia> @btime f()
  744.343 ns (0 allocations: 0 bytes)
3.3435509745588744e8

julia> @btime g()
  2.604 μs (0 allocations: 0 bytes)
3.343550974558874e8

Suddenly f (using SVectors) is much faster, but the speed of g() is almost unchanged.

But my point here is that, just because the MVector is mutable and getindex and setindex! are defined using loads and stores to/from their pointers doesn’t mean it has to be heap allocated.

EDIT:
The simplest way to speed up g is to use @inbounds:

julia> @fastmath function h()
         s = 0.
         @inbounds for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
h (generic function with 1 method)

julia> @btime h()
  744.317 ns (0 allocations: 0 bytes)
3.3435509745588744e8

A less simple way is to manually unroll the inner loop:

julia> @fastmath function g()
         s = 0.
         for i in 1:1000
           x = MVector{3,Float64}(undef)
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           Base.Cartesian.@nexprs 3 j -> begin
             s = s + x[j]
           end
         end
         s
       end
g (generic function with 1 method)

julia> @btime g()
  744.324 ns (0 allocations: 0 bytes)
3.3435509745588744e8

I guess it’s close to a limit where the compiler gives up optimizing it.

EDIT:
It just occurred to me that AVX512 is probably needed for the @fastmath versions of this function to be fast, because AVX512 is needed for SIMD Int64 -> Float64 conversion.
If you don’t have AVX512, I’d recommend

julia> @fastmath function h2()
         s = 0.; i = 0.0
         @inbounds for _ in 1:1000
           x = MVector{3,Float64}(undef);
           i += 1
           x[1] = i
           x[2] = sqrt(i)
           x[3] = i^2
           for j in 1:3
             s = s + x[j]
           end
         end
         s
       end
h2 (generic function with 1 method)

julia> @btime h2()
  711.417 ns (0 allocations: 0 bytes)
3.3435509745588744e8
5 Likes