FWIW, Fortran compilers normally don’t have a problem putting all the arrays you’re mutating on the stack. For example, see the gfortran flag `-fstack-arrays`.
In Julia, `StaticArrays.MArray`s will also generally be placed on the stack, so long as they do not escape the function creating them or get passed as an argument to a function that isn’t inlined.
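For instance, here is a minimal sketch (not from the benchmarks below; `sum_mvec` is a name I made up) of checking that a non-escaping `MVector` doesn’t hit the heap:

```julia
using StaticArrays

function sum_mvec()
    x = MVector{3,Float64}(undef)  # never escapes this function
    x[1] = 1.0; x[2] = 2.0; x[3] = 3.0
    x[1] + x[2] + x[3]             # only a scalar is returned
end

sum_mvec()  # compile first, so @allocated measures only the call itself
# On recent Julia/StaticArrays versions, @allocated(sum_mvec()) should
# report 0 bytes, though the exact behavior is version-dependent.
```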
A feature request I have in Julia (maybe I’ll make a PR) is support for `alloca`, so it’s easier to stack-allocate memory that’s freed when the function returns, but that you can pass as an argument to non-inlined functions. `llvmcall` doesn’t work for this, because the memory is freed as soon as the `llvmcall` returns, not when the Julia function calling `llvmcall` returns.
```julia
julia> using StaticArrays, BenchmarkTools

julia> function f()
           s = 0.0
           for i in 1:1000
               x = SVector{3,Float64}(i, sqrt(i), i^2)
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
f (generic function with 1 method)

julia> function g()
           s = 0.0
           for i in 1:1000
               x = MVector{3,Float64}(undef)
               x[1] = i
               x[2] = sqrt(i)
               x[3] = i^2
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
g (generic function with 1 method)

julia> @btime f()
  2.605 μs (0 allocations: 0 bytes)
3.343550974558874e8

julia> @btime g()
  2.605 μs (0 allocations: 0 bytes)
3.343550974558874e8
```
However, `SVector`s are easier for the compiler; using the `MVector` can prevent some optimizations. See what happens when we add `@fastmath`:
```julia
julia> @fastmath function f()
           s = 0.0
           for i in 1:1000
               x = SVector{3,Float64}(i, sqrt(i), i^2)
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
f (generic function with 1 method)

julia> @fastmath function g()
           s = 0.0
           for i in 1:1000
               x = MVector{3,Float64}(undef)
               x[1] = i
               x[2] = sqrt(i)
               x[3] = i^2
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
g (generic function with 1 method)

julia> @btime f()
  744.343 ns (0 allocations: 0 bytes)
3.3435509745588744e8

julia> @btime g()
  2.604 μs (0 allocations: 0 bytes)
3.343550974558874e8
```
Suddenly `f` (using `SVector`s) is much faster, but the speed of `g()` is almost unchanged. My point here is that just because the `MVector` is mutable, and `getindex` and `setindex!` are defined using loads from and stores to its pointer, doesn’t mean it has to be heap allocated.
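To illustrate the flip side (a hedged sketch with made-up names, assuming escape analysis behaves as described above): the same `MVector` typically does get heap allocated once it escapes, e.g. by being returned to the caller:

```julia
using StaticArrays

# Returned to the caller => escapes, so the compiler generally must heap-allocate it.
make_mvec(i) = MVector{3,Float64}(i, sqrt(i), i^2)

# Stays local and only a scalar is returned => a candidate for stack allocation.
function use_locally(i)
    x = MVector{3,Float64}(i, sqrt(i), i^2)
    x[1] + x[2] + x[3]
end

# After compiling both once, @allocated(use_locally(2.0)) is typically 0,
# while @allocated(make_mvec(2.0)) is nonzero (exact behavior is version-dependent).
```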
EDIT: The simplest way to speed up `g` is to use `@inbounds`:
```julia
julia> @fastmath function h()
           s = 0.0
           @inbounds for i in 1:1000
               x = MVector{3,Float64}(undef)
               x[1] = i
               x[2] = sqrt(i)
               x[3] = i^2
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
h (generic function with 1 method)

julia> @btime h()
  744.317 ns (0 allocations: 0 bytes)
3.3435509745588744e8
```
A less simple way is to manually unroll the inner loop:
```julia
julia> @fastmath function g()
           s = 0.0
           for i in 1:1000
               x = MVector{3,Float64}(undef)
               x[1] = i
               x[2] = sqrt(i)
               x[3] = i^2
               Base.Cartesian.@nexprs 3 j -> begin
                   s = s + x[j]
               end
           end
           s
       end
g (generic function with 1 method)

julia> @btime g()
  744.324 ns (0 allocations: 0 bytes)
3.3435509745588744e8
```
I guess the original version is close to a complexity limit beyond which the compiler gives up optimizing it.
EDIT: It just occurred to me that AVX512 is probably needed for the `@fastmath` versions of these functions to be fast, because AVX512 is needed for SIMD `Int64 -> Float64` conversion. If you don’t have AVX512, I’d recommend keeping the counter as a `Float64`:
```julia
julia> @fastmath function h2()
           s = 0.0; i = 0.0
           @inbounds for _ in 1:1000
               x = MVector{3,Float64}(undef)
               i += 1
               x[1] = i
               x[2] = sqrt(i)
               x[3] = i^2
               for j in 1:3
                   s = s + x[j]
               end
           end
           s
       end
h2 (generic function with 1 method)

julia> @btime h2()
  711.417 ns (0 allocations: 0 bytes)
3.3435509745588744e8
```