Note that creating an MArray and then passing it to another function will generally cause it to allocate memory. This means that for best performance, either
- preallocate the MArray and keep reusing it, or
- make sure everything that uses it gets inlined.
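The preallocation approach can be sketched like this (a minimal example; the buffer name buf is made up for illustration). Broadcasting with .= writes into an existing MVector rather than constructing a new one, so the subtraction does not allocate a fresh MArray on each call:

```julia
using StaticArrays

a = @MVector randn(3)
b = @MVector randn(3)

# Preallocate the output once...
buf = @MVector zeros(3)

# ...then reuse it: the fused broadcast writes in place
# instead of allocating a new MArray.
buf .= a .- b
```
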
Once upon a time, I recall operations combining two MArrays producing SArrays to avoid allocating, but that no longer seems to be the case:
julia> using StaticArrays
julia> x = @MVector randn(3); x2 = @MVector randn(3);
julia> y = @SVector randn(3);
julia> typeof(x - x2)
MArray{Tuple{3},Float64,1,3}
julia> typeof(x - y)
MArray{Tuple{3},Float64,1,3}
julia> typeof(y - x)
SArray{Tuple{3},Float64,1,3}
Also, note that you can copy and paste the above code into a REPL and it will run: the "julia> " prompts are stripped automatically as long as the pasted text starts with "julia> ".
Unfortunately, a lot of innocent-looking functions are not inlined. For example, the * inside r below is not, so you pay for the allocation of the MVector created by x - ξ:
julia> using BenchmarkTools, StaticArrays
julia> x = @MVector randn(3); y = @SVector randn(3);
julia> function r(x, ξ)
           ((x-ξ)'*(x-ξ))^0.5
       end
r (generic function with 1 method)
julia> @benchmark r($x, $y)
BenchmarkTools.Trial:
memory estimate: 80 bytes
allocs estimate: 3
--------------
minimum time: 49.321 ns (0.00% GC)
median time: 51.471 ns (0.00% GC)
mean time: 69.426 ns (22.39% GC)
maximum time: 71.816 μs (99.85% GC)
--------------
samples: 10000
evals/sample: 987
julia> @benchmark r($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 9.487 ns (0.00% GC)
median time: 9.504 ns (0.00% GC)
mean time: 9.537 ns (0.00% GC)
maximum time: 22.032 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @generated function my_sdot(x::Union{SArray{S,T,N,L},MArray{S,T,N,L}}, y::Union{SArray{S,T,N,L},MArray{S,T,N,L}}) where {S,T,N,L}
           quote
               $(Expr(:meta, :inline))
               @fastmath $(Expr(:call, :+, [:(x[$l]*y[$l]) for l in 1:L]...))
           end
       end
my_sdot (generic function with 1 method)
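For a length-3 input (L = 3), the expression that my_sdot splices together is just the fully unrolled sum. Written out by hand, the generated body is equivalent to something like:

```julia
using StaticArrays

x = @MVector randn(3)
y = @MVector randn(3)

# The expression my_sdot generates for L = 3, unrolled manually:
manual = @fastmath x[1]*y[1] + x[2]*y[2] + x[3]*y[3]
```
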
julia> function rv2(x, ξ)
           δ = (x-ξ)
           my_sdot(δ, δ) |> sqrt
       end
rv2 (generic function with 1 method)
julia> @benchmark rv2($x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.302 ns (0.00% GC)
median time: 8.312 ns (0.00% GC)
mean time: 8.326 ns (0.00% GC)
maximum time: 20.302 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark rv2($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.307 ns (0.00% GC)
median time: 8.318 ns (0.00% GC)
mean time: 8.335 ns (0.00% GC)
maximum time: 26.754 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
The @code_native output was basically the same for both versions. Interestingly, it also did not include any check for a domain error on the square root: LLVM realized the argument was bounded below by zero. Cool.
It can take some playing around. Ideally, you could use a more reasonable dot function. My initial version was actually just a for loop:
julia> @inline function my_inlined_dot(x, y)
           out = zero(promote_type(eltype(x), eltype(y)))
           for i in eachindex(x, y)
               out += x[i] * y[i]
           end
           out
       end
my_inlined_dot (generic function with 1 method)
julia> function rv3(x, ξ)
           δ = (x-ξ)
           my_inlined_dot(δ, δ) |> sqrt
       end
rv3 (generic function with 1 method)
julia> @benchmark rv3($x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 30.698 ns (0.00% GC)
median time: 30.706 ns (0.00% GC)
mean time: 30.778 ns (0.00% GC)
maximum time: 42.336 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 994
julia> @benchmark rv3($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.304 ns (0.00% GC)
median time: 8.326 ns (0.00% GC)
mean time: 8.356 ns (0.00% GC)
maximum time: 120.015 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
I don’t totally understand what happened there. For the x - y ordering, LLVM did not realize the argument was bounded below by zero, and @code_native showed this code:
vucomisd %xmm0, %xmm1
;}
ja L89
; Location: math.jl:493
vsqrtsd %xmm0, %xmm0, %xmm0
;}}
addq $24, %rsp
retq
; Function |>; {
; Location: operators.jl:813
; Function sqrt; {
; Location: math.jl:492
L89:
movabsq $throw_complex_domainerror, %rax
movabsq $140070075914192, %rdi # imm = 0x7F649B1E57D0
callq *%rax
I used @fastmath to get rid of the check, but it had no real impact on runtime.
The only other difference I noticed in the assembly was that the x - y version produced a few unnecessary vmov-type instructions.
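Concretely, applying @fastmath to the square root can be sketched like this (rv3_fm is a name made up for illustration): @fastmath rewrites sqrt into a variant that skips the negative-argument branch, i.e. the throw_complex_domainerror call in the assembly above.

```julia
using StaticArrays

@inline function my_inlined_dot(x, y)
    out = zero(promote_type(eltype(x), eltype(y)))
    for i in eachindex(x, y)
        out += x[i] * y[i]
    end
    out
end

# @fastmath replaces sqrt with a version that omits the
# DomainError check for negative arguments.
function rv3_fm(x, ξ)
    δ = x - ξ
    @fastmath sqrt(my_inlined_dot(δ, δ))
end

x = @MVector randn(3)
y = @SVector randn(3)
```
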