Reducing allocations to optimize (to match Java speed)

Your comment about teaching where allocations come from is right on target; I’m starting to understand better, but it has been a bit mysterious, and that was one of the main reasons I was asking for help.

For a bit of background, my reference point for comparison was the distinction in C between passing a value vs. passing a pointer. I’ve been trying to understand Julia allocations through a similar lens, which may or may not have been a helpful mental model.
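
As a quick sanity check on that analogy (this is just my own aside, not something from the earlier posts): an SVector is an immutable isbits type that Julia can keep on the stack or in registers, much like a C struct passed by value, while an MVector is a mutable struct that generally ends up on the heap unless the compiler can prove it doesn’t escape:

julia> using StaticArrays

julia> isbitstype(SVector{3,Float64})   # value-like; can be stored inline
true

julia> isbitstype(MVector{3,Float64})   # mutable; typically a heap allocation unless elided
false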

I also agree with your comments about tutorials: it should not be hard for most scientists and engineers to dive into Julia and get first-pass implementations and tests of solutions. The harder part is learning the approaches for diagnosing performance problems, and the coding styles that make that performance easy to achieve. A set of approved, high-quality tutorials for practitioners, addressing key topics, would help a lot.

Also, much of my training dates from the period when the discussion of SIMD vs. MIMD was a hot topic, with analogous hardware debates comparing Connection Machines vs. nCUBEs vs., eventually, clusters…

Note that creating an MArray and then passing it to another function will generally cause it to allocate memory. This means that for best performance, either

  1. Preallocate the MArray and keep reusing it (a sketch of this pattern follows the list).
  2. Make sure everything that uses it gets inlined.
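
For option 1, a minimal sketch of the preallocate-and-reuse pattern (the accumulation loop here is just illustrative, not from the code below):

julia> using StaticArrays

julia> xs = [(@SVector randn(3)) for _ in 1:10];

julia> buf = @MVector zeros(3);    # preallocate once, outside the hot loop

julia> for x in xs
           buf .+= x               # update in place; no new MArray per iteration
       end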

Once upon a time, I recall operations with two MArrays producing SArrays to avoid allocating. But that no longer seems to be the case:

julia> using StaticArrays

julia> x = @MVector randn(3); x2 = @MVector randn(3);

julia> y = @SVector randn(3);

julia> typeof(x - x2)
MArray{Tuple{3},Float64,1,3}

julia> typeof(x - y)
MArray{Tuple{3},Float64,1,3}

julia> typeof(y - x)
SArray{Tuple{3},Float64,1,3}

Also, note that you can copy and paste the above code into a REPL and it will run. The "julia> " prompts will automatically be stripped as long as the pasted text starts with a "julia> ".

Unfortunately, a lot of innocent-looking functions are not inlined. For example, * inside r below is not inlined, so you get allocations from creating an MVector for x - ξ.

julia> using BenchmarkTools, StaticArrays

julia> x = @MVector randn(3);  y = @SVector randn(3);

julia> function r(x, ξ)
           ((x-ξ)'*(x-ξ))^0.5
       end
r (generic function with 1 method)

julia> @benchmark r($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  80 bytes
  allocs estimate:  3
  --------------
  minimum time:     49.321 ns (0.00% GC)
  median time:      51.471 ns (0.00% GC)
  mean time:        69.426 ns (22.39% GC)
  maximum time:     71.816 μs (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     987

julia> @benchmark r($y, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.487 ns (0.00% GC)
  median time:      9.504 ns (0.00% GC)
  mean time:        9.537 ns (0.00% GC)
  maximum time:     22.032 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @generated function my_sdot(x::Union{SArray{S,T,N,L},MArray{S,T,N,L}},y::Union{SArray{S,T,N,L},MArray{S,T,N,L}}) where {S,T,N,L}
           quote
               $(Expr(:meta, :inline))
               @fastmath $(Expr(:call, :+, [:(x[$l]*y[$l]) for l in 1:L]...))
           end
       end
my_sdot (generic function with 1 method)
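
In case it helps, for L = 3 the body that the @generated function produces should be equivalent to this hand-written version (my_sdot_unrolled is just an illustrative name, not part of the code above):

julia> @inline my_sdot_unrolled(x, y) = @fastmath x[1]*y[1] + x[2]*y[2] + x[3]*y[3]
my_sdot_unrolled (generic function with 1 method)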

julia> function rv2(x, ξ)
           δ = (x-ξ)
           my_sdot(δ, δ) |> sqrt
       end
rv2 (generic function with 1 method)

julia> @benchmark rv2($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.302 ns (0.00% GC)
  median time:      8.312 ns (0.00% GC)
  mean time:        8.326 ns (0.00% GC)
  maximum time:     20.302 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark rv2($y, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.307 ns (0.00% GC)
  median time:      8.318 ns (0.00% GC)
  mean time:        8.335 ns (0.00% GC)
  maximum time:     26.754 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

The @code_native was basically the same for both versions. Interestingly, it also did not include any checks for domain errors on the square root: LLVM realized the argument was bounded below by zero. Cool.
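
For anyone who wants to check this themselves, the comparison is just (same definitions as above; output omitted here):

julia> @code_native rv2(x, y)

julia> @code_native rv2(y, x)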

It can take some playing around. Ideally, you could use a more reasonable dot function. My initial version was actually just a for loop:

julia> @inline function my_inlined_dot(x, y)
           out = zero(promote_type(eltype(x), eltype(y)))
           for i in eachindex(x,y)
               out += x[i] * y[i]
           end
           out
       end
my_inlined_dot (generic function with 1 method)

julia> function rv3(x, ξ)
           δ = (x-ξ)
           my_inlined_dot(δ, δ) |> sqrt
       end
rv3 (generic function with 1 method)

julia> @benchmark rv3($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     30.698 ns (0.00% GC)
  median time:      30.706 ns (0.00% GC)
  mean time:        30.778 ns (0.00% GC)
  maximum time:     42.336 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark rv3($y, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.304 ns (0.00% GC)
  median time:      8.326 ns (0.00% GC)
  mean time:        8.356 ns (0.00% GC)
  maximum time:     120.015 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

I don’t totally understand what happened there. LLVM did not realize the argument was bounded below by zero in the rv3(x, y) version of the above, and @code_native showed this code:

	vucomisd	%xmm0, %xmm1
;}
	ja	L89
; Location: math.jl:493
	vsqrtsd	%xmm0, %xmm0, %xmm0
;}}
	addq	$24, %rsp
	retq
; Function |>; {
; Location: operators.jl:813
; Function sqrt; {
; Location: math.jl:492
L89:
	movabsq	$throw_complex_domainerror, %rax
	movabsq	$140070075914192, %rdi  # imm = 0x7F649B1E57D0
	callq	*%rax

I used @fastmath to get rid of the check, but it had no real impact on runtime.
The only other difference I noticed in the assembly was that the rv3(x, y) version produced a few unnecessary vmov-type instructions.
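
For reference, the @fastmath variant would look something like this (rv3_fm is just an illustrative name; @fastmath sqrt drops the domain-error branch but, as noted, didn’t change the timings):

julia> function rv3_fm(x, ξ)
           δ = x - ξ
           @fastmath sqrt(my_inlined_dot(δ, δ))
       end
rv3_fm (generic function with 1 method)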

These tests are very interesting and insightful. My goal is to take these various examples and implement them first in my corrected MWE and then in the real code.

I’ll report back, but it will be at least a day or so; today must be dedicated to minor final revisions of a student’s dissertation :grinning: and grading final exams :grimacing: