Performance of findmax vs. raw loop

I’m curious whether there is a high-level construct for finding the maximum element and its index in a mapped array that has performance similar to a raw loop. I’m comparing f4 and f5:

using BenchmarkTools, Random, StaticArrays, LinearAlgebra

points = rand(MersenneTwister(12345), SVector{2, Float32}, 10_000)

function f4(p)
  total = 0.
  det0  = det([p[1] p[end]])
  d0    = p[end] - p[1]

  (vmax, imax) = findmax(map(x -> abs(det([x d0]) + det0), p))
  total += vmax + p[imax][1] + p[imax][2]

  return total
end

function f5(p)
  total = 0.
  det0  = det([p[1] p[end]])
  d0    = p[end] - p[1]
  vmax  = -Inf
  imax  = 0

  for i in eachindex(p)
    if abs(det([p[i] d0]) + det0) > vmax
      vmax = abs(det([p[i] d0]) + det0)
      imax = i
    end
  end

  total += vmax + p[imax][1] + p[imax][2]

  return total
end

println(f4(points))
println(f5(points))

println("$VERSION f4: $(minimum(@benchmark f4($points)))")
println("$VERSION f5: $(minimum(@benchmark f5($points)))")

and the raw loop in f5 is noticeably faster:

1.5.0-rc1.0 f4: TrialEstimate(27.355 μs)
1.5.0-rc1.0 f5: TrialEstimate(22.350 μs)

f4 allocates an array.
There should be a version findmax(f, v), but there doesn’t seem to be one yet (I think it’s an open issue or PR).
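Until something like that lands in Base, a small generic helper (hypothetical name `findmaxf`, not a Base function) gives the same interface without allocating the intermediate array:

```julia
# Hypothetical stopgap for findmax(f, v): returns (f(v[i]), i)
# for the index i that maximizes f(v[i]), without allocating
# an intermediate mapped array.
function findmaxf(f, v)
    imax = firstindex(v)
    vmax = f(v[imax])
    for i in Iterators.drop(eachindex(v), 1)
        fv = f(v[i])
        if fv > vmax
            vmax = fv
            imax = i
        end
    end
    return vmax, imax
end

findmaxf(abs, [-3, 1, 2])  # (3, 1)
```

This is essentially what the raw loop in f5 does, just factored out so the mapping function is a parameter.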


EDIT: see also https://github.com/JuliaLang/julia/pull/35316


Performance of findmax aside, you are calculating

abs(det([p[i] d0]) + det0)

twice per iteration in your loop, which seems unnecessary.

That was intentional. Eliminating that duplication should be one of the easiest tricks in the compiler-optimization book, but reality is slightly different:

function f5(p)
  total = 0.
  det0  = det([p[1] p[end]])
  d0    = p[end] - p[1]
  vmax  = -Inf
  imax  = 0

  for i in eachindex(p)
    if abs(det([p[i] d0]) + det0) > vmax
      vmax = abs(det([p[i] d0]) + det0)
      imax = i
    end
  end

  total += vmax + p[imax][1] + p[imax][2]

  return total
end

function f6(p)
  total = 0.
  det0  = det([p[1] p[end]])
  d0    = p[end] - p[1]
  vmax  = -Inf
  imax  = 0

  for i in eachindex(p)
    v = abs(det([p[i] d0]) + det0)

    if v > vmax
      vmax = abs(det([p[i] d0]) + det0)
      imax = i
    end
  end

  total += vmax + p[imax][1] + p[imax][2]

  return total
end

Julia 1.5.0-rc1 does well:

paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ ./julia -O3 --math-mode=fast ~/st/julia/bench7.jl
v"1.5.0-rc1.0" f5: 24.597 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 24.611 μs 	0 bytes 	1.7352712154388428

Julia 1.4.2 not so much:

paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ julia -O3 --math-mode=fast ~/st/julia/bench7.jl
v"1.4.2" f5: 22.356 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 29.043 μs 	0 bytes 	1.7352712154388428

although 1.4.2 optimizes f5 itself better than 1.5.0-rc1 does.

Oh, absolutely not. I would never expect the compiler to optimize that away when working with arrays.


Why not? There are no side effects here; it should be easy to prove that the two copies of abs(det([p[i] d0]) + det0) have the same value. A modern C++ compiler would be unlikely to fail to elide the second evaluation.

I took another look at the measurements, and with bounds checking disabled both Julia versions evaluate this expression only once:

paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ ./julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.5.0-rc1.0" f5: 25.152 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.154 μs 	0 bytes 	1.7352712154388428
paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.4.2" f5: 24.592 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 24.594 μs 	0 bytes 	1.7352712154388428

Some other thread (or other code running concurrently) could mutate the array.

Good point, but somehow the compiler didn’t take this possibility into account in my example. Is there an option to state that explicitly, e.g. that from this point on the container is read-only?

Not sure.

It looks like Julia 1.5.0-rc1.0 has a performance regression on these tests. Not a lot on Intel Skylake (see above), but much worse on Intel Haswell:

v"1.4.2" f5: 22.148 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 22.144 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f5: 25.204 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.199 μs 	0 bytes 	1.7352712154388428

Note that starting Julia with --math-mode=fast is playing with fire:

> julia -O3 --math-mode=ieee -E "sinpi(0.15)"
0.45399049973954675
> julia -O3 --math-mode=fast -E "sinpi(0.15)"
0.0
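If fast-math semantics are wanted at all, the `@fastmath` macro applied to a single expression or function keeps the blast radius contained: only the annotated code is rewritten, and library functions like `sinpi` keep their IEEE behavior. A small sketch (`fsum` is just an illustrative name):

```julia
# @fastmath rewrites only the annotated expression; code outside
# it, including Base functions like sinpi, keeps IEEE semantics.
fsum(a, b, c) = @fastmath a + b + c

fsum(0.1, 0.2, 0.3)  # ≈ 0.6
sinpi(0.15)          # still ≈ 0.454, unlike under --math-mode=fast
```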

From the definition of f6:

  for i in eachindex(p)
    v = abs(det([p[i] d0]) + det0)

    if v > vmax
      vmax = abs(det([p[i] d0]) + det0)
      imax = i
    end
  end

You didn’t actually reduce how often you’re calling det.

If you want to use LLVM.jl, you can. But it helps in situations like writing to one array while loading from another: it can tell the compiler that the writes don’t change the loads, which allows it to reorder those reads and writes and, for example, take advantage of SIMD.

I would be very surprised if it helped in your example, however.
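Without reaching for LLVM.jl, Base already offers a weaker form of that promise: `@simd` asserts that the loop’s iterations may be reordered, which is often enough to vectorize a reduction over a plain numeric array. A minimal sketch:

```julia
# @simd tells the compiler the iterations are reorderable, enabling
# vectorization of this sum; floating-point results may differ in
# the last bits because additions get reassociated.
function simdsum(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end

simdsum(collect(1.0:100.0))  # 5050.0
```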
The array allocations are not inlined:

julia> @code_native debuginfo=:none syntax=:intel Array{Float64}(undef, 3)
        .text
        push    rax
        movabs  rdi, offset jl_system_image_data
        movabs  rax, offset jl_alloc_array_1d
        call    rax
        pop     rcx
        ret
        nop     dword ptr [rax]

julia> @code_native debuginfo=:none syntax=:intel Array{Float64}(undef, 3, 4)
        .text
        push    rax
        movabs  rdi, offset jl_system_image_data
        movabs  rax, offset jl_alloc_array_2d
        call    rax
        pop     rcx
        ret
        nop     dword ptr [rax]

julia> @code_native debuginfo=:none syntax=:intel Array{Float64}(undef, 3, 4, 5)
        .text
        push    rax
        movabs  rdi, 139781825231120
        movabs  rax, offset jl_alloc_array_3d
        call    rax
        pop     rcx
        ret
        nop     dword ptr [rax]

So what is actually going on is opaque to the compiler.

This may change some day. Keno said:

This is way off! I don’t think there is a similar problem with simple, non-accumulating +, -, *, /.

It must have been one of those modern-life distractions. The results are more convoluted now. Intel Skylake:

paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.4.2" f5: 25.152 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 22.888 μs 	0 bytes 	1.7352712154388428
paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ ./julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.5.0-rc1.0" f5: 25.153 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.168 μs 	0 bytes 	1.7352712154388428

Intel Haswell:

paul@extra:~/apps/julia-1.5.0-rc1/bin$ cset shield --exec julia -- -O3 --math-mode=fast --check-bounds=no bench7.jl
v"1.4.2" f5: 22.182 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 25.185 μs 	0 bytes 	1.7352712154388428
paul@extra:~/apps/julia-1.5.0-rc1/bin$ cset shield --exec ./julia -- -O3 --math-mode=fast --check-bounds=no bench7.jl
v"1.5.0-rc1.0" f5: 25.203 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.227 μs 	0 bytes 	1.7352712154388428

I can’t believe it got slower in some cases. Here is the new f6:

function f6(p)
  total = 0.
  det0  = det([p[1] p[end]])
  d0    = p[end] - p[1]
  vmax  = -Inf
  imax  = 0

  for i in eachindex(p)
    v = abs(det([p[i] d0]) + det0)

    if v > vmax
      vmax = v
      imax = i
    end
  end

  total += vmax + p[imax][1] + p[imax][2]

  return total
end