Performance of findmax vs. raw loop

I’m curious whether there is a high-level construct for finding the maximum element and its index in a mapped array that has performance similar to a raw loop. I’m comparing f4 and f5:

using BenchmarkTools, Random, StaticArrays, LinearAlgebra

points = rand(MersenneTwister(12345), SVector{2, Float32}, 10_000)

function f4(p)
  total = 0.
  det0  = det([p[1] p[end]])
  d0    = p[end] - p[1]

  (vmax, imax) = findmax(map(x -> abs(det([x d0]) + det0), p))
  total += vmax + p[imax][1] + p[imax][2]

  return total
end

function f5(p)
  total = 0.
  det0  = det([p[1] p[end]])
  d0    = p[end] - p[1]
  vmax  = -Inf
  imax  = 0

  for i in eachindex(p)
    if abs(det([p[i] d0]) + det0) > vmax
      vmax = abs(det([p[i] d0]) + det0)
      imax = i
    end
  end

  total += vmax + p[imax][1] + p[imax][2]

  return total
end

println(f4(points))
println(f5(points))

println("$VERSION f4: $(minimum(@benchmark f4($points)))")
println("$VERSION f5: $(minimum(@benchmark f5($points)))")

and the raw loop in f5 is noticeably faster:

1.5.0-rc1.0 f4: TrialEstimate(27.355 μs)
1.5.0-rc1.0 f5: TrialEstimate(22.350 μs)

f4 allocates an array.
There should be a version findmax(f, v), but there doesn’t seem to be one yet (I think it’s an open issue or PR).
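Until something like that lands in Base, a small generic helper (hypothetical name `findmaxf`, not a Base function) gives the same interface without allocating the intermediate array:

```julia
# Hypothetical stopgap for findmax(f, v): returns (f(v[i]), i)
# for the index i that maximizes f(v[i]), without allocating
# an intermediate mapped array.
function findmaxf(f, v)
    imax = firstindex(v)
    vmax = f(v[imax])
    for i in Iterators.drop(eachindex(v), 1)
        fv = f(v[i])
        if fv > vmax
            vmax = fv
            imax = i
        end
    end
    return vmax, imax
end

findmaxf(abs, [-3, 1, 2])  # (3, 1)
```

This is essentially what the raw loop in f5 does, just factored out so the mapping function is a parameter.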


EDIT: see also https://github.com/JuliaLang/julia/pull/35316


Performance of findmax aside, you are calculating

abs(det([p[i] d0]) + det0)

twice per iteration in your loop, which seems unnecessary.

That was intentional. Eliminating that duplication should be one of the easiest tricks in the compiler-optimization book, but reality is slightly different:

function f5(p)
  total = 0.
  det0  = det([p[1] p[end]])
  d0    = p[end] - p[1]
  vmax  = -Inf
  imax  = 0

  for i in eachindex(p)
    if abs(det([p[i] d0]) + det0) > vmax
      vmax = abs(det([p[i] d0]) + det0)
      imax = i
    end
  end

  total += vmax + p[imax][1] + p[imax][2]

  return total
end

function f6(p)
  total = 0.
  det0  = det([p[1] p[end]])
  d0    = p[end] - p[1]
  vmax  = -Inf
  imax  = 0

  for i in eachindex(p)
    v = abs(det([p[i] d0]) + det0)

    if v > vmax
      vmax = abs(det([p[i] d0]) + det0)
      imax = i
    end
  end

  total += vmax + p[imax][1] + p[imax][2]

  return total
end

Julia 1.5.0-rc1 does well:

paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ ./julia -O3 --math-mode=fast ~/st/julia/bench7.jl
v"1.5.0-rc1.0" f5: 24.597 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 24.611 μs 	0 bytes 	1.7352712154388428

Julia 1.4.2 not so much:

paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ julia -O3 --math-mode=fast ~/st/julia/bench7.jl
v"1.4.2" f5: 22.356 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 29.043 μs 	0 bytes 	1.7352712154388428

although 1.4.2 optimizes f5 itself better than 1.5.0-rc1 does.

Oh, absolutely not. I would never expect the compiler to optimize that away when working with arrays.


Why not? There are no side effects here; it should be easy to prove that the two copies of abs(det([p[i] d0]) + det0) have the same value. A modern C++ compiler would be unlikely to fail to elide the second evaluation.

I took another look at the measurements, and with bounds checking disabled both Julia versions evaluate this expression only once:

paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ ./julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.5.0-rc1.0" f5: 25.152 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.154 μs 	0 bytes 	1.7352712154388428
paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.4.2" f5: 24.592 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 24.594 μs 	0 bytes 	1.7352712154388428

Some other thread (or other code running concurrently) could mutate the array.

Good point, but somehow the compiler didn’t take this possibility into account in my example. Is there an option to state that explicitly, e.g. that from this point on the container is read-only?

Not sure.

It looks like Julia 1.5.0-rc1.0 has a performance regression on these tests. Not a lot on Intel Skylake (see above), but much worse on Intel Haswell:

v"1.4.2" f5: 22.148 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 22.144 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f5: 25.204 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.199 μs 	0 bytes 	1.7352712154388428

Note that starting Julia with --math-mode=fast is playing with fire:

> julia -O3 --math-mode=ieee -E "sinpi(0.15)"
0.45399049973954675
> julia -O3 --math-mode=fast -E "sinpi(0.15)"
0.0
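If fast-math semantics are wanted at all, the `@fastmath` macro applied to a single expression or function keeps the blast radius contained: only the annotated code is rewritten, and library functions like `sinpi` keep their IEEE behavior. A small sketch (`fsum` is just an illustrative name):

```julia
# @fastmath rewrites only the annotated expression; code outside
# it, including Base functions like sinpi, keeps IEEE semantics.
fsum(a, b, c) = @fastmath a + b + c

fsum(0.1, 0.2, 0.3)  # ≈ 0.6
sinpi(0.15)          # still ≈ 0.454, unlike under --math-mode=fast
```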

From the definition of f6:

  for i in eachindex(p)
    v = abs(det([p[i] d0]) + det0)

    if v > vmax
      vmax = abs(det([p[i] d0]) + det0)
      imax = i
    end
  end

You didn’t actually reduce how often you’re calling det.

If you want to use LLVM.jl, you can. But it helps in situations like writing to one array while loading from another: it can tell the compiler that the writes don’t change the loads, which allows it to reorder those reads and writes and, for example, take advantage of SIMD.

I would be very surprised if it helped in your example, however.
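Without reaching for LLVM.jl, Base already offers a weaker form of that promise: `@simd` asserts that the loop’s iterations may be reordered, which is often enough to vectorize a reduction over a plain numeric array. A minimal sketch:

```julia
# @simd tells the compiler the iterations are reorderable, enabling
# vectorization of this sum; floating-point results may differ in
# the last bits because additions get reassociated.
function simdsum(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end

simdsum(collect(1.0:100.0))  # 5050.0
```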
The array allocations are not inlined:

julia> @code_native debuginfo=:none syntax=:intel Array{Float64}(undef, 3)
        .text
        push    rax
        movabs  rdi, offset jl_system_image_data
        movabs  rax, offset jl_alloc_array_1d
        call    rax
        pop     rcx
        ret
        nop     dword ptr [rax]

julia> @code_native debuginfo=:none syntax=:intel Array{Float64}(undef, 3, 4)
        .text
        push    rax
        movabs  rdi, offset jl_system_image_data
        movabs  rax, offset jl_alloc_array_2d
        call    rax
        pop     rcx
        ret
        nop     dword ptr [rax]

julia> @code_native debuginfo=:none syntax=:intel Array{Float64}(undef, 3, 4, 5)
        .text
        push    rax
        movabs  rdi, 139781825231120
        movabs  rax, offset jl_alloc_array_3d
        call    rax
        pop     rcx
        ret
        nop     dword ptr [rax]

So what is actually going on is opaque to the compiler.

This may change some day. Keno said:

This is way off! I don’t think there is a similar problem with simple, non-accumulating +, -, *, /.

It must have been one of those modern-life distractions. The results are more convoluted now. Intel Skylake:

paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.4.2" f5: 25.152 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 22.888 μs 	0 bytes 	1.7352712154388428
paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ ./julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.5.0-rc1.0" f5: 25.153 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.168 μs 	0 bytes 	1.7352712154388428

Intel Haswell:

paul@extra:~/apps/julia-1.5.0-rc1/bin$ cset shield --exec julia -- -O3 --math-mode=fast --check-bounds=no bench7.jl
v"1.4.2" f5: 22.182 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 25.185 μs 	0 bytes 	1.7352712154388428
paul@extra:~/apps/julia-1.5.0-rc1/bin$ cset shield --exec ./julia -- -O3 --math-mode=fast --check-bounds=no bench7.jl
v"1.5.0-rc1.0" f5: 25.203 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.227 μs 	0 bytes 	1.7352712154388428

I can’t believe it got slower in some cases. Here is the new f6:

function f6(p)
  total = 0.
  det0  = det([p[1] p[end]])
  d0    = p[end] - p[1]
  vmax  = -Inf
  imax  = 0

  for i in eachindex(p)
    v = abs(det([p[i] d0]) + det0)

    if v > vmax
      vmax = v
      imax = i
    end
  end

  total += vmax + p[imax][1] + p[imax][2]

  return total
end