# Performance of findmax vs. raw loop

I’m curious whether there is a high-level construct for finding both the maximum element and its index in a mapped array with performance similar to a raw loop. I’m comparing `f4` and `f5`:

```julia
using BenchmarkTools, Random, StaticArrays, LinearAlgebra

points = rand(MersenneTwister(12345), SVector{2, Float32}, 10_000)

function f4(p)
    total = 0.0
    det0  = det([p[1] p[end]])
    d0    = p[end] - p[1]

    (vmax, imax) = findmax(map(x -> abs(det([x d0]) + det0), p))
    total += vmax + p[imax][1] + p[imax][2]
end

function f5(p)
    total = 0.0
    det0  = det([p[1] p[end]])
    d0    = p[end] - p[1]
    vmax  = -Inf
    imax  = 0

    for i in eachindex(p)
        if abs(det([p[i] d0]) + det0) > vmax
            vmax = abs(det([p[i] d0]) + det0)
            imax = i
        end
    end

    total += vmax + p[imax][1] + p[imax][2]
end

println(f4(points))
println(f5(points))

println("$VERSION f4: $(minimum(@benchmark f4($points)))")
println("$VERSION f5: $(minimum(@benchmark f5($points)))")
```

and the raw loop in `f5` is noticeably faster:

```
1.5.0-rc1.0 f4: TrialEstimate(27.355 μs)
1.5.0-rc1.0 f5: TrialEstimate(22.350 μs)
```

`f4` allocates an array.
There should be a `findmax(f, v)` method, but there doesn’t seem to be one yet (I think there’s an open issue or PR for it).
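Until that lands, a hand-written helper gives you the raw-loop behavior without the intermediate array. A sketch (`findmax_by` is a made-up name; Julia 1.7 later added exactly this as `findmax(f, A)`):

```julia
# Allocation-free findmax over f.(v); assumes v is a non-empty array.
function findmax_by(f, v)
    imax = firstindex(v)
    vmax = f(v[imax])
    for i in Iterators.drop(eachindex(v), 1)
        fx = f(v[i])
        if fx > vmax
            vmax, imax = fx, i   # track the best value and its index
        end
    end
    return vmax, imax
end

findmax_by(abs, [-3, 1, 2])  # (3, 1)
```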


Performance of `findmax` aside, you are calculating

`abs(det([p[i] d0]) + det0)`

twice per iteration in your loop, which seems unnecessary.

That was intentional. This should be one of the easiest tricks in the compiler-optimization book to handle, but reality turned out slightly different:

```julia
function f5(p)
    total = 0.0
    det0  = det([p[1] p[end]])
    d0    = p[end] - p[1]
    vmax  = -Inf
    imax  = 0

    for i in eachindex(p)
        if abs(det([p[i] d0]) + det0) > vmax
            vmax = abs(det([p[i] d0]) + det0)
            imax = i
        end
    end

    total += vmax + p[imax][1] + p[imax][2]
end

function f6(p)
    total = 0.0
    det0  = det([p[1] p[end]])
    d0    = p[end] - p[1]
    vmax  = -Inf
    imax  = 0

    for i in eachindex(p)
        v = abs(det([p[i] d0]) + det0)

        if v > vmax
            vmax = abs(det([p[i] d0]) + det0)
            imax = i
        end
    end

    total += vmax + p[imax][1] + p[imax][2]
end
```

Julia 1.5.0-rc1 does well:

```
paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ ./julia -O3 --math-mode=fast ~/st/julia/bench7.jl
v"1.5.0-rc1.0" f5: 24.597 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 24.611 μs 	0 bytes 	1.7352712154388428
```

Julia 1.4.2 not so much:

```
paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ julia -O3 --math-mode=fast ~/st/julia/bench7.jl
v"1.4.2" f5: 22.356 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 29.043 μs 	0 bytes 	1.7352712154388428
```

although it optimizes `f5` better.

Oh, absolutely not. I would never expect the compiler to optimize that away when working with arrays.


Why not? There are no side effects here; it should be easy to prove that the two copies of `abs(det([p[i] d0]) + det0)` yield the same value. A modern C++ compiler would be unlikely to fail to elide the second evaluation.

I took another look at the measurements, and with bounds checking disabled both Julia versions evaluate this expression only once:

```
paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ ./julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.5.0-rc1.0" f5: 25.152 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.154 μs 	0 bytes 	1.7352712154388428
paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.4.2" f5: 24.592 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 24.594 μs 	0 bytes 	1.7352712154388428
```

Another thread or task could mutate the array between the two evaluations.

Good point, but somehow the compiler didn’t take this possibility into account in my example. Is there a way to state that explicitly, e.g. to declare that from this point on the container is read-only?

Not sure.
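For what it’s worth, Base does carry an experimental mechanism for this kind of promise: `Base.Experimental.Const` together with `Base.Experimental.@aliasscope`. Both are undocumented internals that may change, and I haven’t checked whether they help this particular loop; the sketch below just shows the shape of the API as I understand it:

```julia
# Hedged sketch of the experimental alias-scope machinery: inside an
# @aliasscope block, indexing through Const(a) promises the compiler
# that `a` is not mutated through any alias while the scope runs.
using Base.Experimental: Const, @aliasscope

function sum_readonly(a::Vector{Float64})
    s = 0.0
    @aliasscope for i in eachindex(a)
        s += Const(a)[i]  # read-only promise on `a`
    end
    return s
end

sum_readonly([1.0, 2.0, 3.0])
```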

It looks like Julia 1.5.0-rc1.0 has a performance regression on these tests. Not by much on Intel Skylake (see above), but noticeably worse on Intel Haswell:

```
v"1.4.2" f5: 22.148 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 22.144 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f5: 25.204 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.199 μs 	0 bytes 	1.7352712154388428
```

Note that starting Julia with `--math-mode=fast` is playing with fire:

```
> julia -O3 --math-mode=ieee -E "sinpi(0.15)"
0.45399049973954675
> julia -O3 --math-mode=fast -E "sinpi(0.15)"
0.0
```
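A more surgical alternative (my suggestion, not something from the thread) is the per-expression `@fastmath` macro, which rewrites only the annotated code and leaves the rest of the session under IEEE semantics:

```julia
# @fastmath swaps in fast-math variants only inside the annotated
# expression, unlike the process-wide --math-mode=fast flag.
function norm2_fast(x, y)
    @fastmath sqrt(x * x + y * y)
end

norm2_fast(3.0, 4.0)  # 5.0
```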

From the definition of `f6`:

```julia
for i in eachindex(p)
    v = abs(det([p[i] d0]) + det0)

    if v > vmax
        vmax = abs(det([p[i] d0]) + det0)
        imax = i
    end
end
```

You didn’t actually reduce how often you’re calling `det`.

You can use `LLVM.jl` for this if you want. But it helps in situations like writing to one array while loading from another: it can tell the compiler that the writes aren’t changing the loads, which could allow it to reorder those reads and writes and (for example) take advantage of SIMD.

I would be very surprised if it helps in your example, however.
The array allocations are not inlined:

```julia-repl
julia> @code_native debuginfo=:none syntax=:intel Array{Float64}(undef, 3)
	.text
	push    rax
	movabs  rdi, offset jl_system_image_data
	movabs  rax, offset jl_alloc_array_1d
	call    rax
	pop     rcx
	ret
	nop     dword ptr [rax]

julia> @code_native debuginfo=:none syntax=:intel Array{Float64}(undef, 3, 4)
	.text
	push    rax
	movabs  rdi, offset jl_system_image_data
	movabs  rax, offset jl_alloc_array_2d
	call    rax
	pop     rcx
	ret
	nop     dword ptr [rax]

julia> @code_native debuginfo=:none syntax=:intel Array{Float64}(undef, 3, 4, 5)
	.text
	push    rax
	movabs  rdi, 139781825231120
	movabs  rax, offset jl_alloc_array_3d
	call    rax
	pop     rcx
	ret
	nop     dword ptr [rax]
```

So what is actually going on is opaque to the compiler.

This may change some day. Keno said:

This is way off! I don’t think there is a similar problem with simple, non-accumulating `+`, `-`, `*`, `/`.
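The accumulating case is where reassociation bites, though: floating-point `+` is not associative, so any reordering the fast-math flags permit can change a reduction’s value. A minimal illustration:

```julia
# Floating-point addition is not associative, so a fast-math
# reassociated reduction can legitimately differ from the IEEE one.
left  = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)   # 0.6
left == right               # false
```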

It must have been one of those modern-life distractions. The results are more convoluted now. Intel Skylake:

```
paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.4.2" f5: 25.152 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 22.888 μs 	0 bytes 	1.7352712154388428
paul@desktop:~/Apps/julia-1.5.0-rc1/bin$ ./julia -O3 --math-mode=fast --check-bounds=no ~/st/julia/bench7.jl
v"1.5.0-rc1.0" f5: 25.153 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.168 μs 	0 bytes 	1.7352712154388428
```

Intel Haswell:

```
paul@extra:~/apps/julia-1.5.0-rc1/bin$ cset shield --exec julia -- -O3 --math-mode=fast --check-bounds=no bench7.jl
v"1.4.2" f5: 22.182 μs 	0 bytes 	1.7352712154388428
v"1.4.2" f6: 25.185 μs 	0 bytes 	1.7352712154388428
paul@extra:~/apps/julia-1.5.0-rc1/bin$ cset shield --exec ./julia -- -O3 --math-mode=fast --check-bounds=no bench7.jl
v"1.5.0-rc1.0" f5: 25.203 μs 	0 bytes 	1.7352712154388428
v"1.5.0-rc1.0" f6: 25.227 μs 	0 bytes 	1.7352712154388428
```

I can’t believe it got slower in some cases. Here is the new `f6`:

```julia
function f6(p)
    total = 0.0
    det0  = det([p[1] p[end]])
    d0    = p[end] - p[1]
    vmax  = -Inf
    imax  = 0

    for i in eachindex(p)
        v = abs(det([p[i] d0]) + det0)

        if v > vmax
            vmax = v
            imax = i
        end
    end

    total += vmax + p[imax][1] + p[imax][2]