Comparing performance of 2 simple averaging functions - why is one faster?

Below I include code for two simple functions that compute averages of pairs of adjacent values in an input 1D vector. Function f2 is almost twice as fast as f3 - I’d like to understand why! If anything, I would have expected them to be the same or for f3 to be better, since I’d think the Julia system would automagically be able to optimize a vector function with lots of dots. Allocations are the same, so there is something else going on.

If I try @code_llvm the output for f3 is extremely long though…

julia> function f2(d::Array{Float64,1})
           [ (d[i]+d[i+1])/2.0 for i in 1:length(d)-1]
       end
f2 (generic function with 1 method)

julia> function f3(d::Array{Float64,1})
           (d[1:end-1] .+ d[2:end]) ./= 2.0
       end
f3 (generic function with 1 method)

julia> a = [i for i in 1.0:100.0];

julia> using BenchmarkTools

julia> @benchmark f2($a)
BenchmarkTools.Trial: 
  memory estimate:  944 bytes
  allocs estimate:  3
  --------------
  minimum time:     155.671 ns (0.00% GC)
  median time:      182.914 ns (0.00% GC)
  mean time:        191.812 ns (5.42% GC)
  maximum time:     1.047 μs (72.99% GC)
  --------------
  samples:          10000
  evals/sample:     800

julia> @benchmark f3($a)
BenchmarkTools.Trial: 
  memory estimate:  2.63 KiB
  allocs estimate:  3
  --------------
  minimum time:     223.421 ns (0.00% GC)
  median time:      304.436 ns (0.00% GC)
  mean time:        363.418 ns (6.18% GC)
  maximum time:     1.563 μs (77.07% GC)
  --------------
  samples:          10000
  evals/sample:     535

d[1:end-1] and d[2:end] are both copies. If you instead write

function f4(d::Array{Float64,1})
    @views (d[1:end-1] .+ d[2:end]) ./= 2.0
end

it should be as fast. For a generic version, consider

function f4(d::Vector)
    two_T = one(eltype(d))+one(eltype(d))
    @views (d[1:end-1] .+ d[2:end]) ./= two_T =
end

it should be just as fast, but will work for arbitrarily typed Vectors

3 Likes

Ah, perfect - thanks. This is really helpful, because I’ve also been trying to work to learn how to intelligently use views so your post helps me learn two things at the same time!

the behavior of functions f4 (correcting the typo “=” sign on the second) is interesting on my machine - the median benchmark time is consistently 25% faster than for f2, but the mean for f2 is a bit better (just a few percent though).

I’m pretty sure it’s no better than just dividing by the integer 2. Do you have any example where 2 falls short?

1 Like

I think this means that the division by two has to be done on a separate pass, acting in-place on the newly created array, instead of being done along with the .+. A bigger cost at larger sizes, but visible here:

julia> @btime @views ($a[1:end-1] .+ $a[2:end]) ./= 2;
  92.228 ns (1 allocation: 896 bytes)

julia> @btime @views ($a[1:end-1] .+ $a[2:end]) ./ 2;
  72.519 ns (1 allocation: 896 bytes)

julia> @btime @views ($a[1:end-1] .+ $a[2:end]) .* (1/2);
  72.702 ns (1 allocation: 896 bytes)

Also, f2 benefits from @inbounds:

julia> @btime [ ($a[i]+$a[i+1])/2 for i in 1:length($a)-1 ];
  135.400 ns (1 allocation: 896 bytes)

julia> @btime [ @inbounds($a[i]+$a[i+1])/2 for i in 1:length($a)-1 ];
  78.197 ns (1 allocation: 896 bytes)
1 Like

Good tip - thanks. The other examples are insightful too!