Why is the function evaluation with more allocations faster?

This seems to be the same reason as in Unexpected speed difference for differently chained functions - #3 by ranocha. The compiler inlines a special function such as exp or log when it is the only call in a broadcast kernel, but falls back to real function calls when two special function calls are chained.
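Concretely, dot fusion means that f broadcasts the composed function in a single pass, while h runs two separate broadcasts. Roughly (a sketch of the fusion semantics, not the exact lowered code):

f_fused(x)   = broadcast(y -> exp(log(y)), x)    # one pass, chained special-function calls
h_unfused(x) = broadcast(exp, broadcast(log, x)) # two passes, one special-function call each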

julia> using BenchmarkTools

julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];

julia> f(x) = exp.(log.(x))
f (generic function with 1 method)

julia> h(x) = begin tmp = log.(x); exp.(tmp) end
h (generic function with 1 method)

julia> @benchmark f($A)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     104.237 ns (0.00% GC)
  median time:      108.157 ns (0.00% GC)
  mean time:        116.103 ns (0.70% GC)
  maximum time:     623.623 ns (69.03% GC)
  --------------
  samples:          10000
  evals/sample:     944

julia> @benchmark h($A)
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  2
  --------------
  minimum time:     87.996 ns (0.00% GC)
  median time:      92.171 ns (0.00% GC)
  mean time:        97.015 ns (1.31% GC)
  maximum time:     458.142 ns (68.08% GC)
  --------------
  samples:          10000
  evals/sample:     958

julia> exp_log(x) = exp(log(x))
exp_log (generic function with 1 method)

julia> @code_llvm exp_log(1.0)
;  @ REPL[8]:1 within `exp_log'
; Function Attrs: uwtable
define double @julia_exp_log_898(double %0) #0 {
top:
  %1 = call double @j_log_900(double %0) #0
  %2 = call double @j_exp_901(double %1) #0
  ret double %2
}
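The two opaque @j_log/@j_exp calls above confirm that neither special function is inlined here. On Julia 1.8 and later (newer than this post), one could try to override that heuristic with the call-site @inline annotation; a hedged sketch, not benchmarked here:

# call-site @inline requires Julia >= 1.8 and is a hint, not a guarantee
exp_log_inlined(x) = @inline exp(@inline log(x))
f_inlined(x) = exp_log_inlined.(x)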

Since a temporary array is created in h, it allocates twice as much as f, which needs only a single output array. However, the compiler apparently decides to inline exp and log when each is called in its own broadcast, but performs real function calls when the two are fused into one kernel.
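If the allocations themselves are the concern, both versions can write into a preallocated buffer; the fused form then allocates nothing at all (a minimal sketch):

out = similar(A)
out .= exp.(log.(A))    # fused in-place broadcast, zero allocations
# equivalently: @. out = exp(log(A))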

Note that you can get a nice speedup using LoopVectorization.jl for your example.

julia> using BenchmarkTools, LoopVectorization

julia> f(x) = exp.(log.(x))
f (generic function with 1 method)

julia> f_avx(x) = @avx exp.(log.(x))
f_avx (generic function with 1 method)

julia> h_avx(x) = begin @avx tmp = log.(x); @avx exp.(tmp) end
h_avx (generic function with 1 method)

julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];

julia> f(A) ≈ f_avx(A) ≈ h_avx(A)
true

julia> @benchmark f_avx($A)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     56.795 ns (0.00% GC)
  median time:      59.838 ns (0.00% GC)
  mean time:        66.627 ns (2.38% GC)
  maximum time:     1.098 μs (93.35% GC)
  --------------
  samples:          10000
  evals/sample:     986

julia> @benchmark h_avx($A)
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  2
  --------------
  minimum time:     78.704 ns (0.00% GC)
  median time:      84.671 ns (0.00% GC)
  mean time:        93.133 ns (3.04% GC)
  maximum time:     1.015 μs (90.41% GC)
  --------------
  samples:          10000
  evals/sample:     972

That’s more like what I would have expected in this case.
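As an aside, LoopVectorization also exports vmap, which maps a function over an array elementwise using SIMD; assuming it handles the composed function (an untested sketch; note that later versions of the package renamed @avx to @turbo):

using LoopVectorization
f_vmap(x) = vmap(exp ∘ log, x)   # hypothetical alternative to the @avx broadcast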
