Unexpected speed difference for differently chained functions

TL;DR: It looks like the compiler decides to inline sin and exp in your second (two-step) version but not in the combined (fused) version.

I used $ to interpolate the array into the benchmark expressions, which gives a bit more reliable results. Your code is basically equivalent to

julia> using BenchmarkTools

julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];

julia> f(x) = sin.(exp.(x))

julia> h(x) = begin tmp = exp.(x); sin.(tmp) end

julia> @benchmark f($A)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     128.500 ns (0.00% GC)
  median time:      137.732 ns (0.00% GC)
  mean time:        147.684 ns (0.80% GC)
  maximum time:     859.902 ns (74.62% GC)
  --------------
  samples:          10000
  evals/sample:     864

julia> @benchmark h($A)
BenchmarkTools.Trial: 
  memory estimate:  256 bytes
  allocs estimate:  2
  --------------
  minimum time:     105.619 ns (0.00% GC)
  median time:      112.411 ns (0.00% GC)
  mean time:        119.303 ns (1.69% GC)
  maximum time:     809.603 ns (78.31% GC)
  --------------
  samples:          10000
  evals/sample:     927

Since a temporary array is created in h, it allocates twice as much as f, where only the result array is needed. Nevertheless, h is faster: the compiler seems to decide that it is worthwhile to inline sin and exp when the broadcasts are not fused, but to emit real function calls when they are fused.
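Conceptually, the fused broadcast in f lowers to a single loop that computes sin(exp(x[i])) per element, while h runs two separate loops with the temporary in between. A hand-written sketch of the two patterns (the function names are mine, and this is only the rough equivalent, not the actual lowered code):

function f_loop(x)
    # one fused loop, one output array -- roughly what sin.(exp.(x)) does
    out = similar(x)
    for i in eachindex(x)
        out[i] = sin(exp(x[i]))
    end
    return out
end

function h_loop(x)
    # two loops and an extra temporary -- roughly what h does
    tmp = similar(x)
    for i in eachindex(x)
        tmp[i] = exp(x[i])
    end
    out = similar(tmp)
    for i in eachindex(tmp)
        out[i] = sin(tmp[i])
    end
    return out
end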

julia> sin_exp(x) = sin(exp(x))
sin_exp (generic function with 1 method)

julia> @code_llvm sin_exp(1.0)

;  @ REPL[24]:1 within `sin_exp'
define double @julia_sin_exp_584(double) {
top:
  %1 = call double @j_exp_585(double %0)
  %2 = call double @j_sin_586(double %1)
  ret double %2
}

For comparison, @code_llvm sin(exp(1.0)) shows the full body of sin(::Float64), which is far too long to paste here:

julia> @code_llvm sin(exp(1.0))
[...]

I also tried to use @inline and Base.@_inline_meta when defining sin_exp, but that didn’t help.
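For reference, the attempt looked roughly like this (sin_exp_inl is just a name I made up; the resulting IR still contained the two calls):

julia> @inline sin_exp_inl(x) = sin(exp(x))
sin_exp_inl (generic function with 1 method)

julia> @code_llvm sin_exp_inl(1.0)
[...]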

Note that you can get a nice speedup using LoopVectorization.jl for your example.

julia> using BenchmarkTools, LoopVectorization

julia> f(x) = sin.(exp.(x))
f (generic function with 1 method)

julia> f_avx(x) = @avx sin.(exp.(x))
f_avx (generic function with 1 method)

julia> h_avx(x) = begin @avx tmp = exp.(x); @avx sin.(tmp) end
h_avx (generic function with 1 method)

julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];

julia> f(A) ≈ f_avx(A) ≈ h_avx(A)
true

julia> @benchmark f_avx($A)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     46.067 ns (0.00% GC)
  median time:      49.266 ns (0.00% GC)
  mean time:        52.956 ns (2.50% GC)
  maximum time:     881.744 ns (84.21% GC)
  --------------
  samples:          10000
  evals/sample:     987

julia> @benchmark h_avx($A)
BenchmarkTools.Trial: 
  memory estimate:  256 bytes
  allocs estimate:  2
  --------------
  minimum time:     66.805 ns (0.00% GC)
  median time:      69.883 ns (0.00% GC)
  mean time:        76.609 ns (3.52% GC)
  maximum time:     1.106 μs (91.89% GC)
  --------------
  samples:          10000
  evals/sample:     975

With @avx, the fused version comes out ahead of the two-step one, which is more like what I would have expected in this case.
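As an aside, since the allocations show up clearly in the benchmarks above: if you can reuse an output buffer, an in-place fused broadcast avoids allocating anything at all (a minimal sketch; f! and out are my names):

julia> f!(out, x) = out .= sin.(exp.(x))
f! (generic function with 1 method)

julia> out = similar(A);

julia> f!(out, A) ≈ f(A)
true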
