TLDR: It looks like the compiler decides to inline `sin` and `exp` for your second version but not for the fused version. I used `$` interpolation to get a bit more reliable benchmark results. Your code is basically equivalent to
```julia
julia> using BenchmarkTools

julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];

julia> f(x) = sin.(exp.(x))

julia> h(x) = begin tmp = exp.(x); sin.(tmp) end

julia> @benchmark f($A)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     128.500 ns (0.00% GC)
  median time:      137.732 ns (0.00% GC)
  mean time:        147.684 ns (0.80% GC)
  maximum time:     859.902 ns (74.62% GC)
  --------------
  samples:          10000
  evals/sample:     864

julia> @benchmark h($A)
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  2
  --------------
  minimum time:     105.619 ns (0.00% GC)
  median time:      112.411 ns (0.00% GC)
  mean time:        119.303 ns (1.69% GC)
  maximum time:     809.603 ns (78.31% GC)
  --------------
  samples:          10000
  evals/sample:     927
```
Since a temporary array is created in `h`, the number of allocations is twice as big as for `f`, where only one array needs to be allocated. However, the compiler seems to decide that it's good to inline `sin` and `exp` when they are not fused, but to perform real function calls when they are fused.
```julia
julia> sin_exp(x) = sin(exp(x))
sin_exp (generic function with 1 method)

julia> @code_llvm sin_exp(1.0)
;  @ REPL[24]:1 within `sin_exp'
define double @julia_sin_exp_584(double) {
top:
  %1 = call double @j_exp_585(double %0)
  %2 = call double @j_sin_586(double %1)
  ret double %2
}

julia> @code_llvm sin(exp(1.0))
[...]
```
I also tried to use `@inline` and `Base.@_inline_meta` when defining `sin_exp`, but that didn't help.
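For what it's worth, writing the loop by hand keeps the single allocation of the fused version while still letting the compiler inline the scalar calls. This is just a sketch with a function name of my own choosing, not a measured comparison:

```julia
# Hand-written loop: one output allocation, scalar sin/exp calls
# that the compiler is free to inline.
function f_loop(x)
    out = similar(x)
    @inbounds for i in eachindex(x)
        out[i] = sin(exp(x[i]))
    end
    return out
end
```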
Note that you can get a nice speedup using LoopVectorization.jl for your example.
```julia
julia> using BenchmarkTools, LoopVectorization

julia> f(x) = sin.(exp.(x))
f (generic function with 1 method)

julia> f_avx(x) = @avx sin.(exp.(x))
f_avx (generic function with 1 method)

julia> h_avx(x) = begin @avx tmp = exp.(x); @avx sin.(tmp) end
h_avx (generic function with 1 method)

julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];

julia> f(A) ≈ f_avx(A) ≈ h_avx(A)
true

julia> @benchmark f_avx($A)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     46.067 ns (0.00% GC)
  median time:      49.266 ns (0.00% GC)
  mean time:        52.956 ns (2.50% GC)
  maximum time:     881.744 ns (84.21% GC)
  --------------
  samples:          10000
  evals/sample:     987

julia> @benchmark h_avx($A)
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  2
  --------------
  minimum time:     66.805 ns (0.00% GC)
  median time:      69.883 ns (0.00% GC)
  mean time:        76.609 ns (3.52% GC)
  maximum time:     1.106 μs (91.89% GC)
  --------------
  samples:          10000
  evals/sample:     975
```
That’s more like what I would have expected in this case.
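If the remaining allocation matters, an in-place variant writing into a preallocated buffer should remove it entirely. `f_avx!` is a name of my own choosing, not something from the packages above:

```julia
using LoopVectorization

# Hypothetical in-place variant: writes sin.(exp.(x)) into a
# preallocated output buffer, so the hot loop allocates nothing.
f_avx!(out, x) = @avx out .= sin.(exp.(x))
```

Reusing the buffer across calls, e.g. `@benchmark f_avx!($out, $A)` with `out = similar(A)`, should then report zero allocations.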