Unexpected speed difference for differently chained functions

I want to chain several functions and apply them element-wise to an array. At first I used the obvious approach, f(A) = @. f1(f2(A)). However, I noticed that by defining the closures c1 = x -> f1.(x) and c2 = x -> f2.(x) and then composing them with f = ∘(c1, c2), I get roughly a 30% to 40% speedup, although the memory footprint doubles. Here is an example:

using BenchmarkTools

c_sin = x -> sin.(x)
c_exp = x -> exp.(x)

f(x) = @. sin(exp(x))
h = ∘(c_sin, c_exp)

A = randn(100,100)

display(@benchmark f(A))
display(@benchmark h(A))

The output on my machine for f is

BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     249.678 μs (0.00% GC)
  median time:      253.391 μs (0.00% GC)
  mean time:        254.927 μs (0.15% GC)
  maximum time:     506.079 μs (45.32% GC)
  --------------
  samples:          10000
  evals/sample:     1

and for h

BenchmarkTools.Trial: 
  memory estimate:  156.41 KiB
  allocs estimate:  4
  --------------
  minimum time:     164.288 μs (0.00% GC)
  median time:      174.107 μs (0.00% GC)
  mean time:        178.040 μs (0.64% GC)
  maximum time:     671.640 μs (73.76% GC)
  --------------
  samples:          10000
  evals/sample:     1

This behavior persists with more than two chained functions and across various sizes of A. What could cause this?


My gut feeling about the doubled memory is that @. sin(exp(x)) fuses the broadcast (it applies exp and sin to each element of A in a single pass), while ∘(c_sin, c_exp) doesn’t (it broadcasts exp over A into a temporary array of the same size, then broadcasts sin over that temporary). I don’t know why the fused version takes longer, though; if anything, I would have predicted it saves time by not needing to allocate the temporary array.
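To make the fusion hypothesis concrete, here is a rough sketch (my own paraphrase, not what the compiler literally emits) of the loops each version boils down to:

```julia
# `@. sin(exp(x))` fuses into a single pass with one output allocation:
function fused(A)
    out = similar(A)
    for i in eachindex(A)
        out[i] = sin(exp(A[i]))
    end
    return out
end

# `∘(c_sin, c_exp)` runs two separate broadcasts: two passes, two allocations:
function unfused(A)
    tmp = similar(A)          # temporary array, same size as A
    for i in eachindex(A)
        tmp[i] = exp(A[i])
    end
    out = similar(A)
    for i in eachindex(A)
        out[i] = sin(tmp[i])
    end
    return out
end
```

That second `similar(A)`-sized temporary is exactly where the doubled memory estimate would come from.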

TL;DR: It looks like the compiler decides to inline sin and exp in your composed (unfused) version, but not in the fused version, where the combined sin(exp(x)) kernel makes real function calls.

I interpolated the array with $ to get somewhat more reliable benchmark results. Your code is basically equivalent to

julia> using BenchmarkTools

julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];

julia> f(x) = sin.(exp.(x))

julia> h(x) = begin tmp = exp.(x); sin.(tmp) end

julia> @benchmark f($A)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     128.500 ns (0.00% GC)
  median time:      137.732 ns (0.00% GC)
  mean time:        147.684 ns (0.80% GC)
  maximum time:     859.902 ns (74.62% GC)
  --------------
  samples:          10000
  evals/sample:     864

julia> @benchmark h($A)
BenchmarkTools.Trial: 
  memory estimate:  256 bytes
  allocs estimate:  2
  --------------
  minimum time:     105.619 ns (0.00% GC)
  median time:      112.411 ns (0.00% GC)
  mean time:        119.303 ns (1.69% GC)
  maximum time:     809.603 ns (78.31% GC)
  --------------
  samples:          10000
  evals/sample:     927

Since h creates a temporary array, it allocates twice as much as f, which only needs to allocate the one output array. However, the compiler seems to decide that it is worthwhile to inline sin and exp when they are not fused, but to perform real function calls when they are fused.

julia> sin_exp(x) = sin(exp(x))
sin_exp (generic function with 1 method)

julia> @code_llvm sin_exp(1.0)

;  @ REPL[24]:1 within `sin_exp'
define double @julia_sin_exp_584(double) {
top:
  %1 = call double @j_exp_585(double %0)
  %2 = call double @j_sin_586(double %1)
  ret double %2
}

julia> @code_llvm sin(exp(1.0))
[...]

I also tried to use @inline and Base.@_inline_meta when defining sin_exp, but that didn’t help.
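One further experiment (a sketch I haven't benchmarked, so no numbers claimed) would be to bypass broadcasting entirely with a hand-written loop, so the compiler sees the scalar call chain directly rather than inside the broadcast kernel:

```julia
# Manual loop: inlining of `sin` and `exp` is decided at the scalar
# call site, with no broadcast machinery involved.
function f_loop(A)
    out = similar(A)
    @inbounds for i in eachindex(A)
        out[i] = sin(exp(A[i]))
    end
    return out
end
```

Comparing @benchmark f_loop($A) against f and h should indicate whether the overhead comes from the broadcast machinery or from the non-inlined calls themselves.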

Note that you can get a nice speedup using LoopVectorization.jl for your example.

julia> using BenchmarkTools, LoopVectorization

julia> f(x) = sin.(exp.(x))
f (generic function with 1 method)

julia> f_avx(x) = @avx sin.(exp.(x))
f_avx (generic function with 1 method)

julia> h_avx(x) = begin @avx tmp = exp.(x); @avx sin.(tmp) end
h_avx (generic function with 1 method)

julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];

julia> f(A) ≈ f_avx(A) ≈ h_avx(A)
true

julia> @benchmark f_avx($A)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     46.067 ns (0.00% GC)
  median time:      49.266 ns (0.00% GC)
  mean time:        52.956 ns (2.50% GC)
  maximum time:     881.744 ns (84.21% GC)
  --------------
  samples:          10000
  evals/sample:     987

julia> @benchmark h_avx($A)
BenchmarkTools.Trial: 
  memory estimate:  256 bytes
  allocs estimate:  2
  --------------
  minimum time:     66.805 ns (0.00% GC)
  median time:      69.883 ns (0.00% GC)
  mean time:        76.609 ns (3.52% GC)
  maximum time:     1.106 μs (91.89% GC)
  --------------
  samples:          10000
  evals/sample:     975

That’s more like what I would have expected in this case.


Interesting. I wonder if the lack of inlining has to do with broadcasting being lazily fused, but I don’t know how broadcasts are compiled or what the rules around inlining are, which is also why I’m surprised that sin.(exp.(x)) doesn’t inline while begin tmp = exp.(x); sin.(tmp) end does.