Why is the function evaluation with more allocations faster?

I have a pretty basic question - I’m evaluating some function over an array of inputs.

@inline function f(x)
    y = @. log(x)
    return @. exp(y)
end

This results in extra allocations, and I thought avoiding allocations would speed it up. However, when I time it, I find that the runtime is actually lower for the version with more allocations.

julia> x = collect(LinRange(1,10,1000000));

julia> @btime f($x);
  12.157 ms (4 allocations: 15.26 MiB)

julia> @btime f.($x);
  17.541 ms (2 allocations: 7.63 MiB)

Even evaluating in place doesn’t seem to speed things up:

julia> function f_eval!(out,x,f::F) where {F}
       N = length(x)
       for i = 1:N
         out[i] = f(x[i])
       end
       end
f_eval! (generic function with 2 methods)

julia> @btime f_eval!($out,$x,$f);
  17.524 ms (0 allocations: 0 bytes)

Is it known why the first version f(x) with more allocations performs better? These runs are on v1.6 with a single thread.

There are a couple of oddities here. You broadcast log and exp but don’t fuse the calls together, which would save one big allocation. Your second call, where you broadcast f itself, means that f is called on scalar Float64s, so the inner function doesn’t need to allocate anything. That might still be slower than version 1 if exp and log broadcast over whole arrays compile to better code, maybe with SIMD.
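
For reference, a fused version (just a sketch; f_fused is a name I made up) allocates only the result array, so @btime should report the same 2 allocations / 7.63 MiB as f.($x) above:

f_fused(x) = @. exp(log(x))   # single fused broadcast: only the output array is allocated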

The third version could be slow if the bounds check is the deciding factor; try the setindex! call with @inbounds, maybe.
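
Roughly something like this (a sketch; f_eval_inbounds! is a hypothetical name, and iterating with eachindex over both arrays is my own addition, since it throws if the shapes differ and makes dropping the bounds checks safer):

function f_eval_inbounds!(out, x, f::F) where {F}
    @inbounds for i in eachindex(out, x)  # eachindex(out, x) errors unless the arrays share indices
        out[i] = f(x[i])
    end
    return out
end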

1 Like

The differences I see are very small, but they do seem to exist. One reason appears to be bounds checking: if I assert that the input and output vectors have the same length, the bounds checks can presumably be elided and the final function gets slightly faster. But the differences here are small, and I am not completely sure they are meaningful.

julia> x = collect(LinRange(1,10,1000000));

julia> function f(x)
         y = @. log(x)
         @. exp(y)
       end
f (generic function with 1 method)

julia> function g(x)
         y = similar(x)
         for i in 1:length(x)
           y[i] = log(x[i])
         end
         for i in 1:length(x)
           y[i] = exp(y[i])
         end
         y
       end
g (generic function with 1 method)

julia> g(x) ≈ f(x)
true

julia> @btime f($x);
  11.107 ms (4 allocations: 15.26 MiB)

julia> @btime g($x);
  11.157 ms (2 allocations: 7.63 MiB)

julia> function h(x,y)
         for i in 1:length(x)
           y[i] = log(x[i])
         end
         for i in 1:length(x)
           y[i] = exp(y[i])
         end
         y
       end
h (generic function with 1 method)

julia> y = similar(x);

julia> @btime h($x,$y);  # slightly slower
  11.343 ms (0 allocations: 0 bytes)

julia> function h(x,y)
         @assert length(x) == length(y)
         for i in 1:length(x)
           y[i] = log(x[i])
         end
         for i in 1:length(x)
           y[i] = exp(y[i])
         end
         y
       end
h (generic function with 1 method)

julia> @btime h($x,$y); # slightly faster
  11.091 ms (0 allocations: 0 bytes)

julia> function h(x,y)
         for i in 1:length(x)
           @inbounds y[i] = log(x[i])
         end
         for i in 1:length(x)
           @inbounds y[i] = exp(y[i])
         end
         y
       end
h (generic function with 1 method)

julia> @btime h($x,$y); # again slightly faster
  10.901 ms (0 allocations: 0 bytes)

1 Like

Thanks @jules. I didn’t fuse log/exp precisely because I wanted to force more allocations in the MWE (and mimic the actual functions I call in my code).

setindex! and @inbounds didn’t seem to make much difference.

julia> function f_eval!(out,x,f::F) where {F}
         N = length(x)
         @inbounds for i = 1:N
           setindex!(out, f(getindex(x, i)), i)
         end
       end
f_eval! (generic function with 1 method)

julia> @btime f_eval!($out,$x,$f);
  17.620 ms (0 allocations: 0 bytes)

Seems to be the same reason as in Unexpected speed difference for differently chained functions - #3 by ranocha: the compiler can inline a single broadcasted special function but decides not to do so when two special function calls are chained.

julia> using BenchmarkTools

julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];

julia> f(x) = exp.(log.(x))
f (generic function with 1 method)

julia> h(x) = begin tmp = log.(x); exp.(tmp) end
h (generic function with 1 method)

julia> @benchmark f($A)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     104.237 ns (0.00% GC)
  median time:      108.157 ns (0.00% GC)
  mean time:        116.103 ns (0.70% GC)
  maximum time:     623.623 ns (69.03% GC)
  --------------
  samples:          10000
  evals/sample:     944

julia> @benchmark h($A)
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  2
  --------------
  minimum time:     87.996 ns (0.00% GC)
  median time:      92.171 ns (0.00% GC)
  mean time:        97.015 ns (1.31% GC)
  maximum time:     458.142 ns (68.08% GC)
  --------------
  samples:          10000
  evals/sample:     958

julia> exp_log(x) = exp(log(x))
exp_log (generic function with 1 method)

julia> @code_llvm exp_log(1.0)
;  @ REPL[8]:1 within `exp_log'
; Function Attrs: uwtable
define double @julia_exp_log_898(double %0) #0 {
top:
  %1 = call double @j_log_900(double %0) #0
  %2 = call double @j_exp_901(double %1) #0
  ret double %2
}

Since a temporary array is created in h, the number of allocations is twice as big as for f, where only one array needs to be allocated. However, the compiler seems to decide that it’s good to inline exp and log when they are not fused, but to perform real function calls when they are fused.

Note that you can get a nice speedup using LoopVectorization.jl for your example.

julia> using BenchmarkTools, LoopVectorization

julia> f(x) = exp.(log.(x))
f (generic function with 1 method)

julia> f_avx(x) = @avx exp.(log.(x))
f_avx (generic function with 1 method)

julia> h_avx(x) = begin @avx tmp = log.(x); @avx exp.(tmp) end
h_avx (generic function with 1 method)

julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];

julia> f(A) ≈ f_avx(A) ≈ h_avx(A)
true

julia> @benchmark f_avx($A)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     56.795 ns (0.00% GC)
  median time:      59.838 ns (0.00% GC)
  mean time:        66.627 ns (2.38% GC)
  maximum time:     1.098 μs (93.35% GC)
  --------------
  samples:          10000
  evals/sample:     986

julia> @benchmark h_avx($A)
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  2
  --------------
  minimum time:     78.704 ns (0.00% GC)
  median time:      84.671 ns (0.00% GC)
  mean time:        93.133 ns (3.04% GC)
  maximum time:     1.015 μs (90.41% GC)
  --------------
  samples:          10000
  evals/sample:     972

That’s more like what I would have expected in this case.

4 Likes

Interesting. How can you tell that exp and log are inlined? I know you can use @code_typed to check inlining for a function you wrote, but does it work for special functions too?

When I call @code_llvm on f(x), I get (buried within the output)

                 %55 = call double @j_log_609(double %52)
                  ...(some broadcast output here)
                 %56 = call double @j_exp_610(double %55)

while for h(x), I get

               %55 = call double @j_log_615(double %52)
               %56 = call double @j_exp_616(double %55)

I’m not sure how to conclude that one function is inlined and the other isn’t based on @code_llvm output.
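
For reference, my understanding is that in @code_typed output a call that was not inlined shows up as an invoke expression, while an inlined call is replaced by its lowered body, so a quick check might look like this (a sketch; chained is just a throwaway name, and I’ve elided the printed CodeInfo since it varies across Julia versions):

julia> chained(x) = exp(log(x))
chained (generic function with 1 method)

julia> @code_typed chained(1.0)   # remaining `invoke` expressions for log/exp mean they were not inlined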

Hmm, looks like they’re indeed not inlined. I thought I had seen some inlining in the other thread cited above, but maybe that was just too long ago and my memory isn’t good enough :sweat_smile:

1 Like