I have a pretty basic question. I’m evaluating some function over an array of inputs:
@inline function f(x)
    y = @. log(x)
    return @. exp(y)
end
This results in extra allocations, and I thought that avoiding allocations would speed things up. However, when I time it, I find that the version with more allocations actually runs faster.
julia> x = collect(LinRange(1,10,1000000));
julia> @btime f($x);
12.157 ms (4 allocations: 15.26 MiB)
julia> @btime f.($x);
17.541 ms (2 allocations: 7.63 MiB)
Even evaluating in-place doesn’t seem to speed things up:
julia> function f_eval!(out,x,f::F) where {F}
           N = length(x)
           for i = 1:N
               out[i] = f(x[i])
           end
       end
f_eval! (generic function with 2 methods)
julia> @btime f_eval!($out,$x,$f);
17.524 ms (0 allocations: 0 bytes)
Is it known why the first version f(x), with more allocations, performs better? These runs are on Julia v1.6 with a single thread.
There are a couple of oddities here. You broadcast log and exp but don’t fuse the calls together, which would save one big allocation. In your second call, where you broadcast f itself, f is called on scalar Float64s, so the inner function doesn’t need to allocate anything. That might still be slower than version 1 if exp and log broadcast over whole arrays compile to better code, maybe with SIMD.
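For comparison, a fused version would look something like this (f_fused is just an illustrative name); it does everything in a single broadcast pass and never materializes the intermediate y:
# Fused: one broadcast, one output array, no temporary for log(x).
f_fused(x) = exp.(log.(x))    # equivalently: @. exp(log(x))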
The third version could be slow if bounds checking is the limiting factor; maybe try the setindex! call with @inbounds.
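Concretely, something along these lines (just a sketch; f_eval_inbounds! is an illustrative name, and the @assert is there so that skipping the bounds checks stays safe):
# Same loop as above, but with bounds checks elided.
function f_eval_inbounds!(out, x, f::F) where {F}
    @assert length(out) == length(x)
    @inbounds for i = 1:length(x)
        out[i] = f(x[i])
    end
    return out
end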
The differences I see are very small, but they do seem to exist. One reason appears to be bounds checking: if I assert that the input and output vectors have the same length, the bounds checks can presumably be elided and the final function gets slightly faster. But the differences here are small, and I am not completely sure they are meaningful.
julia> x = collect(LinRange(1,10,1000000));
julia> function f(x)
           y = @. log(x)
           @. exp(y)
       end
f (generic function with 1 method)
julia> function g(x)
           y = similar(x)
           for i in 1:length(x)
               y[i] = log(x[i])
           end
           for i in 1:length(x)
               y[i] = exp(y[i])
           end
           y
       end
g (generic function with 1 method)
julia> g(x) ≈ f(x)
true
julia> @btime f($x);
11.107 ms (4 allocations: 15.26 MiB)
julia> @btime g($x);
11.157 ms (2 allocations: 7.63 MiB)
julia> function h(x,y)
           for i in 1:length(x)
               y[i] = log(x[i])
           end
           for i in 1:length(x)
               y[i] = exp(y[i])
           end
           y
       end
h (generic function with 1 method)
julia> y = similar(x);
julia> @btime h($x,$y); # slightly slower
11.343 ms (0 allocations: 0 bytes)
julia> function h(x,y)
           @assert length(x) == length(y)
           for i in 1:length(x)
               y[i] = log(x[i])
           end
           for i in 1:length(x)
               y[i] = exp(y[i])
           end
           y
       end
h (generic function with 1 method)
julia> @btime h($x,$y); # slightly faster
11.091 ms (0 allocations: 0 bytes)
julia> function h(x,y)
           for i in 1:length(x)
               @inbounds y[i] = log(x[i])
           end
           for i in 1:length(x)
               @inbounds y[i] = exp(y[i])
           end
           y
       end
h (generic function with 1 method)
julia> @btime h($x,$y); # again slightly faster
10.901 ms (0 allocations: 0 bytes)
Thanks @jules. I didn’t fuse log/exp precisely because I wanted to force more allocations in the MWE (and mimic the actual functions I call in my code).
setindex! and @inbounds didn’t seem to make much of a difference.
julia> function f_eval!(out,x,f::F) where {F}
           N = length(x)
           @inbounds for i = 1:N
               setindex!(out, f(getindex(x,i)), i)
           end
       end
f_eval! (generic function with 1 method)
julia> @btime f_eval!($out,$x,$f);
17.620 ms (0 allocations: 0 bytes)
julia> using BenchmarkTools
julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];
julia> f(x) = exp.(log.(x))
f (generic function with 1 method)
julia> h(x) = begin tmp = log.(x); exp.(tmp) end
h (generic function with 1 method)
julia> @benchmark f($A)
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 1
--------------
minimum time: 104.237 ns (0.00% GC)
median time: 108.157 ns (0.00% GC)
mean time: 116.103 ns (0.70% GC)
maximum time: 623.623 ns (69.03% GC)
--------------
samples: 10000
evals/sample: 944
julia> @benchmark h($A)
BenchmarkTools.Trial:
memory estimate: 256 bytes
allocs estimate: 2
--------------
minimum time: 87.996 ns (0.00% GC)
median time: 92.171 ns (0.00% GC)
mean time: 97.015 ns (1.31% GC)
maximum time: 458.142 ns (68.08% GC)
--------------
samples: 10000
evals/sample: 958
julia> exp_log(x) = exp(log(x))
exp_log (generic function with 1 method)
julia> @code_llvm exp_log(1.0)
; @ REPL[8]:1 within `exp_log'
; Function Attrs: uwtable
define double @julia_exp_log_898(double %0) #0 {
top:
%1 = call double @j_log_900(double %0) #0
%2 = call double @j_exp_901(double %1) #0
ret double %2
}
Since a temporary array is created in h, the number of allocations is twice as big as for f, where only the output array needs to be allocated. However, the compiler seems to decide that it is worth inlining exp and log when the calls are not fused, but to emit real function calls when they are fused.
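For intuition, the two versions correspond roughly to the following (a sketch of how broadcast fusion groups the work, not the actual lowered code):
f_like(x) = broadcast(xi -> exp(log(xi)), x)    # fused: one pass, one output array
h_like(x) = broadcast(exp, broadcast(log, x))   # unfused: intermediate array + output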
Note that you can get a nice speedup using LoopVectorization.jl for your example.
julia> using BenchmarkTools, LoopVectorization
julia> f(x) = exp.(log.(x))
f (generic function with 1 method)
julia> f_avx(x) = @avx exp.(log.(x))
f_avx (generic function with 1 method)
julia> h_avx(x) = begin @avx tmp = log.(x); @avx exp.(tmp) end
h_avx (generic function with 1 method)
julia> A = [1.0, 2.0, 3.0, 4.0, 5.0];
julia> f(A) ≈ f_avx(A) ≈ h_avx(A)
true
julia> @benchmark f_avx($A)
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 1
--------------
minimum time: 56.795 ns (0.00% GC)
median time: 59.838 ns (0.00% GC)
mean time: 66.627 ns (2.38% GC)
maximum time: 1.098 μs (93.35% GC)
--------------
samples: 10000
evals/sample: 986
julia> @benchmark h_avx($A)
BenchmarkTools.Trial:
memory estimate: 256 bytes
allocs estimate: 2
--------------
minimum time: 78.704 ns (0.00% GC)
median time: 84.671 ns (0.00% GC)
mean time: 93.133 ns (3.04% GC)
maximum time: 1.015 μs (90.41% GC)
--------------
samples: 10000
evals/sample: 972
That’s more like what I would have expected in this case.
Interesting. How can you tell whether exp and log are inlined? I know you can use @code_typed to check inlining for a function you wrote, but does it work for special functions too?
When I call @code_llvm on f(x), I get (buried within the output)
Hmm, looks like they’re indeed not inlined. I thought I had seen some inlining in the other thread cited above, but maybe that was just too long ago and my memory isn’t good enough.