Running small functions bundled in an outer function taking twice the time of running them separately

I’m working on optimizing some code. I isolated small functions, benchmarked them, and was happy with their speed. But when I bundle those functions in an outer function, the timing is almost twice the sum of the isolated functions… Any hints on what’s wrong?

Here is a simplified version of my code:

function toPolar!(x)
    for i = 1:size(x, 1)
        x[i, 1], x[i, 2] = sqrt(x[i, 1]^2 + x[i, 2]^2), atan(x[i, 2], x[i, 1])
    end
end

function toCartesian!(x)
    for i = 1:size(x, 1)
        x[i, 1], x[i, 2] = x[i, 1] * cos(x[i, 2]), x[i, 1] * sin(x[i, 2])
    end
end

function move!(x, v, T)
    for i = 1:size(x, 1)
        x[i, 1] += v[i, 1] * T
        x[i, 2] += v[i, 2] * T
    end
end

function outerFunction!(x, v, T)
    toCartesian!(x)
    move!(x, v, T)
    toPolar!(x)
end

And here are the benchmarks:

julia> using BenchmarkTools
julia> x = rand(10000, 2)
julia> v = rand(10000, 2)
julia> T = 1.0
julia> @btime toPolar!($x)
  232.784 μs (0 allocations: 0 bytes)
julia> @btime toCartesian!($x)
  101.394 μs (0 allocations: 0 bytes)
julia> @btime move!($x, $v, $T)
  14.378 μs (0 allocations: 0 bytes)
julia> @btime outerFunction!($x, $v, $T)
  612.012 μs (0 allocations: 0 bytes)

I would have expected the total timing to be approximately 233 + 101 + 14 = 350 μs

Any help is welcome!

I suspect it has to do with some of the data fitting into a CPU cache in the individual benchmarks, but only into a higher-level cache in the combined one. I can replicate the phenomenon on Julia 1.4, with multiple runs giving somewhat inconsistent timings.
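To put rough numbers on that hypothesis (the cache sizes below are illustrative assumptions, not measured values; check your own CPU), the working set is easy to compute:

```julia
x = rand(10_000, 2)
v = rand(10_000, 2)

# Each 10_000×2 Float64 array occupies 160_000 bytes (~156 KiB):
sizeof(x)

# An individual benchmark streams over one array (~156 KiB), which can fit in
# a typical 256 KiB L2 cache, while outerFunction! touches x and v together
# (~312 KiB combined) and may spill into a slower cache level.
sizeof(x) + sizeof(v)
```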

In any case, making x and v 10x larger resolves the inconsistency for me (you may have to increase it more if you have a recent desktop CPU; mine is a puny laptop CPU with little cache), e.g.

julia> @btime toPolar!($x)
  4.520 ms (0 allocations: 0 bytes)

julia> @btime toCartesian!($x)
  1.284 ms (0 allocations: 0 bytes)

julia> @btime move!($x, $v, $T)
  148.501 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  5.503 ms (0 allocations: 0 bytes)

What the relevant benchmark is depends on your data size, I guess.


Thanks for the quick answer! This is very interesting, CPUs are so complex… Yet I tried 10x (so 100 000×2 arrays) and it does not change anything on my machine (all timings are multiplied by 10).

julia> @btime toPolar!($x)
  2.472 ms (0 allocations: 0 bytes)

julia> @btime toCartesian!($x)
  933.475 μs (0 allocations: 0 bytes)

julia> @btime move!($x, $v, $T)
  143.459 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  5.947 ms (0 allocations: 0 bytes)

Trying 100x gives even weirder results (three times slower):

julia> @btime toPolar!($x)
  15.412 ms (0 allocations: 0 bytes)

julia> @btime toCartesian!($x)
  6.054 ms (0 allocations: 0 bytes)

julia> @btime move!($x, $v, $T)
  1.736 ms (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  60.322 ms (0 allocations: 0 bytes)

My machine:
MacBook Pro (Retina, 15-inch, Late 2013)
CPU: 2.3 GHz quad-core Intel Core i7
RAM: 16 GB 1600 MHz DDR3

I think some of this is caused by subnormal numbers. Try Base.set_zero_subnormals(true). But the runtime of some of these functions depends on the inputs anyway.
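To make that concrete, here is a minimal sketch of what subnormals are and what the flag does (whether results are actually flushed to zero depends on hardware support, so treat the flushed value as an assumption):

```julia
# Subnormal (denormal) floats lie below floatmin(Float64) ≈ 2.2e-308 and are
# typically handled by slow microcode paths in the FPU.
x = 1.0e-310
issubnormal(x)  # true

# Flush subnormals to zero for the current thread, if the hardware supports
# it; this trades strict IEEE semantics for speed.
Base.set_zero_subnormals(true)
2.0 * x  # 0.0 on CPUs with flush-to-zero support, instead of 2.0e-310
Base.set_zero_subnormals(false)
```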

julia> @btime toPolar!($x)
  21.843 μs (0 allocations: 0 bytes)

julia> @btime toCartesian!($x)
  36.984 μs (0 allocations: 0 bytes)

julia> @btime move!($x, $v, $T)
  10.899 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  396.177 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  390.866 μs (0 allocations: 0 bytes)

julia> x'
2×10000 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 162036.0       286048.0       218714.0      289972.0      195339.0      267310.0      48427.5      …  129643.0       44890.2      178622.0       150361.0      323757.0      332012.0
     -0.784077      -0.273904       2.78728      -1.11657      -2.45901      -2.97896      2.07389          0.495497      1.90864       0.602863       1.41974       1.85946       0.504327

julia> xorig = rand(10_000, 2);

julia> @btime outerFunction!(copyto!($x, $xorig), $v, $T)
  271.576 μs (0 allocations: 0 bytes)

julia> x'
2×10000 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 1.24306  1.12502   1.02377   1.06599   1.5638    1.56081   0.897093  0.923178  1.75841   0.929802  …  1.9735  1.60316   0.829732  0.937007  0.949524  1.06568   1.52295  0.975263  1.72918
 0.65567  0.539637  0.810454  0.539083  0.715913  0.878484  0.706034  0.251734  0.712464  0.498144     0.5842  0.202117  0.777766  0.883492  0.923131  0.803376  1.02457  0.753723  0.763108

Also, if you’re interested in micro-optimizing this, something you can try:

using LoopVectorization

function toPolar!(y, x)
    @avx for i = 1:size(x, 1)
        y[i, 1] = sqrt(abs2(x[i, 1]) + abs2(x[i, 2]))
        y[i, 2] = atan(x[i, 2], x[i, 1])
    end
end
toPolar!(x) = toPolar!(x, x)

function toCartesian!(y, x)
    @avx for i = 1:size(x, 1)
        y[i, 1] = x[i, 1] * cos(x[i, 2])
        y[i, 2] = x[i, 1] * sin(x[i, 2])
    end
end
toCartesian!(x) = toCartesian!(x, x)

function move!(x, v, T)
    @inbounds @simd ivdep for i = 1:size(x, 1)
        x[i, 1] += v[i, 1] * T
        x[i, 2] += v[i, 2] * T
    end
end

There’s a bug in LoopVectorization that I’m still trying to diagnose: it gives wrong answers when you load from and store to the same array in these examples, hence the out-of-place workaround above.

It does also make the benchmarks more convenient (not having to reset x), so you can avoid the Base.set_zero_subnormals(true):

julia> @btime toPolar!($x, $xorig)
  26.319 μs (0 allocations: 0 bytes)

julia> @btime toCartesian!($x, $xorig)
  14.751 μs (0 allocations: 0 bytes)

julia> @btime move!($x, $v, $T)
  2.634 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  45.793 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!(copyto!($x, $xorig), $v, $T)
  47.594 μs (0 allocations: 0 bytes)

Since these functions mutate their arguments, this benchmarking is wrong.

On my laptop

x = rand(10000, 2)
v = rand(10000, 2)

@btime toPolar!($x)
# 468.735 μs (0 allocations: 0 bytes)

@btime move!($x, $v, $T)
# 13.270 μs

# Here I use the same x, only it has already been mutated by the previous calls.
@btime toPolar!($x)
# 29.086 μs

In your original setup, the benchmark runs toPolar! multiple times, then toCartesian! multiple times, then move! multiple times. Since the arguments mutate, that is not the same as running the toCartesian!, move!, toPolar! sequence. You should use setup (https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#setup-and-teardown-phases)

x = rand(10000, 2)
v = rand(10000, 2)
T = 1.0

@btime toPolar!(x1) setup=(x1 = copy($x))
# 215.132 μs (0 allocations: 0 bytes)
@btime move!(x1, v1, T1) setup=(x1 = copy($x); v1 = copy($v); T1 = copy($T))
# 13.290 μs (0 allocations: 0 bytes)
@btime toCartesian!(x1) setup=(x1 = copy($x))
# 113.838 μs (0 allocations: 0 bytes)

@btime outerFunction!(x1, v1, T1) setup=(x1 = copy($x); v1 = copy($v); T1 = copy($T))
# 331.737 μs (0 allocations: 0 bytes)

Interesting. I thought these functions would execute in the same time whatever Array{Float64,2} they received as input. It seems that’s not the case… @Elrod This might be due to some subnormal numbers. @Skoffer Yeah, I also realized that my benchmark results depended on when I ran them. What you propose makes a lot of sense! This amazing setup thing is the right way to benchmark functions that mutate their arguments! This solved my issue.

@Elrod Amazing optimization tricks! I’ll definitely have a look at LoopVectorization.jl. Your code gave me an almost 3x overall performance improvement.

Thanks all, I learned a lot today.


FYI, remember that you often want to set evals=1 when using setup with mutating functions in BenchmarkTools. See the discussion in “BenchmarkTools setup isn’t run between each iteration?” and the manual section at https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#setup-and-teardown-phases
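For example (a minimal sketch with a stand-in mutating function, f!, rather than the thread’s toPolar!; the key part is the evals=1 at the end):

```julia
using BenchmarkTools

# A mutating function whose runtime and result depend on its current input.
f!(x) = (x .= sqrt.(x); x)

x = rand(10_000)

# evals=1 forces the setup expression to re-run before every single
# evaluation, so f! always sees a fresh copy of x instead of the
# result of its own previous call.
@btime f!(x1) setup=(x1 = copy($x)) evals=1
```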


@rdeits Just tried it, it works even better like that! Thanks.