Running small functions bundled in an outer function taking twice the time of running them separately

I’m working on optimizing some code. I isolated small functions, benchmarked them, and was happy with their speed. But when I bundle those functions in an outer function, the timing is almost twice the sum of the isolated functions… Any hints on what’s wrong?

Here is a simplified version of my code:

function toPolar!(x)
    for i = 1:size(x, 1)
        x[i, 1], x[i, 2] = sqrt(x[i, 1]^2 + x[i, 2]^2), atan(x[i, 2], x[i, 1])
    end
end

function toCartesian!(x)
    for i = 1:size(x, 1)
        x[i, 1], x[i, 2] = x[i, 1] * cos(x[i, 2]), x[i, 1] * sin(x[i, 2])
    end
end

function move!(x, v, T)
    for i = 1:size(x, 1)
        x[i, 1] += v[i, 1] * T
        x[i, 2] += v[i, 2] * T
    end
end

function outerFunction!(x, v, T)
    toCartesian!(x)
    move!(x, v, T)
    toPolar!(x)
end

And here are the benchmarks:

julia> using BenchmarkTools
julia> x = rand(10000, 2)
julia> v = rand(10000, 2)
julia> T = 1.0
julia> @btime toPolar!($x)
  232.784 μs (0 allocations: 0 bytes)
julia> @btime toCartesian!($x)
  101.394 μs (0 allocations: 0 bytes)
julia> @btime move!($x, $v, $T)
  14.378 μs (0 allocations: 0 bytes)
julia> @btime outerFunction!($x, $v, $T)
  612.012 μs (0 allocations: 0 bytes)

I would have expected the total timing to be approximately 233 + 101 + 14 = 350 μs

Any help is welcome!

I suspect it has to do with some of the data fitting into a CPU cache in the individual benchmarks, but only into a higher-level cache in the combined one. I can replicate the phenomenon on Julia 1.4, with multiple runs giving somewhat inconsistent timings.
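To put rough numbers on that hypothesis (the cache sizes below are illustrative assumptions, not measured values; check your own CPU), the working set is easy to compute:

```julia
x = rand(10_000, 2)
v = rand(10_000, 2)

# Each 10_000×2 Float64 array occupies 160_000 bytes (~156 KiB):
sizeof(x)

# An individual benchmark streams over one array (~156 KiB), which can fit in
# a typical 256 KiB L2 cache, while outerFunction! touches x and v together
# (~312 KiB combined) and may spill into a slower cache level.
sizeof(x) + sizeof(v)
```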

In any case, making x and v 10x larger resolves the inconsistency for me (you may have to increase it more if you have a recent desktop CPU; mine is a puny laptop CPU with little cache), e.g.

julia> @btime toPolar!($x)
  4.520 ms (0 allocations: 0 bytes)

julia> @btime toCartesian!($x)
  1.284 ms (0 allocations: 0 bytes)

julia> @btime move!($x, $v, $T)
  148.501 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  5.503 ms (0 allocations: 0 bytes)

What the relevant benchmark is depends on your data size, I guess.


Thanks for the quick answer! This is very interesting, CPUs are so complex… Yet I tried 10x (so 100 000×2 arrays) and it does not change anything on my machine (all timings are multiplied by 10).

julia> @btime toPolar!($x)
  2.472 ms (0 allocations: 0 bytes)

julia> @btime toCartesian!($x)
  933.475 μs (0 allocations: 0 bytes)

julia> @btime move!($x, $v, $T)
  143.459 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  5.947 ms (0 allocations: 0 bytes)

Trying 100x gives even weirder results (three times slower):

julia> @btime toPolar!($x)
  15.412 ms (0 allocations: 0 bytes)

julia> @btime toCartesian!($x)
  6.054 ms (0 allocations: 0 bytes)

julia> @btime move!($x, $v, $T)
  1.736 ms (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  60.322 ms (0 allocations: 0 bytes)

My machine:
MacBook Pro (Retina, 15-inch, Late 2013)
CPU: 2.3 GHz quad-core Intel Core i7
RAM: 16 GB 1600 MHz DDR3

I think some of this is caused by subnormal numbers. Try Base.set_zero_subnormals(true). But the runtime of some of these functions depends on the inputs anyway.
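To make that concrete, here is a minimal sketch of what subnormals are and what the flag does (whether results are actually flushed to zero depends on hardware support, so treat the flushed value as an assumption):

```julia
# Subnormal (denormal) floats lie below floatmin(Float64) ≈ 2.2e-308 and are
# typically handled by slow microcode paths in the FPU.
x = 1.0e-310
issubnormal(x)  # true

# Flush subnormals to zero for the current thread, if the hardware supports
# it; this trades strict IEEE semantics for speed.
Base.set_zero_subnormals(true)
2.0 * x  # 0.0 on CPUs with flush-to-zero support, instead of 2.0e-310
Base.set_zero_subnormals(false)
```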

julia> @btime toPolar!($x)
  21.843 μs (0 allocations: 0 bytes)

julia> @btime toCartesian!($x)
  36.984 μs (0 allocations: 0 bytes)

julia> @btime move!($x, $v, $T)
  10.899 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  396.177 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  390.866 μs (0 allocations: 0 bytes)

julia> x'
2×10000 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 162036.0       286048.0       218714.0      289972.0      195339.0      267310.0      48427.5      …  129643.0       44890.2      178622.0       150361.0      323757.0      332012.0
     -0.784077      -0.273904       2.78728      -1.11657      -2.45901      -2.97896      2.07389          0.495497      1.90864       0.602863       1.41974       1.85946       0.504327

julia> xorig = rand(10_000, 2);

julia> @btime outerFunction!(copyto!($x, $xorig), $v, $T)
  271.576 μs (0 allocations: 0 bytes)

julia> x'
2×10000 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 1.24306  1.12502   1.02377   1.06599   1.5638    1.56081   0.897093  0.923178  1.75841   0.929802  …  1.9735  1.60316   0.829732  0.937007  0.949524  1.06568   1.52295  0.975263  1.72918
 0.65567  0.539637  0.810454  0.539083  0.715913  0.878484  0.706034  0.251734  0.712464  0.498144     0.5842  0.202117  0.777766  0.883492  0.923131  0.803376  1.02457  0.753723  0.763108

Also, if you’re interested in micro-optimizing this, something you can try:

using LoopVectorization

function toPolar!(y, x)
    @avx for i = 1:size(x, 1)
        y[i, 1] = sqrt(abs2(x[i, 1]) + abs2(x[i, 2]))
        y[i, 2] = atan(x[i, 2], x[i, 1])
    end
end
toPolar!(x) = toPolar!(x, x)

function toCartesian!(y, x)
    @avx for i = 1:size(x, 1)
        y[i, 1] = x[i, 1] * cos(x[i, 2])
        y[i, 2] = x[i, 1] * sin(x[i, 2])
    end
end
toCartesian!(x) = toCartesian!(x, x)

function move!(x, v, T)
    @inbounds @simd ivdep for i = 1:size(x, 1)
        x[i, 1] += v[i, 1] * T
        x[i, 2] += v[i, 2] * T
    end
end

There’s a bug in LoopVectorization that I’m still trying to diagnose: it gives wrong answers when you load from and store to the same array in these examples, hence the out-of-place workaround above.

It does also make the benchmarks more convenient (not having to reset x), so you can avoid the Base.set_zero_subnormals(true):

julia> @btime toPolar!($x, $xorig)
  26.319 μs (0 allocations: 0 bytes)

julia> @btime toCartesian!($x, $xorig)
  14.751 μs (0 allocations: 0 bytes)

julia> @btime move!($x, $v, $T)
  2.634 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!($x, $v, $T)
  45.793 μs (0 allocations: 0 bytes)

julia> @btime outerFunction!(copyto!($x, $xorig), $v, $T)
  47.594 μs (0 allocations: 0 bytes)

Since these functions mutate their arguments, this benchmarking is wrong.

On my laptop

x = rand(10000, 2)
v = rand(10000, 2)

@btime toPolar!($x)
# 468.735 μs (0 allocations: 0 bytes)

@btime move!($x, $v, $T)
# 13.270 μs

# Here I use the same x, only it has already been mutated by the previous calls.
@btime toPolar!($x)
# 29.086 μs

In your original setup, the benchmark runs toPolar! multiple times, then toCartesian! multiple times, then move! multiple times. Since the arguments mutate, that is not the same as running the toCartesian!, move!, toPolar! sequence. You should use setup (https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#setup-and-teardown-phases)

x = rand(10000, 2)
v = rand(10000, 2)
T = 1.0

@btime toPolar!(x1) setup=(x1 = copy($x))
# 215.132 μs (0 allocations: 0 bytes)
@btime move!(x1, v1, T1) setup=(x1 = copy($x); v1 = copy($v); T1 = copy($T))
# 13.290 μs (0 allocations: 0 bytes)
@btime toCartesian!(x1) setup=(x1 = copy($x))
# 113.838 μs (0 allocations: 0 bytes)

@btime outerFunction!(x1, v1, T1) setup=(x1 = copy($x); v1 = copy($v); T1 = copy($T))
# 331.737 μs (0 allocations: 0 bytes)

Interesting. I thought these functions would execute in the same time whatever Array{Float64,2} they received as input. It seems that’s not the case… @Elrod This might be due to some subnormal numbers. @Skoffer Yeah, I also realized that my benchmark results depended on when I ran them. What you propose makes a lot of sense! This amazing setup thing is the right way to benchmark functions that mutate their arguments! This solved my issue.

@Elrod Amazing optimization tricks! I’ll definitely have a look at LoopVectorization.jl. Your code gave me an almost 3x overall performance improvement.

Thanks all, I learned a lot today.


FYI, remember that you often want to set evals=1 when using setup with mutating functions in BenchmarkTools. See the discussion in “BenchmarkTools setup isn’t run between each iteration?” and the manual section at https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#setup-and-teardown-phases
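For example (a minimal sketch with a stand-in mutating function, f!, rather than the thread’s toPolar!; the key part is the evals=1 at the end):

```julia
using BenchmarkTools

# A mutating function whose runtime and result depend on its current input.
f!(x) = (x .= sqrt.(x); x)

x = rand(10_000)

# evals=1 forces the setup expression to re-run before every single
# evaluation, so f! always sees a fresh copy of x instead of the
# result of its own previous call.
@btime f!(x1) setup=(x1 = copy($x)) evals=1
```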


@rdeits Just tried it, it works even better like that! Thanks.