Large performance regression of vectorized functions on master

giordano · July 25, 2018, 8:49pm

While upgrading this package to Julia 0.7, I noticed quite a large performance regression in vectorized operations. This is the script I used for benchmarking:

using BenchmarkTools

n = 12500

Ch = rand(n)
Sh = rand(n)
C2 = rand(n)
S2 = rand(n)
C  = rand(n)
S  = rand(n)

function test!(P, Ch, Sh, C2, S2, C, S, YY)
    tan_2ωτ = @. (S2 - 2 * S * C) / (C2 - (C * C - S * S))
    C2w = @. 1 / (sqrt(1 + tan_2ωτ * tan_2ωτ)) # = cos(2 * ωτ)
    S2w = @. tan_2ωτ * C2w # = sin(2 * ωτ)
    Cw  = @. sqrt((1 + C2w) / 2) # = cos(ωτ)
    Sw  = @. sign(S2w) * sqrt((1 - C2w) / 2) # = sin(ωτ)
    return P .= @. ((Ch * Cw + Sh * Sw) ^ 2 /
                    ((1 + C2 * C2w + S2 * S2w) / 2 - (C * Cw + S * Sw) ^ 2) +
                    (Sh * Cw - Ch * Sw) ^ 2 /
                    ((1 - C2 * C2w - S2 * S2w) / 2 - (S * Cw - C * Sw) ^ 2)) / YY
end

@benchmark test!(P, $Ch, $Sh, $C2, $S2, $C, $S, 3.14) setup=(P  = Vector{Float64}(n))

On Julia 0.6.4:

BenchmarkTools.Trial: 
  memory estimate:  489.09 KiB
  allocs estimate:  15
  --------------
  minimum time:     447.034 μs (0.00% GC)
  median time:      457.194 μs (0.00% GC)
  mean time:        477.651 μs (2.20% GC)
  maximum time:     1.305 ms (52.66% GC)
  --------------
  samples:          10000
  evals/sample:     1

On master (updated yesterday):

BenchmarkTools.Trial: 
  memory estimate:  491.78 KiB
  allocs estimate:  137
  --------------
  minimum time:     750.140 μs (0.00% GC)
  median time:      765.811 μs (0.00% GC)
  mean time:        788.548 μs (2.37% GC)
  maximum time:     42.660 ms (98.12% GC)
  --------------
  samples:          5519
  evals/sample:     1

julia> versioninfo()
Julia Version 0.7.0-beta2.98
Commit 77a4cb5a07 (2018-07-24 21:03 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

This is a slowdown of more than 60%. The function also makes many more allocations (137 versus 15), although the memory estimate is nearly unchanged. The regression can be reproduced with any value of n, including 1. Is this a known issue? I found other regressions in my package, but this one looks serious to me.

ChrisRackauckas · July 25, 2018, 8:51pm

https://github.com/JuliaLang/julia/issues/28126

giordano · July 25, 2018, 8:54pm

Thanks. I looked for the “performance” and “regression” tags on GitHub, but that issue is missing the regression one.

kristoffer.carlsson · July 25, 2018, 9:12pm

That issue compares broadcasting with a for loop and doesn’t mention a regression in broadcasting itself vs 0.6. How is it related?

Does it help to pull the huge broadcast expression out into a separate scalar function and broadcast that function instead?
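
A minimal sketch of that suggestion, assuming Julia 0.7 or later (for `undef`). The names `periodogram_point` and `test_kernel!` are hypothetical and not from the original script, and whether this form actually avoids the regression is not established in this thread. The scalar function holds all the arithmetic for one element, and a single broadcast call applies it across the arrays:

```julia
# Scalar kernel: the same arithmetic as test!, written for one element.
# (periodogram_point is a hypothetical name used only for illustration.)
function periodogram_point(Ch, Sh, C2, S2, C, S, YY)
    tan_2ωτ = (S2 - 2 * S * C) / (C2 - (C * C - S * S))
    C2w = 1 / sqrt(1 + tan_2ωτ * tan_2ωτ)    # = cos(2 * ωτ)
    S2w = tan_2ωτ * C2w                       # = sin(2 * ωτ)
    Cw  = sqrt((1 + C2w) / 2)                 # = cos(ωτ)
    Sw  = sign(S2w) * sqrt((1 - C2w) / 2)     # = sin(ωτ)
    return ((Ch * Cw + Sh * Sw)^2 /
            ((1 + C2 * C2w + S2 * S2w) / 2 - (C * Cw + S * Sw)^2) +
            (Sh * Cw - Ch * Sw)^2 /
            ((1 - C2 * C2w - S2 * S2w) / 2 - (S * Cw - C * Sw)^2)) / YY
end

# One broadcast over the scalar kernel, writing into P in place.
# The scalar YY is broadcast alongside the arrays.
test_kernel!(P, Ch, Sh, C2, S2, C, S, YY) =
    P .= periodogram_point.(Ch, Sh, C2, S2, C, S, YY)

n = 12500
Ch, Sh, C2, S2, C, S = rand(n), rand(n), rand(n), rand(n), rand(n), rand(n)
P = Vector{Float64}(undef, n)   # Julia 0.7+ syntax for an uninitialized vector
test_kernel!(P, Ch, Sh, C2, S2, C, S, 3.14)
```

Unlike the original `test!`, this version creates no intermediate arrays for `tan_2ωτ`, `C2w`, `S2w`, `Cw`, and `Sw`, so it is not a like-for-like timing comparison; it only isolates the cost of one broadcast call over a simple function.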

ChrisRackauckas · July 25, 2018, 9:15pm

Because it’s a regression from v0.6. It was found because the speed difference between broadcast and a loop was not as pronounced in v0.6, so you can see it either as “broadcast is way worse than for loops on v0.7” or as “broadcast is way worse on v0.7 than v0.6”: it’s the same thing.

And for reference, it was first discussed as a regression from v0.6:

https://gitter.im/JuliaDiffEq/Lobby?at=5b4b18b4582aaa63076c2c14

(Gitter archives are fantastic and we should consider them for the Slack.)
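
For readers who want to reproduce the broadcast-versus-loop comparison on their own Julia version, here is a rough sketch using only Base. The kernel `f` and the functions `bcast!` and `loop!` are hypothetical and deliberately simpler than the expression in the original post. `@elapsed` gives a single coarse timing, so BenchmarkTools remains the better tool for real measurements.

```julia
# Hypothetical scalar kernel, much simpler than the one in the original post.
f(a, b) = sqrt(a * a + b * b) / (1 + a)

# Broadcast version: one fused, in-place broadcast.
bcast!(out, a, b) = out .= f.(a, b)

# Explicit loop version computing the same values.
function loop!(out, a, b)
    # eachindex with several arrays checks that their indices match,
    # which is what makes @inbounds safe here.
    @inbounds for i in eachindex(out, a, b)
        out[i] = f(a[i], b[i])
    end
    return out
end

a, b = rand(12500), rand(12500)
out1, out2 = similar(a), similar(a)

# Call each function once first so compilation time is not measured.
bcast!(out1, a, b)
loop!(out2, a, b)

t_bcast = @elapsed bcast!(out1, a, b)
t_loop  = @elapsed loop!(out2, a, b)
println("broadcast: ", t_bcast, " s, loop: ", t_loop, " s")
```

Running the same script on two Julia versions shows both views of the problem at once: the ratio of the two timings within one version, and the change in the broadcast timing across versions.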
