Large performance regression of vectorized functions on master

While upgrading this package to Julia 0.7, I noticed quite a large performance regression in vectorized operations. This is the script I used for benchmarking:

using BenchmarkTools  # provides @benchmark

n = 12500

Ch = rand(n)
Sh = rand(n)
C2 = rand(n)
S2 = rand(n)
C  = rand(n)
S  = rand(n)

function test!(P, Ch, Sh, C2, S2, C, S, YY)
    tan_2ωτ = @. (S2 - 2 * S * C) / (C2 - (C * C - S * S))
    C2w = @. 1 / (sqrt(1 + tan_2ωτ * tan_2ωτ)) # = cos(2 * ωτ)
    S2w = @. tan_2ωτ * C2w # = sin(2 * ωτ)
    Cw  = @. sqrt((1 + C2w) / 2) # = cos(ωτ)
    Sw  = @. sign(S2w) * sqrt((1 - C2w) / 2) # = sin(ωτ)
    return P .= @. ((Ch * Cw + Sh * Sw) ^ 2 /
                    ((1 + C2 * C2w + S2 * S2w) / 2 - (C * Cw + S * Sw) ^ 2) +
                    (Sh * Cw - Ch * Sw) ^ 2 /
                    ((1 - C2 * C2w - S2 * S2w) / 2 - (S * Cw - C * Sw) ^ 2)) / YY
end

@benchmark test!(P, $Ch, $Sh, $C2, $S2, $C, $S, 3.14) setup=(P  = Vector{Float64}(n))

On Julia 0.6.4:

  memory estimate:  489.09 KiB
  allocs estimate:  15
  minimum time:     447.034 μs (0.00% GC)
  median time:      457.194 μs (0.00% GC)
  mean time:        477.651 μs (2.20% GC)
  maximum time:     1.305 ms (52.66% GC)
  samples:          10000
  evals/sample:     1

On master (updated yesterday):

  memory estimate:  491.78 KiB
  allocs estimate:  137
  minimum time:     750.140 μs (0.00% GC)
  median time:      765.811 μs (0.00% GC)
  mean time:        788.548 μs (2.37% GC)
  maximum time:     42.660 ms (98.12% GC)
  samples:          5519
  evals/sample:     1

julia> versioninfo()
Julia Version 0.7.0-beta2.98
Commit 77a4cb5a07 (2018-07-24 21:03 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

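This is a slowdown of more than 60% (minimum time 750 μs vs. 447 μs). The function also performs many more allocations (137 vs. 15), although the total memory allocated is nearly unchanged. It can be replicated with any value of n, including 1. Is this a known issue? I found other regressions in my package, but this one looks serious to me.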

Thanks, I looked for “performance” & “regression” tags on GitHub, it’s missing the regression one :sweat_smile:

That issue compares broadcasting with a for loop and doesn’t mention a regression in broadcasting itself vs 0.6. How is it related?

Does pulling the huge broadcast expression out into a separate scalar function, and then broadcasting that function, help?
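For concreteness, here is a minimal sketch of what that suggestion could look like. The names `kernel` and `test2!` are hypothetical (they are not in the original script), and whether this form avoids the slowdown on 0.7 has not been measured here; it only shows the restructuring.

```julia
# Hypothetical scalar kernel: the same arithmetic as test!, but for a single
# element, so no temporary arrays are created for the intermediate values.
function kernel(Ch, Sh, C2, S2, C, S, YY)
    tan_2ωτ = (S2 - 2 * S * C) / (C2 - (C * C - S * S))
    C2w = 1 / sqrt(1 + tan_2ωτ * tan_2ωτ)      # = cos(2 * ωτ)
    S2w = tan_2ωτ * C2w                        # = sin(2 * ωτ)
    Cw  = sqrt((1 + C2w) / 2)                  # = cos(ωτ)
    Sw  = sign(S2w) * sqrt((1 - C2w) / 2)      # = sin(ωτ)
    return ((Ch * Cw + Sh * Sw)^2 /
            ((1 + C2 * C2w + S2 * S2w) / 2 - (C * Cw + S * Sw)^2) +
            (Sh * Cw - Ch * Sw)^2 /
            ((1 - C2 * C2w - S2 * S2w) / 2 - (S * Cw - C * Sw)^2)) / YY
end

# Broadcast the scalar kernel over the input vectors, writing into P in place.
# YY is a scalar and is broadcast against the arrays.
test2!(P, Ch, Sh, C2, S2, C, S, YY) = P .= kernel.(Ch, Sh, C2, S2, C, S, YY)
```

It takes the same arguments as `test!`, so `test2!(P, Ch, Sh, C2, S2, C, S, 3.14)` can be dropped into the same `@benchmark` call for comparison.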

Because it’s a regression from v0.6. It was found because this speed difference from a loop was not as pronounced in v0.6, so you can either see it as “broadcast is way worse than for loops on v0.7” or “broadcast is way worse on v0.7 than v0.6”: it’s the same thing.
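To make the two views of the comparison concrete, here is a rough sketch of the first step of `test!` written both ways. The function names `tan_bcast!` and `tan_loop!` are made up for illustration, and this is not the benchmark from the linked issue; it only shows the kind of broadcast/loop pair that can be timed on each Julia version.

```julia
# Broadcast version: @. turns every call and assignment into its dotted form,
# so the whole right-hand side is fused and written into `out` in one pass.
tan_bcast!(out, S2, C2, S, C) = @. out = (S2 - 2 * S * C) / (C2 - (C * C - S * S))

# Loop version: the same arithmetic written element by element.
function tan_loop!(out, S2, C2, S, C)
    @inbounds for i in eachindex(out, S2, C2, S, C)
        out[i] = (S2[i] - 2 * S[i] * C[i]) / (C2[i] - (C[i] * C[i] - S[i] * S[i]))
    end
    return out
end
```

Timing both functions on 0.6 and on 0.7 separates the two questions: the gap between them on one version is the "broadcast vs. loop" view, and the change in the broadcast timing across versions is the "0.7 vs. 0.6" view.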

And for reference, it was first discussed as a regression from v0.6:

(Gitter archives are fantastic and we should consider them for the Slack :smile:)