Huge Performance drop after upgrading to 1.6

Crossposting this from Zulip to reach more people and because it’s getting a bit long.

tl;dr
After upgrading to 1.6 my code runs 50 times slower. It seems to be caused by broadcasting, with calls to materialize increasing to about 140x their number in 1.5.4.

Nikolai: I just restarted my development Julia REPL for the first time since updating to 1.6.0, and suddenly my code runs abysmally slowly.
Luckily I have another REPL open with 1.5.4, so I was able to time the same code under both versions. 1.6.0 is close to 50 times slower.

I’ll do some profiling and try to figure out what has gotten so much slower. In the meantime, if anyone has ideas about what might have happened, or other advice, please let me know.
(It’s particle/kernel based ML code, most of the heavy work is differentiation and matrix inversion)

Nikolai: _methods_by_ftype seems to be taking up all of the time

Nikolai: My function grad_log_likelihood apparently calls materialize, which has something to do with broadcasting, and that is what takes up all the time

Nikolai: Comparing the profiler output between the versions: 1.5.4 makes 241 calls to materialize from grad_log_likelihood, while 1.6.0 makes 34678 calls for the same code

Nikolai: Broadcasting seems to be a lot worse in 1.6. Running

using BenchmarkTools

n_samples = 10  # matches the 10-element vectors in the output below
t = rand(n_samples)
@benchmark t .- t
t = rand(1, n_samples)
@benchmark t .- t

In 1.6 produces

julia> t = rand( n_samples)
10-element Vector{Float64}:
 0.09906767837656405
 0.47702497557296364
 0.27244814554430064
 0.5818925794685812
 0.5914572676458125
 0.8520456245803281
 0.39588600485727743
 0.37298406444837795
 0.023069872140623504
 0.802804963181807

julia> @benchmark t .- t
BenchmarkTools.Trial:
  memory estimate:  224 bytes
  allocs estimate:  3
  --------------
  minimum time:     388.323 ns (0.00% GC)
  median time:      490.104 ns (0.00% GC)
  mean time:        515.128 ns (3.45% GC)
  maximum time:     36.904 μs (98.37% GC)
  --------------
  samples:          10000
  evals/sample:     198

julia> t = rand(1, n_samples)
1×10 Matrix{Float64}:
 0.43367  0.665772  0.492303  0.878866  …  0.570874  0.199838  0.0904479

julia> @benchmark t .- t
BenchmarkTools.Trial:
  memory estimate:  224 bytes
  allocs estimate:  3
  --------------
  minimum time:     731.252 ns (0.00% GC)
  median time:      921.413 ns (0.00% GC)
  mean time:        942.107 ns (1.59% GC)
  maximum time:     51.177 μs (98.02% GC)
  --------------
  samples:          10000
  evals/sample:     143

while in 1.5.4 it gives

julia> t = rand( n_samples)
10-element Array{Float64,1}:
 0.2734235025313967
 0.2527912981883662
 0.5783187947367492
 0.9678137280948695
 0.5371453605927523
 0.21785488024442845
 0.440081324980073
 0.0990831893972055
 0.8702855813288388
 0.4049051169846596

julia> @benchmark t .- t
BenchmarkTools.Trial:
  memory estimate:  224 bytes
  allocs estimate:  3
  --------------
  minimum time:     301.141 ns (0.00% GC)
  median time:      380.618 ns (0.00% GC)
  mean time:        397.128 ns (3.62% GC)
  maximum time:     73.303 μs (99.23% GC)
  --------------
  samples:          10000
  evals/sample:     262

julia> t = rand(1, n_samples)
1×10 Array{Float64,2}:
 0.240928  0.592313  0.258389  0.933685  …  0.611324  0.361393  0.234236

julia> @benchmark t .- t
BenchmarkTools.Trial:
  memory estimate:  224 bytes
  allocs estimate:  3
  --------------
  minimum time:     345.243 ns (0.00% GC)
  median time:      436.333 ns (0.00% GC)
  mean time:        461.008 ns (3.85% GC)
  maximum time:     89.437 μs (99.35% GC)
  --------------
  samples:          10000
  evals/sample:     230

So for a column vector there is a slight performance difference, but for a row vector it’s almost a factor of two.
(It does seem to be overhead; the difference is a lot smaller for longer vectors.)

You’re benchmarking it wrong.

Julia v1.5.4:

julia> using BenchmarkTools

julia> t = rand(10);

julia> @benchmark $t .- $t
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     34.528 ns (0.00% GC)
  median time:      35.873 ns (0.00% GC)
  mean time:        41.562 ns (2.70% GC)
  maximum time:     542.761 ns (87.48% GC)
  --------------
  samples:          10000
  evals/sample:     993

Julia v1.6.0:

julia> using BenchmarkTools

julia> t = rand(10);

julia> @benchmark $t .- $t
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     35.194 ns (0.00% GC)
  median time:      39.461 ns (0.00% GC)
  mean time:        46.844 ns (2.67% GC)
  maximum time:     1.026 μs (91.90% GC)
  --------------
  samples:          10000
  evals/sample:     992

You need to interpolate global variables into the benchmark with $.
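
To illustrate the difference (a minimal sketch; the variable names follow the snippets above):

```julia
using BenchmarkTools

t = rand(10)

# Without `$`, `t` is a non-constant global: every sample pays for a
# dynamic lookup of its type, which inflates the measured time.
@benchmark t .- t

# With `$`, the value of `t` is spliced into the benchmark body as a
# local, so only the broadcast itself is measured.
@benchmark $t .- $t
```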


Thanks, I didn’t know that.
In your benchmark 1.6 is still notably slower, though.
So it’s probably not just broadcasting then; in any case, that isn’t really the main problem.

Is a less-than-2-percent slowdown “notably slower”? That’s well within noise; if I repeat the benchmarks I get

# julia v1.5.4
julia> @benchmark $t .- $t
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     33.160 ns (0.00% GC)
  median time:      34.131 ns (0.00% GC)
  mean time:        37.565 ns (2.95% GC)
  maximum time:     532.654 ns (89.63% GC)
  --------------
  samples:          10000
  evals/sample:     994

# Julia v1.6.0
julia> @benchmark $t .- $t
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     33.948 ns (0.00% GC)
  median time:      37.906 ns (0.00% GC)
  mean time:        41.071 ns (2.72% GC)
  maximum time:     548.285 ns (84.47% GC)
  --------------
  samples:          10000
  evals/sample:     993

Not sure where you get <2%. Both mean and median times are ~10% longer in your 1.6 benchmark.
Anyway, not the point.

From the minimum time, which is what BenchmarkTools cares about, as detailed in their paper.
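
BenchmarkTools exposes this directly; a sketch of comparing two trials by their minimum estimates (minimum and judge are part of the BenchmarkTools API; the trial names for the two Julia versions are placeholders):

```julia
using BenchmarkTools

t = rand(10)
trial = @benchmark $t .- $t

# BenchmarkTools treats the minimum as the best estimate of the true
# cost: noise from GC, the OS scheduler, etc. can only add time.
best = minimum(trial)

# To compare two trials (e.g. one per Julia version, collected in
# separate sessions), judge their minimum estimates:
# judge(minimum(trial_160), minimum(trial_154))
```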


Here are screenshots of the output from StatProfilerHTML

Seeing that stacktrace I have a hunch that this might be the same type of regression as in Performance regression when recursively traversing nested types. · Issue #40302 · JuliaLang/julia · GitHub. In that case inference: SCC handling missing for return_type_tfunc (#39375) · JuliaLang/julia@fd8f97e · GitHub had a bad impact on performance. You might want to try reverting that and see if it helps.


A hell of a hunch! This was it.
I checked out tag 1.6.0, reverted the commit you linked, and performance returned to what it had been in 1.5.4.


:cool:

I was able to get my code running at normal speeds in 1.6 by removing a function call that returned an anonymous function.

As far as I understood from those links, the problem was unknown return types of functions.
But since the offending commit was actually a fix for something else, does that mean this is a kind of lucky bug, where breaking one thing allowed another to run fast?
Should I create an MWE?
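
For context, the pattern I removed looked roughly like this (names are hypothetical; this is a reconstruction from the description above, not the actual code):

```julia
# Hypothetical sketch of the problematic pattern: a function that
# builds and returns an anonymous function. The closure's concrete
# type can be hard for inference to pin down at the call site, which
# is the kind of situation the linked issues describe.
struct Kernel
    bandwidth::Float64
end

# Returns a closure; the caller sees an opaque function object.
make_grad(k::Kernel) = x -> -x / k.bandwidth^2

# Workaround that restored normal speed for me: call a plain,
# named function directly instead of going through a returned closure.
grad(k::Kernel, x) = -x / k.bandwidth^2
```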

Could be, I am not very familiar with the code that commit touches.

I think that would be useful indeed.


Created an issue for this here: Performance regression returning anonymous function from struct · Issue #40606 · JuliaLang/julia · GitHub


Thanks!