Huge Performance drop after upgrading to 1.6

Crossposting this from Zulip to reach more people and because it’s getting a bit long.

tl;dr
After upgrading to 1.6 my code runs 50 times slower. It seems to be caused by broadcasting, with calls to materialize increasing to about 140x their number in 1.5.4.

Nikolai: I just restarted my development Julia REPL for the first time since updating to 1.6.0, and suddenly my code runs abysmally slowly.
Luckily I have another REPL open with 1.5.4, so I was able to time the same code under both versions. 1.6.0 is close to 50 times slower.

I’ll do some profiling and try to figure out what has gotten so much slower. In the meantime, if anyone has ideas about what might have happened, or other advice, please let me know.
(It’s particle/kernel based ML code, most of the heavy work is differentiation and matrix inversion)

Nikolai: _methods_by_ftype seems to be taking up all of the time

Nikolai: My function grad_log_likelihood apparently calls materialize, which has something to do with broadcasting, and that is what takes up all the time

Nikolai: Comparing the profiler output between the versions: 1.5.4 makes 241 calls to materialize from grad_log_likelihood, while 1.6.0 makes 34678 calls for the same code

Nikolai: Broadcasting seems to be a lot worse in 1.6. Running

using BenchmarkTools

n_samples = 10  # matches the 10-element vectors in the output below
t = rand(n_samples)
@benchmark t .- t
t = rand(1, n_samples)
@benchmark t .- t

In 1.6 produces

julia> t = rand( n_samples)
10-element Vector{Float64}:
 0.09906767837656405
 0.47702497557296364
 0.27244814554430064
 0.5818925794685812
 0.5914572676458125
 0.8520456245803281
 0.39588600485727743
 0.37298406444837795
 0.023069872140623504
 0.802804963181807

julia> @benchmark t .- t
BenchmarkTools.Trial:
  memory estimate:  224 bytes
  allocs estimate:  3
  --------------
  minimum time:     388.323 ns (0.00% GC)
  median time:      490.104 ns (0.00% GC)
  mean time:        515.128 ns (3.45% GC)
  maximum time:     36.904 μs (98.37% GC)
  --------------
  samples:          10000
  evals/sample:     198

julia> t = rand(1, n_samples)
1×10 Matrix{Float64}:
 0.43367  0.665772  0.492303  0.878866  …  0.570874  0.199838  0.0904479

julia> @benchmark t .- t
BenchmarkTools.Trial:
  memory estimate:  224 bytes
  allocs estimate:  3
  --------------
  minimum time:     731.252 ns (0.00% GC)
  median time:      921.413 ns (0.00% GC)
  mean time:        942.107 ns (1.59% GC)
  maximum time:     51.177 μs (98.02% GC)
  --------------
  samples:          10000
  evals/sample:     143

while in 1.5.4 it gives

julia> t = rand( n_samples)
10-element Array{Float64,1}:
 0.2734235025313967
 0.2527912981883662
 0.5783187947367492
 0.9678137280948695
 0.5371453605927523
 0.21785488024442845
 0.440081324980073
 0.0990831893972055
 0.8702855813288388
 0.4049051169846596

julia> @benchmark t .- t
BenchmarkTools.Trial:
  memory estimate:  224 bytes
  allocs estimate:  3
  --------------
  minimum time:     301.141 ns (0.00% GC)
  median time:      380.618 ns (0.00% GC)
  mean time:        397.128 ns (3.62% GC)
  maximum time:     73.303 μs (99.23% GC)
  --------------
  samples:          10000
  evals/sample:     262

julia> t = rand(1, n_samples)
1×10 Array{Float64,2}:
 0.240928  0.592313  0.258389  0.933685  …  0.611324  0.361393  0.234236

julia> @benchmark t .- t
BenchmarkTools.Trial:
  memory estimate:  224 bytes
  allocs estimate:  3
  --------------
  minimum time:     345.243 ns (0.00% GC)
  median time:      436.333 ns (0.00% GC)
  mean time:        461.008 ns (3.85% GC)
  maximum time:     89.437 μs (99.35% GC)
  --------------
  samples:          10000
  evals/sample:     230

So for a column vector there is a slight performance difference, but for a row vector it’s almost a factor of two.
(It does seem to be overhead; the difference is a lot smaller for longer vectors.)

You’re benchmarking it wrong.

Julia v1.5.4:

julia> using BenchmarkTools

julia> t = rand(10);

julia> @benchmark $t .- $t
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     34.528 ns (0.00% GC)
  median time:      35.873 ns (0.00% GC)
  mean time:        41.562 ns (2.70% GC)
  maximum time:     542.761 ns (87.48% GC)
  --------------
  samples:          10000
  evals/sample:     993

Julia v1.6.0:

julia> using BenchmarkTools

julia> t = rand(10);

julia> @benchmark $t .- $t
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     35.194 ns (0.00% GC)
  median time:      39.461 ns (0.00% GC)
  mean time:        46.844 ns (2.67% GC)
  maximum time:     1.026 μs (91.90% GC)
  --------------
  samples:          10000
  evals/sample:     992

You need to interpolate global variables into the benchmark with $.
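
To illustrate the difference (a minimal sketch; the variable names follow the snippets above):

```julia
using BenchmarkTools

t = rand(10)

# Without `$`, `t` is a non-constant global: every sample pays for a
# dynamic lookup of its type, which inflates the measured time.
@benchmark t .- t

# With `$`, the value of `t` is spliced into the benchmark body as a
# local, so only the broadcast itself is measured.
@benchmark $t .- $t
```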


Thanks, I didn’t know that.
In your benchmark 1.6 is still notably slower, though.
So it’s probably not just broadcasting then; in any case, that isn’t really the main problem.

Is a less-than-2-percent slowdown “notably slower”? That’s well within noise; if I repeat the benchmarks I get

# julia v1.5.4
julia> @benchmark $t .- $t
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     33.160 ns (0.00% GC)
  median time:      34.131 ns (0.00% GC)
  mean time:        37.565 ns (2.95% GC)
  maximum time:     532.654 ns (89.63% GC)
  --------------
  samples:          10000
  evals/sample:     994

# Julia v1.6.0
julia> @benchmark $t .- $t
BenchmarkTools.Trial: 
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     33.948 ns (0.00% GC)
  median time:      37.906 ns (0.00% GC)
  mean time:        41.071 ns (2.72% GC)
  maximum time:     548.285 ns (84.47% GC)
  --------------
  samples:          10000
  evals/sample:     993

Not sure where you get <2%. Both mean and median times are ~10% longer in your 1.6 benchmark.
Anyway, not the point.

From the minimum time, which is what BenchmarkTools cares about, as detailed in their paper.
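
BenchmarkTools exposes this directly; a sketch of comparing two trials by their minimum estimates (minimum and judge are part of the BenchmarkTools API; the trial names for the two Julia versions are placeholders):

```julia
using BenchmarkTools

t = rand(10)
trial = @benchmark $t .- $t

# BenchmarkTools treats the minimum as the best estimate of the true
# cost: noise from GC, the OS scheduler, etc. can only add time.
best = minimum(trial)

# To compare two trials (e.g. one per Julia version, collected in
# separate sessions), judge their minimum estimates:
# judge(minimum(trial_160), minimum(trial_154))
```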


Here are screenshots of the output from StatProfilerHTML

Seeing that stacktrace I have a hunch that this might be the same type of regression as in Performance regression when recursively traversing nested types. · Issue #40302 · JuliaLang/julia · GitHub. In that case inference: SCC handling missing for return_type_tfunc (#39375) · JuliaLang/julia@fd8f97e · GitHub had a bad impact on performance. You might want to try reverting that and see if it helps.


A hell of a hunch! This was it.
I checked out tag 1.6.0, reverted the commit you linked, and performance returned to what it had been in 1.5.4.


:cool:

I was able to get my code running at normal speeds in 1.6 by removing a function call that returned an anonymous function.

As far as I understood from those links, the problem was unknown return types of functions.
But since the offending commit was actually a fix for something else, does that mean this is a kind of lucky bug, where breaking one thing allowed another to run fast?
Should I create an MWE?
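
For context, the pattern I removed looked roughly like this (names are hypothetical; this is a reconstruction from the description above, not the actual code):

```julia
# Hypothetical sketch of the problematic pattern: a function that
# builds and returns an anonymous function. The closure's concrete
# type can be hard for inference to pin down at the call site, which
# is the kind of situation the linked issues describe.
struct Kernel
    bandwidth::Float64
end

# Returns a closure; the caller sees an opaque function object.
make_grad(k::Kernel) = x -> -x / k.bandwidth^2

# Workaround that restored normal speed for me: call a plain,
# named function directly instead of going through a returned closure.
grad(k::Kernel, x) = -x / k.bandwidth^2
```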

Could be, I am not very familiar with the code that commit touches.

I think that would be useful indeed.


Created an issue for this here: Performance regression returning anonymous function from struct · Issue #40606 · JuliaLang/julia · GitHub


Thanks!