Crossposting this from Zulip to reach more people and because it’s getting a bit long.
tl;dr
After upgrading to 1.6 my code runs 50 times slower. It seems to be caused by broadcasting: the number of calls to materialize
increases to about 140x what it was in 1.5.4.
Nikolai: I just restarted my development Julia REPL for the first time since updating to 1.6.0, and suddenly my code runs abysmally slowly.
Luckily I have another REPL open with 1.5.4, so I was able to time the same code under both versions. 1.6.0 is close to 50 times slower.
I’ll do some profiling and try to figure out what has gotten so much slower. In the meantime, if anyone has ideas about what might have happened, or other advice, please let me know.
(It’s particle/kernel based ML code, most of the heavy work is differentiation and matrix inversion)
Nikolai: _methods_by_ftype seems to be taking up all of the time.
Nikolai: My function grad_log_likelihood apparently calls materialize, which is something to do with broadcasting, and that is what takes up all the time.
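For anyone wondering where materialize comes from: a dotted expression like t .- t is lowered to a lazy Base.Broadcast.Broadcasted object, and materialize is the call that actually allocates the output and runs the fused loop. A minimal sketch of what the lowering does:

```julia
# `t .- t` lowers (roughly) to materialize(broadcasted(-, t, t)):
t = rand(10)
bc = Base.Broadcast.broadcasted(-, t, t)   # lazy; nothing is computed yet
out = Base.Broadcast.materialize(bc)       # allocates the result and runs the fused loop
@assert out == t .- t
```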
Nikolai: Comparing the profiler output between the versions: 1.5.4 makes 241 calls to materialize from grad_log_likelihood, while 1.6.0 makes 34678 calls for the same code.
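In case anyone wants to reproduce this kind of count, here is a minimal sketch of profiling a broadcasting function with the Profile stdlib; f here is just a hypothetical stand-in for grad_log_likelihood:

```julia
using Profile

f(t) = sum(t .- t)   # hypothetical stand-in for grad_log_likelihood
t = rand(10)

f(t)                 # run once first so compilation isn't profiled
Profile.clear()
@profile for _ in 1:10^5
    f(t)
end
# Flat report; compare the sample counts next to
# Base.Broadcast.materialize across Julia versions:
Profile.print(format=:flat, sortedby=:count)
```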
Nikolai: Broadcasting seems to be a lot worse in 1.6. Running

using BenchmarkTools
n_samples = 10
t = rand(n_samples)
@benchmark t .- t
t = rand(1, n_samples)
@benchmark t .- t
In 1.6 produces
julia> t = rand( n_samples)
10-element Vector{Float64}:
0.09906767837656405
0.47702497557296364
0.27244814554430064
0.5818925794685812
0.5914572676458125
0.8520456245803281
0.39588600485727743
0.37298406444837795
0.023069872140623504
0.802804963181807
julia> @benchmark t .- t
BenchmarkTools.Trial:
memory estimate: 224 bytes
allocs estimate: 3
--------------
minimum time: 388.323 ns (0.00% GC)
median time: 490.104 ns (0.00% GC)
mean time: 515.128 ns (3.45% GC)
maximum time: 36.904 μs (98.37% GC)
--------------
samples: 10000
evals/sample: 198
julia> t = rand(1, n_samples)
1×10 Matrix{Float64}:
0.43367 0.665772 0.492303 0.878866 … 0.570874 0.199838 0.0904479
julia> @benchmark t .- t
BenchmarkTools.Trial:
memory estimate: 224 bytes
allocs estimate: 3
--------------
minimum time: 731.252 ns (0.00% GC)
median time: 921.413 ns (0.00% GC)
mean time: 942.107 ns (1.59% GC)
maximum time: 51.177 μs (98.02% GC)
--------------
samples: 10000
evals/sample: 143
while in 1.5.4 it gives
julia> t = rand( n_samples)
10-element Array{Float64,1}:
0.2734235025313967
0.2527912981883662
0.5783187947367492
0.9678137280948695
0.5371453605927523
0.21785488024442845
0.440081324980073
0.0990831893972055
0.8702855813288388
0.4049051169846596
julia> @benchmark t .- t
BenchmarkTools.Trial:
memory estimate: 224 bytes
allocs estimate: 3
--------------
minimum time: 301.141 ns (0.00% GC)
median time: 380.618 ns (0.00% GC)
mean time: 397.128 ns (3.62% GC)
maximum time: 73.303 μs (99.23% GC)
--------------
samples: 10000
evals/sample: 262
julia> t = rand(1, n_samples)
1×10 Array{Float64,2}:
0.240928 0.592313 0.258389 0.933685 … 0.611324 0.361393 0.234236
julia> @benchmark t .- t
BenchmarkTools.Trial:
memory estimate: 224 bytes
allocs estimate: 3
--------------
minimum time: 345.243 ns (0.00% GC)
median time: 436.333 ns (0.00% GC)
mean time: 461.008 ns (3.85% GC)
maximum time: 89.437 μs (99.35% GC)
--------------
samples: 10000
evals/sample: 230
So for a column vector there is a slight performance difference, but for a row vector it’s almost a factor of two.
(It does seem to be constant overhead; the difference is a lot smaller for longer vectors.)
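One caveat when reading the absolute numbers above: t is a non-const global, so each sample also pays for dynamic dispatch on top of the broadcast itself. Interpolating the variable with $, as the BenchmarkTools manual recommends, removes that overhead and isolates the broadcast kernel, which may change the measured gap:

```julia
using BenchmarkTools

n_samples = 10           # matches the sizes used above
t = rand(n_samples)
@benchmark $t .- $t      # `$` splices the value in, avoiding the global lookup

t = rand(1, n_samples)
@benchmark $t .- $t
```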