Same code run multiple times gives wildly different timings

Here is a runaway case under the profiler.

Top entry is:

_turbo_!(::Val{(false, 0, 0, 0, false, 4, 32, 15, 64, 32768, 262144, 12582912, 0x0000000000000005)}, ::Val{(:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000021, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0001, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0002, 0x02), :LoopVectorization, :div_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000021, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000010002, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0003, 0x00), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0004, 0x03), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0005, 0x04), :numericconstant, Symbol("###reduction##zero###11###"), LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0006, 0x00), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000021, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000300040006, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0006, 0x00), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000070005, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0005, 0x00), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000008, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0007, 0x04))}, ::Val{(LoopVectorization.ArrayRefStruct{:G, Symbol("##vptr##_G")}(0x00000000000000000000000000000101, 0x00000000000000000000000000000201, 0x00000000000000000000000000000000, 0x00000000000000000000000000000101), LoopVectorization.ArrayRefStruct{:norm, Symbol("##vptr##_norm")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:data, Symbol("##vptr##_data")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:mu, Symbol("##vptr##_mu")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001))}, ::Val{(0, (), (), (), (), ((6, LoopVectorization.IntOrFloat),), ())}, ::Val{(:n, :k)}, ::Val{Tuple{Tuple{CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}, CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}}, Tuple{LayoutPointers.GroupedStridedPointers{NTuple{4, Ptr{Float64}}, (1, 1, 1, 1), (0, 0, 0, 0), ((1, 2), (1,), (1,), (1,)), ((1, 2), (3,), (4,), (5,)), Tuple{Static.StaticInt{8}, Int64, Static.StaticInt{8}, Static.StaticInt{8}, Static.StaticInt{8}}, NTuple{5, Static.StaticInt{0}}}}}}, ::Int64, ::Int64, ::Ptr{Float64}, ::Ptr{Float64}, ::Ptr{Float64}, ::Ptr{Float64}, ::Int64)

C:\Users\Win10\.julia\packages\LoopVectorization\tSQDi\src\reconstruct_loopset.jl

  Total:       14675      14675 (flat, cum) 35.14%
    713        14675      14675           @generated function _turbo_!( 
    714            .          .             ::Val{var"#UNROLL#"}, ::Val{var"#OPS#"}, ::Val{var"#ARF#"}, ::Val{var"#AM#"}, ::Val{var"#LPSYM#"}, ::Val{Tuple{var"#LB#",var"#V#"}}, var"#flattened#var#arguments#"::Vararg{Any,var"#num#vargs#"} 
    715            .          .           ) where {var"#UNROLL#", var"#OPS#", var"#ARF#", var"#AM#", var"#LPSYM#", var"#LB#", var"#V#", var"#num#vargs#"} 
    716            .          .             # 1 + 1 # Irrelevant line you can comment out/in to force recompilation... 
    717            .          .             ls = _turbo_loopset(var"#OPS#", var"#ARF#", var"#AM#", var"#LPSYM#", var"#LB#".parameters, var"#V#".parameters, var"#UNROLL#") 
    718            .          .             pushfirst!(ls.preamble.args, :(var"#lv#tuple#args#" = reassemble_tuple(Tuple{var"#LB#",var"#V#"}, var"#flattened#var#arguments#"))) 

This seems to be triggered from step_M!, which leads me to the question of whether

evidences .= 0
@tturbo for n ∈ 1:N, k ∈ 1:K
    evidences[k] += G[k, n]
end

is race-free. I’d propose using something like

evidences .= 0
@tturbo for k ∈ 1:K
    for n ∈ 1:N
        evidences[k] += G[k, n]
    end
end

instead, assuming it is equivalent to (or better than) Threads.@threads for k ∈ 1:K…
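For intuition, the Threads.@threads version is race-free because each thread owns a disjoint slice of k, so no two threads ever write to the same evidences[k]. A minimal plain-Julia sketch (no LoopVectorization; function name is hypothetical):

using Base.Threads

function sum_over_n!(evidences::Vector{Float64}, G::Matrix{Float64})
    K, N = size(G)
    evidences .= 0
    @threads for k in 1:K
        acc = 0.0                # thread-local accumulator
        for n in 1:N
            acc += G[k, n]
        end
        evidences[k] = acc       # single write per k, no contention
    end
    evidences
end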

Yeah, the results look fine. Also, it essentially loops maxiter::Integer=10_000 times over the same code. There are no stopping conditions yet or anything that could make iteration 1 different from iteration 100 or whatever.

I tried it but couldn’t find anything new. Let me try again…

I, for example, have no idea, unfortunately. I guess LoopVectorization does some magic to ensure that it’s fine? I’m also seeing atomic-related operations in my profile results, as well as calls to functions named wait, so maybe those are used for synchronization.

A more extreme case could be in the ELBO calculation:

ret = 0.0
@tturbo for n ∈ 1:N, k ∈ 1:K
    q = G[k, n]
    ret += q * (
        log(p[k]) - (log(2π) + log(var[k]) + (data[n] - mu[k])^2 / var[k]) / 2
        - log(q + 1e-100)
    )
end

I don’t know how it can update ret from multiple threads without causing a lot of race conditions. However, the documentation shows that summing over stuff is a valid use-case.
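As far as I can tell, the macro rewrites such reductions into per-lane (and, for @tturbo, per-thread) partial accumulators that are combined after the loop, so ret is never updated concurrently. A sketch of that documented scalar-reduction pattern:

using LoopVectorization

function turbo_sum(x::Vector{Float64})
    s = 0.0
    # `s += x[i]` is recognized as a reduction: the macro keeps
    # vector partial sums and folds them into `s` after the loop.
    @turbo for i in eachindex(x)
        s += x[i]
    end
    s
end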

Changed to this and immediately ran into a slowdown… I think LoopVectorization reorders the loops, so the Julia code doesn’t really reflect the actual computations that are being run.

Actually, wait, I changed to the wrong thing…

Nope, same slowdowns anyway. For some reason, the slowdowns look like this:

  1. Everything starts running as fast as the nothing case, nice and smooth
  2. At some point, it sort of trips up and the progress bar starts "struggling", if you pardon my humanization of computers.
  3. The ETA starts slowly increasing
  4. This continues for about 60% of the progress bar, then the ETA starts decreasing
  5. …aaaand it’s done, after like a minute or two

As do I. So I’d assume the good case can access evidences[k] in parallel and the bad case has to synchronize…

Which has to synchronize access…

They’re literally the same for step_M!, though: it accepts any MaybeReg, and the loop doesn’t depend on this parameter at all. Or, shouldn’t depend.

Let me just remove all differences between the two cases and have all functions run the same code regardless of reg…

EDIT: everything works fine if I ignore regularization:

finalize_variances!(var::AV{Float64}, norm::AV{Float64}, reg::Nothing) = @turbo var .= var ./ norm .+ 1e-6

finalize_variances!(var::AV{Float64}, norm::AV{Float64}, reg::InvGamma) =
    finalize_variances!(var, norm, nothing)

Makes total sense, I guess?

Here is a result of replacing @tturbo with @turbo on my 6 cores (12 threads):

This indicates to me that the parallelization doesn’t scale well in our case, which in turn could mean we’ve hit an edge case for @tturbo.

Sorry, you lost me here…

I mean, I removed all differences between the nothing case and the InvGamma case, and the timings look the same, which makes sense since the two cases became identical.


Tried @turbo (1 CPU core) instead of @tturbo:

  1. Run 1: 00:48 vs 00:49
  2. Run 2: 00:49 vs 01:17
  3. Run 3: 00:48 vs 03:24 (!)

CPU usage was around 100% out of 400% (because I have 4 CPU cores total) and never dropped while running the code.

Confirmed, off to profile @turbo tomorrow ;)

Did you try fixing the random seed?

import Random
Random.seed!(123)
# Generate data

It appears that the timings are input-dependent.


I tried that and couldn’t reproduce the issue after fixing the seed. However, I’m trying to check whether my code produces correct results, so I’m running it on random data on purpose. What’s the point of a parameter estimation algorithm if it only works with one particular data set?

I wanted to output the random seed that caused the slowdown, but couldn’t find a way of extracting it. Random.seed!() sets the seed, but apparently there’s no obvious way of getting the current seed. Why is there no Random.seed() that returns the current seed?

On Julia 1.7:

  • From here: random - Retrieve RNG seed in julia - Stack Overflow
    • Base.Random.RANDOM_SEED throws UndefVarError: Random not defined
    • After import Random: Random.RANDOM_SEED throws UndefVarError: RANDOM_SEED not defined
    • Random.GLOBAL_RNG.seed says type _GLOBAL_RNG has no field seed
    • Random.default_rng().seed says type TaskLocalRNG has no field seed
    • Random.GLOBAL_SEED isn’t updated after I call rand(), so I guess there’s more random state behind the scenes…
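So the best workaround seems to be picking the seed myself and logging it before seeding (a sketch using Random.RandomDevice):

import Random

# Draw a seed from the OS entropy source, record it, then use it,
# so a run that triggers the slowdown can be reproduced later.
seed = rand(Random.RandomDevice(), UInt64)
@info "Using seed" seed
Random.seed!(seed)
# Generate data...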

You need to save the seed when you set it. You cannot obtain it afterwards. If the slowdown happens often you should be able to find a seed that produces a slowdown by just trying a few seeds. For me, seed!(1) did produce a slowdown, while seed!(2) did not. I agree it is good to test on random data, but I hope this would help locate the problem.
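Something like this loop can scan for a seed that reproduces it (run_fit is a placeholder for whatever wraps the fitting code):

import Random

for seed in 1:20
    Random.seed!(seed)
    t = @elapsed run_fit()   # hypothetical entry point into the EM code
    println("seed = $seed: $(round(t; digits=2)) s")
end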


Indeed, Random.seed!(1) results in 00:45 for the nothing case and 01:21 for the InvGamma case. Another run shows 00:45 and 01:20.

BTW, this happens even without any involvement of LoopVectorization. I deleted all mentions of @turbo and friends, commented out using LoopVectorization, and still got 01:36 vs 02:26 - almost a 1 min difference!


Updated version w/o LoopVectorization and lots of @asserts to ensure that everything is of the correct shape: https://gist.github.com/ForceBru/8e7f6940c3013e15f62a6c660a618c85#file-version_2_noloopvec-jl

Some timings:

  1. 01:30 (asserts slow things down) vs 02:20
  2. 01:29 vs 02:20
  3. 01:31 vs 02:20

I noticed the following: the slowdown starts when var assumes a specific value:

Progress:  70%|████████████████████████████████▏             |  ETA: 0:00:02
(i, gmm.var) = (142, [0.2454790079230406, 0.16000283547928387])
(i, gmm.var) = (143, [0.2393365282721729, 0.15443620866888444])
(i, gmm.var) = (144, [0.24284254899660437, 0.1619336945853131])
Progress:  72%|█████████████████████████████████▊            |  ETA: 0:00:02
(i, gmm.var) = (145, [0.29729978746329194, 1.0])
(i, gmm.var) = (146, [0.2972146796343939, 1.0])
(i, gmm.var) = (147, [0.2942993215997909, 1.0])
Progress:  73%|██████████████████████████████████▎           |  ETA: 0:00:02
(i, gmm.var) = (148, [0.2929915147487294, 1.0])
Progress:  74%|██████████████████████████████████▋           |  ETA: 0:00:03
(i, gmm.var) = (149, [0.29300358886192024, 1.0])
Progress:  74%|██████████████████████████████████▋

There, it started once the second element of var became 1.0.

Maybe that gives you a hint on what is wrong there.


Wow, indeed, it slows down as soon as there’s a k such that gmm.var[k] == 1.0.

Here I changed the fit! function to output the time it took to run the main loop:

timing = @timed for i ∈ 1:maxiter
    EM.step_E!(gmm.G, gmm.norm, gmm.p, gmm.mu, gmm.var, data, reg)
    EM.step_M!(gmm.G, gmm.norm, gmm.p, gmm.mu, gmm.var, data, reg)

    gmm.ELBO_hist[i] = EM.ELBO(gmm.G, gmm.p, gmm.mu, gmm.var, data, reg)
end

@show (gmm.var, timing.time)

Here’s the output from the InvGamma case:

[ Info: Fitting... (with regularization)
(gmm.var, timing.time) = ([0.12979600519462, 0.22091734277480923], 0.437090557)
(gmm.var, timing.time) = ([0.13081625184968007, 0.2206570705655502], 0.43813817)
(gmm.var, timing.time) = ([0.12994360015250062, 0.2205964244568769], 0.444398958)
(gmm.var, timing.time) = ([0.13205819345301345, 0.22466612385111753], 0.436146891)
(gmm.var, timing.time) = ([0.131606469707795, 0.22649328809623276], 0.439656724)
(gmm.var, timing.time) = ([0.13010806538724098, 0.23327644124826644], 0.437758792)
(gmm.var, timing.time) = ([0.1279562407054046, 0.23367085391900447], 0.437707178)
...
(gmm.var, timing.time) = ([0.24411286011714176, 0.16580655159485652], 0.448527361)
(gmm.var, timing.time) = ([0.2662679258567789, 0.19150053955326957], 0.446357033)
(gmm.var, timing.time) = ([0.2629225941275259, 0.18116854470156368], 0.447261846)
(gmm.var, timing.time) = ([0.26214244460585273, 0.17728904458942218], 0.448244723)
(gmm.var, timing.time) = ([0.26069692251129745, 0.17328849490587808], 0.446762994)
(gmm.var, timing.time) = ([0.29913480492029554, 1.0], 0.434144478)
(gmm.var, timing.time) = ([0.2998579815176012, 1.0], 1.305444)
(gmm.var, timing.time) = ([0.3004979961705274, 1.0], 1.707785367)
(gmm.var, timing.time) = ([0.3017003699016324, 1.0], 1.699075401)
(gmm.var, timing.time) = ([0.30120562536428414, 1.0], 1.688268186)
(gmm.var, timing.time) = ([0.3014536469939427, 1.0], 1.682149487)
(gmm.var, timing.time) = ([0.3044833045360511, 1.0], 1.682936661)
(gmm.var, timing.time) = ([0.30415011758494825, 1.0], 1.718251328)
(gmm.var, timing.time) = ([0.3039567622674634, 1.0], 1.71807702)
(gmm.var, timing.time) = ([0.3040352748957333, 1.0], 1.690801447)
(gmm.var, timing.time) = ([0.3041718322931663, 1.0], 1.684268279)
(gmm.var, timing.time) = ([0.3036123236862732, 1.0], 1.683556643)
(gmm.var, timing.time) = ([0.3018321975190529, 1.0], 1.685763053)
(gmm.var, timing.time) = ([0.3011188102006153, 1.0], 1.689479624)
(gmm.var, timing.time) = ([0.3034159553051776, 1.0], 1.69546642)
(gmm.var, timing.time) = ([0.3062028200798609, 1.0], 1.763214379)
(gmm.var, timing.time) = ([0.3061830104269247, 1.0], 1.915075644)
(gmm.var, timing.time) = ([0.3042153151797508, 1.0], 1.72388072)
(gmm.var, timing.time) = ([0.30873714160720517, 1.0], 1.765052766)
(gmm.var, timing.time) = ([0.3102281299289041, 1.0], 1.819851372)
(gmm.var, timing.time) = ([0.30781217250225834, 1.0], 1.711476321)
(gmm.var, timing.time) = ([0.3084804939882712, 1.0], 1.722908551)
(gmm.var, timing.time) = ([0.3096173941531654, 1.0], 1.749803121)
(gmm.var, timing.time) = ([0.3096099546780613, 1.0], 1.792705749)
(gmm.var, timing.time) = ([0.3094866765138834, 1.0], 1.756672981)
(gmm.var, timing.time) = ([0.30804567032480773, 1.0], 1.841761611)
(gmm.var, timing.time) = ([0.3063997050993591, 1.0], 1.798693229)
(gmm.var, timing.time) = ([0.30621467867612306, 1.0], 1.829438462)
(gmm.var, timing.time) = ([0.3055715213101416, 1.0], 1.732961334)
(gmm.var, timing.time) = ([0.30556030772413, 1.0], 1.732991028)
(gmm.var, timing.time) = ([0.3052487146454984, 1.0], 1.733896484)
(gmm.var, timing.time) = ([0.3045452389182044, 1.0], 1.750909186)
(gmm.var, timing.time) = ([0.3030495221465515, 1.0], 1.718392537)
(gmm.var, timing.time) = ([0.3028247679552457, 1.0], 1.733551232)
(gmm.var, timing.time) = ([0.3028697369563701, 1.0], 1.722950705)
(gmm.var, timing.time) = ([0.3026594808005801, 1.0], 1.725680657)
(gmm.var, timing.time) = ([0.30277800942558725, 1.0], 1.739890083)
(gmm.var, timing.time) = ([0.3039729625602777, 1.0], 1.728341708)
(gmm.var, timing.time) = ([0.3000833827259703, 1.0], 1.711552785)
(gmm.var, timing.time) = ([0.2998545902702496, 1.0], 1.740374686)
(gmm.var, timing.time) = ([0.2985882769517061, 1.0], 1.728837756)
(gmm.var, timing.time) = ([0.3004974524101312, 1.0], 1.725061656)
(gmm.var, timing.time) = ([0.30018938852066274, 1.0], 1.731628271)
(gmm.var, timing.time) = ([0.30075313157351014, 1.0], 1.761483254)

So, after some gmm.var becomes equal to 1.0, the main for i ∈ 1:maxiter loop slows down from 0.44 seconds to 1.7 seconds for the same 10_000 iterations!

WTF, though? Numbers are numbers - who cares if it’s a 1.0 or not?! My code most definitely doesn’t ever check the values of gmm.var and treats all of them equally…


I tried to manually set all variances to 1.0 in the code (var .= 1.0 at the end of M_variances!), but that didn’t cause an immediate slowdown. Let me check parameters other than var…

One of the gmm.ps becomes very close to zero as soon as one of gmm.vars becomes one:

(gmm.p, gmm.mu, gmm.var, timing.time) = ([0.8741740560593829, 0.12582594394061702], [-0.12445426039294907, 0.5249887923292634], [0.2662679258567789, 0.19150053955326957], 0.44847282)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([0.8648908241877292, 0.13510917581227083], [-0.13544645815557219, 0.5379089378899188], [0.2629225941275259, 0.18116854470156368], 0.450924207)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([0.865731986380992, 0.13426801361900803], [-0.13610413303321459, 0.5472323395395478], [0.26214244460585273, 0.17728904458942218], 0.475586434)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([0.8652912440657172, 0.13470875593428275], [-0.1381471970307616, 0.5559140481961833], [0.26069692251129745, 0.17328849490587808], 0.469968482)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 1.2061982368350508e-211], [-0.04850820421805095, 0.006681728278170384], [0.29913480492029554, 1.0], 0.440431988)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.04626372621010133, 0.03450195760521329], [0.2998579815176012, 1.0], 1.330318503)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.04699553272949396, 0.03458113888553481], [0.3004979961705274, 1.0], 1.692210473)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.045486202307925855, 0.03131468434729816], [0.3017003699016324, 1.0], 1.702397646)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.04326066544042423, 0.027498802822819322], [0.30120562536428414, 1.0], 1.709937289)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.044049095734514813, 0.029713008034894354], [0.3014536469939427, 1.0], 1.687663125)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.0381826364743794, 0.023526517815407426], [0.3044833045360511, 1.0], 1.72919904)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03450874044454262, -0.10279512414347246], [0.30415011758494825, 1.0], 1.698203259)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03227144734221198, -0.10583507664550244], [0.3039567622674634, 1.0], 1.698433758)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03049081412388554, -0.10892458265410389], [0.3040352748957333, 1.0], 1.791548164)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.030614532277827552, -0.10904109434424808], [0.3041718322931663, 1.0], 1.698832146)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.02850837278400225, -0.11572266780306648], [0.3036123236862732, 1.0], 1.687605552)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.027267131004442623, -0.11711481647082765], [0.3018321975190529, 1.0], 1.723543712)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.02575499747282709, -0.12420180532082362], [0.3011188102006153, 1.0], 1.698879028)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03014875432063738, -0.11262216693610323], [0.3034159553051776, 1.0], 1.688220046)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.0320080543425126, -0.11369891044187358], [0.3062028200798609, 1.0], 1.721568633)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03184012338729602, -0.11359358789156354], [0.3061830104269247, 1.0], 1.6933472)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.02829610826409637, -0.12348337442514049], [0.3042153151797508, 1.0], 1.689155822)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.02521647442410843, -0.10903874464916333], [0.30873714160720517, 1.0], 1.726852347)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.027064774974067913, -0.10587154949003763], [0.3102281299289041, 1.0], 1.699564989)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.02916858144834629, -0.10441993531115061], [0.30781217250225834, 1.0], 1.689730687)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.030086767918777024, -0.1032263136922124], [0.3084804939882712, 1.0], 1.721201442)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03275468341662056, -0.09736270656082882], [0.3096173941531654, 1.0], 1.72068594)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.032912918039889286, -0.09746219043155313], [0.3096099546780613, 1.0], 1.877142411)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03103516894584259, -0.09818507031722594], [0.3094866765138834, 1.0], 1.75439523)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03232002393893344, -0.10033161587270507], [0.30804567032480773, 1.0], 1.68914445)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.034857371398213495, -0.09420142395036593], [0.3063997050993591, 1.0], 1.72629083)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03499666838440872, -0.09416240415179211], [0.30621467867612306, 1.0], 1.693519697)

Some numbers are more difficult than others - for example, the 5.0e-323 you observed. That number is so small that dividing it by a couple more powers of ten underflows to exactly zero. If I understand correctly, such small numbers are called subnormal, and they are slower to compute with.
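You can check this directly in the REPL (issubnormal and floatmin are in Base):

julia> x = 5.0e-323
5.0e-323

julia> issubnormal(x)    # far below floatmin(Float64), the smallest normal value
true

julia> floatmin(Float64)
2.2250738585072014e-308

julia> x / 100           # a couple more orders of magnitude and it underflows to 0
0.0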


Looks like this is the problem. In particular, calculating logarithms is very slow for such small numbers:

julia> mix.G
2Γ—300 Matrix{Float64}:
 2.0e-323  1.5e-323  3.0e-323  1.5e-323  1.0e-323  1.5e-323  1.5e-323  1.5e-323  1.5e-323  2.5e-323  1.5e-323  …  1.5e-323  1.5e-323  1.5e-323  1.5e-323  5.0e-323  1.5e-323  2.0e-323  2.0e-323  1.5e-323  2.5e-323
 1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0          1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0

So the first row contains these tiny numbers, while the second one is fine. Computing logarithms of these rows shows a 4.6 times slowdown:

julia> @benchmark $log.($mix.G[1, :])
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  20.047 μs … 63.839 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.318 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.694 μs ±  1.546 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▅ ▃▃▂     ▄▃▁                                            ▂
  ████████▄▁▃▇█████▇▆▆▅▆▄▆▇▆▅▅▅▆▅▁▃▁▁▃▁▁▁▃▁▃▁▃▁▁▁▁▁▁▁▃▁▁▅▆▇▆▆ █
  20 μs        Histogram: log(frequency) by time      29.2 μs <

 Memory estimate: 5.00 KiB, allocs estimate: 2.

julia> @benchmark $log.($mix.G[2, :])
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.481 μs …  3.587 ms  ┊ GC (min … max):  0.00% … 99.73%
 Time  (median):     4.348 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.097 μs ± 66.041 μs  ┊ GC (mean ± σ):  21.60% ±  1.99%

  ▄▃▃█▇▆▄ ▂▆▂▂▂             ▃▆▇▆▆▄▃▂▁▁▂▂▃▂▂▁ ▁               ▂
  █████████████▆▄▆▅▄▆▇▇▇▇▇▇▇████████████████████▇▇▆▆▆▇▇▆▇▆▇ █
  3.48 μs      Histogram: log(frequency) by time     7.51 μs <

 Memory estimate: 5.00 KiB, allocs estimate: 2.

julia> 20/4.348
4.599816007359705

On the other hand, I’m actually computing log.(gmm.G .+ 1e-100), and benchmarking that doesn’t show any differences between the two rows…
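Which makes sense, I think: adding 1e-100 completely swamps the subnormals, so log never actually sees one:

julia> 1.5e-323 + 1e-100   # the subnormal disappears into 1e-100
1.0e-100

julia> issubnormal(1.5e-323 + 1e-100)
false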


Division is also really slow:

julia> @benchmark $mix.G[2, :] ./ $mix.norm
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.327 μs …  2.752 ms  ┊ GC (min … max):  0.00% … 99.81%
 Time  (median):     1.631 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.833 μs ± 48.623 μs  ┊ GC (mean ± σ):  34.15% ±  2.00%

  ▄▅▃▁▃██▅▄▄▁                          ▁▂▂▂▂▂▂▁              ▂
  ████████████▇▆▄▄▄▄▅▁▄▄▇▇▇▇▇▆▄▃▃▃▅▅▃▅████████████▇▇▇▇▇▆▇▇▆▇ █
  1.33 μs      Histogram: log(frequency) by time     4.24 μs <

 Memory estimate: 5.03 KiB, allocs estimate: 3.

julia> @benchmark $mix.G[1, :] ./ $mix.norm
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.394 μs …  4.395 ms  ┊ GC (min … max):  0.00% … 99.69%
 Time  (median):     6.740 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   7.908 μs ± 61.086 μs  ┊ GC (mean ± σ):  10.91% ±  1.41%

  ▂▄▂▁▇█▄▃▅▃ ▂▂▁ ▃▃             ▁▂▂▂▂▂▁▁▁▁                   ▂
  ██████████▆███████▃▄▆▃▃▅▄▃▁▄▃▇█████████████▇▇█▇▅▃▅▆▇▅▅▆▅▅▅ █
  6.39 μs      Histogram: log(frequency) by time     10.1 μs <

 Memory estimate: 5.03 KiB, allocs estimate: 3.

I’m dividing here:

function finalize_posteriors!(G::AM{Float64}, norm::AV{Float64}, reg::Nothing)
    _K, N = size(G)
    for n ∈ 1:N, k ∈ 1:_K
        G[k, n] /= norm[n]
    end
end
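If subnormals in G are the culprit, one possible mitigation (just a sketch, not necessarily the right modeling choice) is to flush anything below the normal range to zero right here, so later logs and divisions never touch a subnormal:

function finalize_posteriors!(G::AM{Float64}, norm::AV{Float64}, reg::Nothing)
    _K, N = size(G)
    for n ∈ 1:N, k ∈ 1:_K
        g = G[k, n] / norm[n]
        # Flush-to-zero by hand: anything below the smallest
        # normal Float64 (floatmin) is treated as exactly 0.0.
        G[k, n] = ifelse(g < floatmin(Float64), 0.0, g)
    end
end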

Yes, exactly: you have a combination of problems. First, the operations can be much slower for certain kinds of numbers; second, you effectively converged to exactly those numbers for which the operations are slow.

I think you should, in a practical scenario, detect some sort of convergence and stop.
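For example, a simple relative-change stopping rule on the ELBO (a sketch reusing the names from the loop above):

for i ∈ 1:maxiter
    EM.step_E!(gmm.G, gmm.norm, gmm.p, gmm.mu, gmm.var, data, reg)
    EM.step_M!(gmm.G, gmm.norm, gmm.p, gmm.mu, gmm.var, data, reg)
    gmm.ELBO_hist[i] = EM.ELBO(gmm.G, gmm.p, gmm.mu, gmm.var, data, reg)

    # Stop once the ELBO has effectively stopped improving.
    if i > 1 && abs(gmm.ELBO_hist[i] - gmm.ELBO_hist[i-1]) <= 1e-8 * abs(gmm.ELBO_hist[i])
        break
    end
end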

What a rabbit hole! Well, looks like the issue has finally been found. @lmiq, @goerch and @mikkoku, thank you very much for debugging this, I really appreciate it!


See also set_zero_subnormals:

julia> z = fill(1.5e-323, 1024);

julia> @btime log.($z);
  6.700 μs (1 allocation: 8.12 KiB)

julia> set_zero_subnormals(true)
true

julia> @btime log.($z);
  1.830 μs (1 allocation: 8.12 KiB)
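Note that set_zero_subnormals(true) trades strict IEEE semantics for speed and is per-thread CPU state, so it’s worth scoping it to the hot region (a sketch; fit! stands in for the actual entry point):

old = get_zero_subnormals()
set_zero_subnormals(true)       # flush subnormals to zero on this thread
try
    fit!(gmm, data; maxiter=10_000)
finally
    set_zero_subnormals(old)    # restore the previous mode
end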