Same code run multiple times gives wildly different timings

Here is a runaway case under the profiler.

Top entry is:

_turbo_!(::Val{(false, 0, 0, 0, false, 4, 32, 15, 64, 32768, 262144, 12582912, 0x0000000000000005)}, ::Val{(:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000021, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0001, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0002, 0x02), :LoopVectorization, :div_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000021, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000010002, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0003, 0x00), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0004, 0x03), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0005, 0x04), :numericconstant, Symbol("###reduction##zero###11###"), LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0006, 0x00), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000021, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000300040006, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0006, 0x00), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000070005, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0005, 0x00), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000008, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0007, 0x04))}, ::Val{(LoopVectorization.ArrayRefStruct{:G, Symbol("##vptr##_G")}(0x00000000000000000000000000000101, 0x00000000000000000000000000000201, 0x00000000000000000000000000000000, 0x00000000000000000000000000000101), LoopVectorization.ArrayRefStruct{:norm, Symbol("##vptr##_norm")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:data, Symbol("##vptr##_data")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:mu, Symbol("##vptr##_mu")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001))}, ::Val{(0, (), (), (), (), ((6, LoopVectorization.IntOrFloat),), ())}, ::Val{(:n, :k)}, ::Val{Tuple{Tuple{CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}, CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}}, Tuple{LayoutPointers.GroupedStridedPointers{NTuple{4, Ptr{Float64}}, (1, 1, 1, 1), (0, 0, 0, 0), ((1, 2), (1,), (1,), (1,)), ((1, 2), (3,), (4,), (5,)), Tuple{Static.StaticInt{8}, Int64, Static.StaticInt{8}, Static.StaticInt{8}, Static.StaticInt{8}}, NTuple{5, Static.StaticInt{0}}}}}}, ::Int64, ::Int64, ::Ptr{Float64}, ::Ptr{Float64}, ::Ptr{Float64}, ::Ptr{Float64}, ::Int64)

C:\Users\Win10\.julia\packages\LoopVectorization\tSQDi\src\reconstruct_loopset.jl

  Total:       14675      14675 (flat, cum) 35.14%
    713        14675      14675           @generated function _turbo_!( 
    714            .          .             ::Val{var"#UNROLL#"}, ::Val{var"#OPS#"}, ::Val{var"#ARF#"}, ::Val{var"#AM#"}, ::Val{var"#LPSYM#"}, ::Val{Tuple{var"#LB#",var"#V#"}}, var"#flattened#var#arguments#"::Vararg{Any,var"#num#vargs#"} 
    715            .          .           ) where {var"#UNROLL#", var"#OPS#", var"#ARF#", var"#AM#", var"#LPSYM#", var"#LB#", var"#V#", var"#num#vargs#"} 
    716            .          .             # 1 + 1 # Irrelevant line you can comment out/in to force recompilation... 
    717            .          .             ls = _turbo_loopset(var"#OPS#", var"#ARF#", var"#AM#", var"#LPSYM#", var"#LB#".parameters, var"#V#".parameters, var"#UNROLL#") 
    718            .          .             pushfirst!(ls.preamble.args, :(var"#lv#tuple#args#" = reassemble_tuple(Tuple{var"#LB#",var"#V#"}, var"#flattened#var#arguments#"))) 

This seems to be triggered from step_M!, which leads me to the question of whether

evidences .= 0
@tturbo for n ∈ 1:N, k ∈ 1:K
    evidences[k] += G[k, n]
end

is race-free. I’d propose using something like

evidences .= 0
@tturbo for k ∈ 1:K
    for n ∈ 1:N
        evidences[k] += G[k, n]
    end
end

instead, assuming it is equivalent to (or better than) Threads.@threads for k ∈ 1:K…
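For intuition, the Threads.@threads version is race-free because each thread owns a disjoint slice of k, so no two threads ever write to the same evidences[k]. A minimal plain-Julia sketch (no LoopVectorization; function name is hypothetical):

using Base.Threads

function sum_over_n!(evidences::Vector{Float64}, G::Matrix{Float64})
    K, N = size(G)
    evidences .= 0
    @threads for k in 1:K
        acc = 0.0                # thread-local accumulator
        for n in 1:N
            acc += G[k, n]
        end
        evidences[k] = acc       # single write per k, no contention
    end
    evidences
end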

Yeah, the results look fine. Also, it essentially loops maxiter::Integer=10_000 times over the same code. There are no stopping conditions yet or anything that could make iteration 1 different from iteration 100 or whatever.

I tried it but couldn’t find anything new. Let me try again…

I, for example, have no idea, unfortunately. I guess LoopVectorization does some magic to ensure that it’s fine? I’m also seeing atomic-related operations in my profile results, as well as calls to functions named wait, so maybe those are used for synchronization.

A more extreme case could be in the ELBO calculation:

ret = 0.0
@tturbo for n ∈ 1:N, k ∈ 1:K
    q = G[k, n]
    ret += q * (
        log(p[k]) - (log(2π) + log(var[k]) + (data[n] - mu[k])^2 / var[k]) / 2
        - log(q + 1e-100)
    )
end

I don’t know how it can update ret from multiple threads without causing a lot of race conditions. However, the documentation shows that summing over stuff is a valid use-case.
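As far as I can tell, the macro rewrites such reductions into per-lane (and, for @tturbo, per-thread) partial accumulators that are combined after the loop, so ret is never updated concurrently. A sketch of that documented scalar-reduction pattern:

using LoopVectorization

function turbo_sum(x::Vector{Float64})
    s = 0.0
    # `s += x[i]` is recognized as a reduction: the macro keeps
    # vector partial sums and folds them into `s` after the loop.
    @turbo for i in eachindex(x)
        s += x[i]
    end
    s
end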

Changed to this and immediately ran into a slowdown… I think LoopVectorization reorders the loops, so the Julia code doesn’t really reflect the actual computations that are being run.

Actually, wait, I changed to the wrong thing…

Nope, same slowdowns anyway. For some reason, the slowdowns look like this:

  1. Everything starts running as fast as the nothing case, nice and smooth
  2. At some point, it sort of trips up and the progress bar starts "struggling", if you pardon my humanization of computers.
  3. The ETA starts slowly increasing
  4. This continues for about 60% of the progress bar, then the ETA starts decreasing
  5. …aaaand it’s done, after like a minute or two

As do I. So I’d assume the good case can access evidences[k] in parallel and the bad case has to synchronize…

Which has to synchronize access…

They’re literally the same for step_M!, though: it accepts any MaybeReg, and the loop doesn’t depend on this parameter at all. Or, shouldn’t depend.

Let me just remove all differences between the two cases and have all functions run the same code regardless of reg…

EDIT: everything works fine if I ignore regularization:

finalize_variances!(var::AV{Float64}, norm::AV{Float64}, reg::Nothing) = @turbo var .= var ./ norm .+ 1e-6

finalize_variances!(var::AV{Float64}, norm::AV{Float64}, reg::InvGamma) =
    finalize_variances!(var, norm, nothing)

Makes total sense, I guess?

Here is a result of replacing @tturbo with @turbo on my 6 cores (12 threads):

This indicates to me that the parallelization doesn’t scale well in our case, which in turn could mean we’ve hit an edge case for @tturbo.

Sorry, you lost me here…

I mean, I removed all differences between the nothing case and the InvGamma case, and the timings look the same, which makes sense since the two cases became identical.


Tried @turbo (1 CPU core) instead of @tturbo:

  1. Run 1: 00:48 vs 00:49
  2. Run 2: 00:49 vs 01:17
  3. Run 3: 00:48 vs 03:24 (!)

CPU usage was around 100% out of 400% (because I have 4 CPU cores total) and never dropped while running the code.

Confirmed, off to profile @turbo tomorrow ;)

Did you try fixing the random seed?

import Random
Random.seed!(123)
# Generate data

It appears that the timings are input-dependent.


I tried that and couldn’t reproduce the issue after fixing the seed. However, I’m trying to check whether my code produces correct results, so I’m running it on random data on purpose. What’s the point of a parameter estimation algorithm if it only works with one particular data set?

I wanted to output the random seed that caused the slowdown, but couldn’t find a way of extracting it. Random.seed!() sets the seed, but apparently there’s no obvious way of getting the current seed. Why is there no Random.seed() that returns the current seed?

On Julia 1.7:

  • From here: random - Retrieve RNG seed in julia - Stack Overflow
    • Base.Random.RANDOM_SEED throws UndefVarError: Random not defined
    • After import Random: Random.RANDOM_SEED throws UndefVarError: RANDOM_SEED not defined
    • Random.GLOBAL_RNG.seed says type _GLOBAL_RNG has no field seed
    • Random.default_rng().seed says type TaskLocalRNG has no field seed
    • Random.GLOBAL_SEED isn’t updated after I call rand(), so I guess there’s more random state behind the scenes…
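So the best workaround seems to be picking the seed myself and logging it before seeding (a sketch using Random.RandomDevice):

import Random

# Draw a seed from the OS entropy source, record it, then use it,
# so a run that triggers the slowdown can be reproduced later.
seed = rand(Random.RandomDevice(), UInt64)
@info "Using seed" seed
Random.seed!(seed)
# Generate data...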

You need to save the seed when you set it. You cannot obtain it afterwards. If the slowdown happens often you should be able to find a seed that produces a slowdown by just trying a few seeds. For me, seed!(1) did produce a slowdown, while seed!(2) did not. I agree it is good to test on random data, but I hope this would help locate the problem.
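Something like this loop can scan for a seed that reproduces it (run_fit is a placeholder for whatever wraps the fitting code):

import Random

for seed in 1:20
    Random.seed!(seed)
    t = @elapsed run_fit()   # hypothetical entry point into the EM code
    println("seed = $seed: $(round(t; digits=2)) s")
end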


Indeed, Random.seed!(1) results in 00:45 for the nothing case and 01:21 for the InvGamma case. Another run shows 00:45 and 01:20.

BTW, this happens even without any involvement of LoopVectorization. I deleted all mentions of @turbo and friends, commented out using LoopVectorization, and still got 01:36 vs 02:26 - almost a 1 min difference!


Updated version w/o LoopVectorization and lots of @asserts to ensure that everything is of the correct shape: https://gist.github.com/ForceBru/8e7f6940c3013e15f62a6c660a618c85#file-version_2_noloopvec-jl

Some timings:

  1. 01:30 (asserts slow things down) vs 02:20
  2. 01:29 vs 02:20
  3. 01:31 vs 02:20

I noticed the following: the slowdown starts when var assumes a specific value:

Progress:  70%|████████████████████████████████▏             |  ETA: 0:00:02
(i, gmm.var) = (142, [0.2454790079230406, 0.16000283547928387])
(i, gmm.var) = (143, [0.2393365282721729, 0.15443620866888444])
(i, gmm.var) = (144, [0.24284254899660437, 0.1619336945853131])
Progress:  72%|█████████████████████████████████▊            |  ETA: 0:00:02
(i, gmm.var) = (145, [0.29729978746329194, 1.0])
(i, gmm.var) = (146, [0.2972146796343939, 1.0])
(i, gmm.var) = (147, [0.2942993215997909, 1.0])
Progress:  73%|██████████████████████████████████▎           |  ETA: 0:00:02
(i, gmm.var) = (148, [0.2929915147487294, 1.0])
Progress:  74%|██████████████████████████████████▋           |  ETA: 0:00:03
(i, gmm.var) = (149, [0.29300358886192024, 1.0])
Progress:  74%|██████████████████████████████████▋

There, it started once the second element of var became 1.0.

Maybe that gives you a hint on what is wrong there.


Wow, indeed, it slows down as soon as there’s a k such that gmm.var[k] == 1.0.

Here I changed the fit! function to output the time it took to run the main loop:

timing = @timed for i ∈ 1:maxiter
    EM.step_E!(gmm.G, gmm.norm, gmm.p, gmm.mu, gmm.var, data, reg)
    EM.step_M!(gmm.G, gmm.norm, gmm.p, gmm.mu, gmm.var, data, reg)

    gmm.ELBO_hist[i] = EM.ELBO(gmm.G, gmm.p, gmm.mu, gmm.var, data, reg)
end

@show (gmm.var, timing.time)

Here’s the output from the InvGamma case:

[ Info: Fitting... (with regularization)
(gmm.var, timing.time) = ([0.12979600519462, 0.22091734277480923], 0.437090557)
(gmm.var, timing.time) = ([0.13081625184968007, 0.2206570705655502], 0.43813817)
(gmm.var, timing.time) = ([0.12994360015250062, 0.2205964244568769], 0.444398958)
(gmm.var, timing.time) = ([0.13205819345301345, 0.22466612385111753], 0.436146891)
(gmm.var, timing.time) = ([0.131606469707795, 0.22649328809623276], 0.439656724)
(gmm.var, timing.time) = ([0.13010806538724098, 0.23327644124826644], 0.437758792)
(gmm.var, timing.time) = ([0.1279562407054046, 0.23367085391900447], 0.437707178)
...
(gmm.var, timing.time) = ([0.24411286011714176, 0.16580655159485652], 0.448527361)
(gmm.var, timing.time) = ([0.2662679258567789, 0.19150053955326957], 0.446357033)
(gmm.var, timing.time) = ([0.2629225941275259, 0.18116854470156368], 0.447261846)
(gmm.var, timing.time) = ([0.26214244460585273, 0.17728904458942218], 0.448244723)
(gmm.var, timing.time) = ([0.26069692251129745, 0.17328849490587808], 0.446762994)
(gmm.var, timing.time) = ([0.29913480492029554, 1.0], 0.434144478)
(gmm.var, timing.time) = ([0.2998579815176012, 1.0], 1.305444)
(gmm.var, timing.time) = ([0.3004979961705274, 1.0], 1.707785367)
(gmm.var, timing.time) = ([0.3017003699016324, 1.0], 1.699075401)
(gmm.var, timing.time) = ([0.30120562536428414, 1.0], 1.688268186)
(gmm.var, timing.time) = ([0.3014536469939427, 1.0], 1.682149487)
(gmm.var, timing.time) = ([0.3044833045360511, 1.0], 1.682936661)
(gmm.var, timing.time) = ([0.30415011758494825, 1.0], 1.718251328)
(gmm.var, timing.time) = ([0.3039567622674634, 1.0], 1.71807702)
(gmm.var, timing.time) = ([0.3040352748957333, 1.0], 1.690801447)
(gmm.var, timing.time) = ([0.3041718322931663, 1.0], 1.684268279)
(gmm.var, timing.time) = ([0.3036123236862732, 1.0], 1.683556643)
(gmm.var, timing.time) = ([0.3018321975190529, 1.0], 1.685763053)
(gmm.var, timing.time) = ([0.3011188102006153, 1.0], 1.689479624)
(gmm.var, timing.time) = ([0.3034159553051776, 1.0], 1.69546642)
(gmm.var, timing.time) = ([0.3062028200798609, 1.0], 1.763214379)
(gmm.var, timing.time) = ([0.3061830104269247, 1.0], 1.915075644)
(gmm.var, timing.time) = ([0.3042153151797508, 1.0], 1.72388072)
(gmm.var, timing.time) = ([0.30873714160720517, 1.0], 1.765052766)
(gmm.var, timing.time) = ([0.3102281299289041, 1.0], 1.819851372)
(gmm.var, timing.time) = ([0.30781217250225834, 1.0], 1.711476321)
(gmm.var, timing.time) = ([0.3084804939882712, 1.0], 1.722908551)
(gmm.var, timing.time) = ([0.3096173941531654, 1.0], 1.749803121)
(gmm.var, timing.time) = ([0.3096099546780613, 1.0], 1.792705749)
(gmm.var, timing.time) = ([0.3094866765138834, 1.0], 1.756672981)
(gmm.var, timing.time) = ([0.30804567032480773, 1.0], 1.841761611)
(gmm.var, timing.time) = ([0.3063997050993591, 1.0], 1.798693229)
(gmm.var, timing.time) = ([0.30621467867612306, 1.0], 1.829438462)
(gmm.var, timing.time) = ([0.3055715213101416, 1.0], 1.732961334)
(gmm.var, timing.time) = ([0.30556030772413, 1.0], 1.732991028)
(gmm.var, timing.time) = ([0.3052487146454984, 1.0], 1.733896484)
(gmm.var, timing.time) = ([0.3045452389182044, 1.0], 1.750909186)
(gmm.var, timing.time) = ([0.3030495221465515, 1.0], 1.718392537)
(gmm.var, timing.time) = ([0.3028247679552457, 1.0], 1.733551232)
(gmm.var, timing.time) = ([0.3028697369563701, 1.0], 1.722950705)
(gmm.var, timing.time) = ([0.3026594808005801, 1.0], 1.725680657)
(gmm.var, timing.time) = ([0.30277800942558725, 1.0], 1.739890083)
(gmm.var, timing.time) = ([0.3039729625602777, 1.0], 1.728341708)
(gmm.var, timing.time) = ([0.3000833827259703, 1.0], 1.711552785)
(gmm.var, timing.time) = ([0.2998545902702496, 1.0], 1.740374686)
(gmm.var, timing.time) = ([0.2985882769517061, 1.0], 1.728837756)
(gmm.var, timing.time) = ([0.3004974524101312, 1.0], 1.725061656)
(gmm.var, timing.time) = ([0.30018938852066274, 1.0], 1.731628271)
(gmm.var, timing.time) = ([0.30075313157351014, 1.0], 1.761483254)

So, after some gmm.var becomes equal to 1.0, the main for i ∈ 1:maxiter loop slows down from 0.44 seconds to 1.7 seconds for the same 10_000 iterations!

WTF, though? Numbers are numbers - who cares if it’s a 1.0 or not?! My code most definitely doesn’t ever check the values of gmm.var and treats all of them equally…


I tried to manually set all variances to 1.0 in the code (var .= 1.0 at the end of M_variances!), but that didn’t cause an immediate slowdown. Let me check parameters other than var…

One of the gmm.ps becomes very close to zero as soon as one of gmm.vars becomes one:

(gmm.p, gmm.mu, gmm.var, timing.time) = ([0.8741740560593829, 0.12582594394061702], [-0.12445426039294907, 0.5249887923292634], [0.2662679258567789, 0.19150053955326957], 0.44847282)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([0.8648908241877292, 0.13510917581227083], [-0.13544645815557219, 0.5379089378899188], [0.2629225941275259, 0.18116854470156368], 0.450924207)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([0.865731986380992, 0.13426801361900803], [-0.13610413303321459, 0.5472323395395478], [0.26214244460585273, 0.17728904458942218], 0.475586434)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([0.8652912440657172, 0.13470875593428275], [-0.1381471970307616, 0.5559140481961833], [0.26069692251129745, 0.17328849490587808], 0.469968482)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 1.2061982368350508e-211], [-0.04850820421805095, 0.006681728278170384], [0.29913480492029554, 1.0], 0.440431988)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.04626372621010133, 0.03450195760521329], [0.2998579815176012, 1.0], 1.330318503)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.04699553272949396, 0.03458113888553481], [0.3004979961705274, 1.0], 1.692210473)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.045486202307925855, 0.03131468434729816], [0.3017003699016324, 1.0], 1.702397646)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.04326066544042423, 0.027498802822819322], [0.30120562536428414, 1.0], 1.709937289)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.044049095734514813, 0.029713008034894354], [0.3014536469939427, 1.0], 1.687663125)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.4e-323], [-0.0381826364743794, 0.023526517815407426], [0.3044833045360511, 1.0], 1.72919904)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03450874044454262, -0.10279512414347246], [0.30415011758494825, 1.0], 1.698203259)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03227144734221198, -0.10583507664550244], [0.3039567622674634, 1.0], 1.698433758)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03049081412388554, -0.10892458265410389], [0.3040352748957333, 1.0], 1.791548164)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.030614532277827552, -0.10904109434424808], [0.3041718322931663, 1.0], 1.698832146)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.02850837278400225, -0.11572266780306648], [0.3036123236862732, 1.0], 1.687605552)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.027267131004442623, -0.11711481647082765], [0.3018321975190529, 1.0], 1.723543712)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.02575499747282709, -0.12420180532082362], [0.3011188102006153, 1.0], 1.698879028)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03014875432063738, -0.11262216693610323], [0.3034159553051776, 1.0], 1.688220046)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.0320080543425126, -0.11369891044187358], [0.3062028200798609, 1.0], 1.721568633)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03184012338729602, -0.11359358789156354], [0.3061830104269247, 1.0], 1.6933472)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.02829610826409637, -0.12348337442514049], [0.3042153151797508, 1.0], 1.689155822)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.02521647442410843, -0.10903874464916333], [0.30873714160720517, 1.0], 1.726852347)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.027064774974067913, -0.10587154949003763], [0.3102281299289041, 1.0], 1.699564989)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.02916858144834629, -0.10441993531115061], [0.30781217250225834, 1.0], 1.689730687)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.030086767918777024, -0.1032263136922124], [0.3084804939882712, 1.0], 1.721201442)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03275468341662056, -0.09736270656082882], [0.3096173941531654, 1.0], 1.72068594)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.032912918039889286, -0.09746219043155313], [0.3096099546780613, 1.0], 1.877142411)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03103516894584259, -0.09818507031722594], [0.3094866765138834, 1.0], 1.75439523)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03232002393893344, -0.10033161587270507], [0.30804567032480773, 1.0], 1.68914445)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.034857371398213495, -0.09420142395036593], [0.3063997050993591, 1.0], 1.72629083)
(gmm.p, gmm.mu, gmm.var, timing.time) = ([1.0, 5.0e-323], [-0.03499666838440872, -0.09416240415179211], [0.30621467867612306, 1.0], 1.693519697)

Some numbers are more difficult than others - for example, the 5.0e-323 you observed. That number is so small that dividing it by a couple more powers of ten underflows to exactly zero. If I understand correctly, such small numbers are called subnormal, and they are slower to compute with.
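You can check this directly in the REPL (issubnormal and floatmin are in Base):

julia> x = 5.0e-323
5.0e-323

julia> issubnormal(x)    # far below floatmin(Float64), the smallest normal value
true

julia> floatmin(Float64)
2.2250738585072014e-308

julia> x / 100           # a couple more orders of magnitude and it underflows to 0
0.0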


Looks like this is the problem. In particular, calculating logarithms is very slow for such small numbers:

julia> mix.G
2Γ—300 Matrix{Float64}:
 2.0e-323  1.5e-323  3.0e-323  1.5e-323  1.0e-323  1.5e-323  1.5e-323  1.5e-323  1.5e-323  2.5e-323  1.5e-323  …  1.5e-323  1.5e-323  1.5e-323  1.5e-323  5.0e-323  1.5e-323  2.0e-323  2.0e-323  1.5e-323  2.5e-323
 1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0          1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0       1.0

So the first row contains these tiny numbers, while the second one is fine. Computing logarithms of these rows shows a 4.6 times slowdown:

julia> @benchmark $log.($mix.G[1, :])
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  20.047 μs … 63.839 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.318 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.694 μs ±  1.546 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▅ ▃▃▂     ▄▃▁                                            ▂
  ████████▄▁▃▇█████▇▆▆▅▆▄▆▇▆▅▅▅▆▅▁▃▁▁▃▁▁▁▃▁▃▁▃▁▁▁▁▁▁▁▃▁▁▅▆▇▆▆ █
  20 μs        Histogram: log(frequency) by time      29.2 μs <

 Memory estimate: 5.00 KiB, allocs estimate: 2.

julia> @benchmark $log.($mix.G[2, :])
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.481 μs …  3.587 ms  ┊ GC (min … max):  0.00% … 99.73%
 Time  (median):     4.348 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.097 μs ± 66.041 μs  ┊ GC (mean ± σ):  21.60% ±  1.99%

  ▄▃▃█▇▆▄ ▂▆▂▂▂             ▃▆▇▆▆▄▃▂▁▁▂▂▃▂▂▁ ▁               ▂
  █████████████▆▄▆▅▄▆▇▇▇▇▇▇▇████████████████████▇▇▆▆▆▇▇▆▇▆▇ █
  3.48 μs      Histogram: log(frequency) by time     7.51 μs <

 Memory estimate: 5.00 KiB, allocs estimate: 2.

julia> 20/4.348
4.599816007359705

On the other hand, I’m actually computing log.(gmm.G .+ 1e-100), and benchmarking that doesn’t show any differences between the two rows…
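Which makes sense, I think: adding 1e-100 completely swamps the subnormals, so log never actually sees one:

julia> 1.5e-323 + 1e-100   # the subnormal disappears into 1e-100
1.0e-100

julia> issubnormal(1.5e-323 + 1e-100)
false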


Division is also really slow:

julia> @benchmark $mix.G[2, :] ./ $mix.norm
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.327 μs …  2.752 ms  ┊ GC (min … max):  0.00% … 99.81%
 Time  (median):     1.631 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.833 μs ± 48.623 μs  ┊ GC (mean ± σ):  34.15% ±  2.00%

  ▄▅▃▁▃██▅▄▄▁                          ▁▂▂▂▂▂▂▁              ▂
  ████████████▇▆▄▄▄▄▅▁▄▄▇▇▇▇▇▆▄▃▃▃▅▅▃▅████████████▇▇▇▇▇▆▇▇▆▇ █
  1.33 μs      Histogram: log(frequency) by time     4.24 μs <

 Memory estimate: 5.03 KiB, allocs estimate: 3.

julia> @benchmark $mix.G[1, :] ./ $mix.norm
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.394 μs …  4.395 ms  ┊ GC (min … max):  0.00% … 99.69%
 Time  (median):     6.740 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   7.908 μs ± 61.086 μs  ┊ GC (mean ± σ):  10.91% ±  1.41%

  ▂▄▂▁▇█▄▃▅▃ ▂▂▁ ▃▃             ▁▂▂▂▂▂▁▁▁▁                   ▂
  ██████████▆███████▃▄▆▃▃▅▄▃▁▄▃▇█████████████▇▇█▇▅▃▅▆▇▅▅▆▅▅▅ █
  6.39 μs      Histogram: log(frequency) by time     10.1 μs <

 Memory estimate: 5.03 KiB, allocs estimate: 3.

I’m dividing here:

function finalize_posteriors!(G::AM{Float64}, norm::AV{Float64}, reg::Nothing)
    _K, N = size(G)
    for n ∈ 1:N, k ∈ 1:_K
        G[k, n] /= norm[n]
    end
end
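If subnormals in G are the culprit, one possible mitigation (just a sketch, not necessarily the right modeling choice) is to flush anything below the normal range to zero right here, so later logs and divisions never touch a subnormal:

function finalize_posteriors!(G::AM{Float64}, norm::AV{Float64}, reg::Nothing)
    _K, N = size(G)
    for n ∈ 1:N, k ∈ 1:_K
        g = G[k, n] / norm[n]
        # Flush-to-zero by hand: anything below the smallest
        # normal Float64 (floatmin) is treated as exactly 0.0.
        G[k, n] = ifelse(g < floatmin(Float64), 0.0, g)
    end
end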

Yes, exactly: you have a combination of problems. First, the operations can be much slower for certain kinds of numbers; second, you effectively converged to exactly those numbers for which the operations are slow.

I think you should, in a practical scenario, detect some sort of convergence and stop.
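For example, a simple relative-change stopping rule on the ELBO (a sketch reusing the names from the loop above):

for i ∈ 1:maxiter
    EM.step_E!(gmm.G, gmm.norm, gmm.p, gmm.mu, gmm.var, data, reg)
    EM.step_M!(gmm.G, gmm.norm, gmm.p, gmm.mu, gmm.var, data, reg)
    gmm.ELBO_hist[i] = EM.ELBO(gmm.G, gmm.p, gmm.mu, gmm.var, data, reg)

    # Stop once the ELBO has effectively stopped improving.
    if i > 1 && abs(gmm.ELBO_hist[i] - gmm.ELBO_hist[i-1]) <= 1e-8 * abs(gmm.ELBO_hist[i])
        break
    end
end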

What a rabbit hole! Well, looks like the issue has finally been found. @lmiq, @goerch and @mikkoku, thank you very much for debugging this, I really appreciate it!


See also set_zero_subnormals:

julia> z = fill(1.5e-323, 1024);

julia> @btime log.($z);
  6.700 μs (1 allocation: 8.12 KiB)

julia> set_zero_subnormals(true)
true

julia> @btime log.($z);
  1.830 μs (1 allocation: 8.12 KiB)
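Note that set_zero_subnormals(true) trades strict IEEE semantics for speed and is per-thread CPU state, so it’s worth scoping it to the hot region (a sketch; fit! stands in for the actual entry point):

old = get_zero_subnormals()
set_zero_subnormals(true)       # flush subnormals to zero on this thread
try
    fit!(gmm, data; maxiter=10_000)
finally
    set_zero_subnormals(old)    # restore the previous mode
end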