Fitting a 1D distribution using Gaussian Mixtures

I'm struggling to fit my data using Gaussian mixtures in Julia. I found the package https://github.com/davidavdav/GaussianMixtures.jl, which seems to work similarly to the scikit-learn API I used before, but the actual fit gets stuck and every mixture component ends up with the same parameters. I thought this would be an easy starting exercise, but apparently I'm missing something.

Here is what I have:

using GaussianMixtures
using PGFPlotsX
using StatsBase

gmm = GMM(2, 1)  # two components, one dimension

# dummy data
d₁ = randn(20000, 1) .- 5
d₂ = randn(100000, 1) .+ 23
d₊ = vcat(d₁, d₂)  # 120000×1 Array{Float64,2}

This is what the distribution looks like:

h = fit(Histogram, d₊[:,1], nbins=40)
fig = @pgf Axis(Plot({"ybar interval"}, Coordinates(h.edges[1][2:end], h.weights)))

[plot: histogram of the distribution]

Running the fit should be easy:

result = em!(gmm, d₊)

however, this spits out the following log, indicating that the optimisation gets stuck from the second iteration onwards:

┌ Info: Running 10 iterations EM on diag cov GMM with 2 Gaussians in 1 dimensions
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/train.jl:242
┌ Info: iteration 1, average log likelihood -223.922195
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 2, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 3, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 4, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 5, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 6, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 7, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 8, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 9, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 10, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: EM with 120000 data points 10 iterations avll -3.768908
│ 24000.0 data points per parameter
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71

Looking at the μ parameter, both Gaussian components have the same value:

> gmm.μ
2×1 Array{Float64,2}:
 18.332104281524813
 18.332104281524813

The same holds if I try with more components…

So what is going wrong here?

Must be something in your setup; my output looks like this:
julia> result = em!(gmm, d₊)
[ Info: Running 10 iterations EM on diag cov GMM with 2 Gaussians in 1 dimensions
[ Info: iteration 1, average log likelihood -223.973397
[ Info: iteration 2, average log likelihood -3.768754
[ Info: iteration 3, average log likelihood -3.768754
[ Info: iteration 4, average log likelihood -3.768754
[ Info: iteration 5, average log likelihood -3.768754
[ Info: iteration 6, average log likelihood -3.768754
[ Info: iteration 7, average log likelihood -3.768754
[ Info: iteration 8, average log likelihood -3.768754
[ Info: iteration 9, average log likelihood -3.768754
[ Info: iteration 10, average log likelihood -3.768754
┌ Info: EM with 120000 data points 10 iterations avll -3.768754
└ 24000.0 data points per parameter
10-element Array{Float64,1}:
-223.97339691979332
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004

Uhm, what do you mean? The output seems the same as mine… and I guess gmm.μ is also the same.

What I mean is that on my PC I don't get that log output. I thought that was your question / problem.

and yes
julia> gmm.μ
2×1 Array{Float64,2}:
18.335820572982005
18.335820572982005

Small deviations are fine :wink:

I am not sure what else I am doing wrong. I also tried more components with:

d₁ = randn(20000, 1) .- 5
d₂ = randn(100000, 1) .+ 23
d₃ = randn(10000, 1) .- 42
d₄ = randn(150000, 1) .+ 69
d₊ = vcat(d₁, d₂, d₃, d₄);

and get the same behaviour:

> gmm.μ
5×1 Array{Float64,2}:
 43.32143763742449
 43.32143763742449
 43.32143763742449
 43.32143763742449
 43.32143764486415

Anyways, it seems that I simply don’t understand the API.

This one works (if I do it in one-shot):

gmm = GMM(4, d₊)

Which yields

4×1 Array{Float64,2}:
  68.9987907346628 
  22.99872660258784
  -4.98850081738524
 -41.99187152150204

The problem is still that the sigmas and weights do not match…

This looks like a pretty significant bug in the Gaussian Mixtures package, but you should have an easy workaround.

From the Git page on how you initialized your model:

Initialize a GMM with n multivariate Gaussians of dimension d . The means are all set to 0 (the origin) and the variances to I , which is silly by itself.

This is problematic: during the E-step, since each component has the exact same mean/variance, every single row will have the exact same probability of coming from each component…which means during the M-step, it will have the exact same estimate for each component parameter…which means during the next E-step, every row will have the same probability of being from each component, etc.
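This symmetry can be demonstrated in a few lines of plain Julia (with a hand-written normal pdf and hypothetical data values, just for illustration):

```julia
# Tiny demonstration of the symmetry problem in EM with identical initialization.
φ(x, μ, σ) = exp(-(x - μ)^2 / (2σ^2)) / (σ * sqrt(2π))  # normal pdf

x = [-4.8, -5.2, 22.9, 23.1]                     # a few data points
μs, σs, w = [0.0, 0.0], [1.0, 1.0], [0.5, 0.5]   # identical initialization

# E-step: responsibility of component 1 for each point
num1 = w[1] .* φ.(x, μs[1], σs[1])
num2 = w[2] .* φ.(x, μs[2], σs[2])
γ1 = num1 ./ (num1 .+ num2)                      # exactly 0.5 for every point

# M-step: both component means therefore get the same update
μ1_new = sum(γ1 .* x) / sum(γ1)
μ2_new = sum((1 .- γ1) .* x) / sum(1 .- γ1)
@assert all(γ1 .≈ 0.5) && μ1_new ≈ μ2_new        # the symmetry never breaks
```

Since the responsibilities are identical, each M-step reproduces the same parameters for both components, and the iteration is a fixed point.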

Solution:

This has a very simple solution: just randomly jitter the mean parameters before you start fitting the model, and you will get separation of the data by component, and the EM algorithm should actually work.

This should be included in the initialization of the model, but until that’s fixed you can just do it manually. I have no idea how the structure of the package works, so I don’t know how to do this, but I assume it can’t be that hard.
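Something like the following might do it (a sketch, assuming the μ field is a plain mutable array, as the printed 2×1 Array{Float64,2} above suggests):

```julia
using GaussianMixtures

gmm = GMM(2, 1)                # means all 0, variances all 1 — the degenerate start
gmm.μ .+= randn(size(gmm.μ))   # jitter the means to break the symmetry
em!(gmm, d₊)                   # EM can now separate the components
```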

Makes perfect sense, thanks!

Do you happen to know how to get the expectation value for a given X? I only managed to draw random samples with rand(gmm, ...), but when converting to a MixtureModel with mm = MixtureModel(gmm) and then calling pdf(mm, x), I get values which are not well weighted. It seems that some binning/normalisation is lost during the conversion.

However, I should probably open an issue :wink:
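One possible explanation, assuming the mismatch is between raw counts and a normalized density: pdf integrates to 1, while the Histogram weights are counts, so an overlay needs scaling by (number of points × bin width). A sketch, reusing mm, h, and d₊ from above:

```julia
using Distributions, StatsBase

centers = midpoints(h.edges[1])          # bin centers of the data histogram
expected = length(d₊) * step(h.edges[1]) .* pdf.(mm, centers)
# `expected` should now be on the same scale as h.weights
```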

If you mean in general — yes, there is a closed-form solution for the mean of a Gaussian mixture model:

\sum_{i = 1}^k p_i \mu_i

If you mean how one extracts that from GaussianMixtures.jl, that I have no idea; I've never used it before.
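The formula above is a one-liner in Julia (with hypothetical weights and means for illustration; for a fitted model the weights and means would presumably come out of the gmm struct, e.g. its w and μ fields):

```julia
# Mixture mean: Σᵢ pᵢ μᵢ
p = [0.25, 0.75]            # mixture weights (must sum to 1)
μ = [-5.0, 23.0]            # component means
mixture_mean = sum(p .* μ)  # 16.0
```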

Yeah I meant the second one. Alright, time for an issue. Thanks!

Btw. do you know any other Julia package which can do GMM fitting?

You would need to work in a Bayesian framework but Turing could do that easily for you, see here a tutorial for that:
Turing GMM

Wow that looks quite nice. I’ll try the tutorial and translate it to my 1D problem…
In the end, all I need are the parameters of the multiple Gaussians that describe the distribution of the data.
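For a 1D problem, a Turing model might look roughly like this (a hypothetical sketch, not taken from the linked tutorial; priors, K, and the exact Turing API are assumptions and may need adjusting):

```julia
using Turing, Distributions

@model function gmm1d(x, K)
    μ ~ filldist(Normal(0, 50), K)                      # component means
    σ ~ filldist(truncated(Normal(0, 10); lower=0), K)  # component stddevs
    w ~ Dirichlet(K, 1.0)                               # mixture weights
    for i in eachindex(x)
        x[i] ~ MixtureModel(Normal.(μ, σ), w)           # marginalized mixture likelihood
    end
end

chain = sample(gmm1d(vec(d₊), 2), NUTS(), 1_000)
```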


For the sake of completeness, here is the fit and the comparison of the fit to the data using GaussianMixtures.jl:

using GaussianMixtures
using PGFPlotsX
using StatsBase
using Random
Random.seed!(23);

d₁ = randn(20000, 1) .- 5
d₂ = randn(100000, 1) .+ 23
d₊ = vcat(d₁, d₂);

n = length(d₊)

gmm = GMM(2, d₊)

n_bins = 100

h = fit(Histogram, d₊[:,1], nbins=n_bins)
h₀ = fit(Histogram, rand(gmm, length(d₊))[:], nbins=n_bins)

fig = @pgf Axis(
    Plot({"ybar interval"},
        Coordinates(h.edges[1][2:end], h.weights)
    ),
    Plot({color="red"},
        Coordinates(h₀.edges[1][2:end].+(h₀.edges[1].step.hi/2), h₀.weights)
    )
)

[plot: fitted mixture (red) overlaid on the data histogram]


@tamasgal were you able to try @pistacliffcho’s solution for fitting the GMM? Did it eventually work?

Agreed that it's bizarre that all parameters are initialized to be identical. Random initialization works; I've also seen means initialized to random samples from the training data, or initialized with k-means / Lloyd's algorithm.

Yes, you can see in my example just above your post how I randomly set the initial values, it works fine!

Edit: ah, my mistake, I did not include that in my example, but I remember that random starting values worked :wink:
