Fitting a 1D distribution using Gaussian Mixtures

I'm struggling to fit my data using Gaussian mixtures in Julia. I found the package https://github.com/davidavdav/GaussianMixtures.jl, which seems to work similarly to the scikit-learn API I used before, but the actual fit gets stuck and every mixture component ends up with the same parameters. I thought this would be an easy starting exercise, but apparently I'm missing something.

Here is what I have:

using GaussianMixtures
using PGFPlotsX
using StatsBase

gmm = GMM(2, 1)  # two components, one dimension

# dummy data
d₁ = randn(20000, 1) .- 5
d₂ = randn(100000, 1) .+ 23
d₊ = vcat(d₁, d₂)  # 120000×1 Array{Float64,2}

This is what the distribution looks like:

h = fit(Histogram, d₊[:,1], nbins=40)
fig = @pgf Axis(Plot({"ybar interval"}, Coordinates(h.edges[1][2:end], h.weights)))

[plot: histogram of the distribution]

Running the fit should be easy:

result = em!(gmm, d₊)

however, this spits out the following log, indicating that the optimisation gets stuck from the second iteration onwards:

┌ Info: Running 10 iterations EM on diag cov GMM with 2 Gaussians in 1 dimensions
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/train.jl:242
┌ Info: iteration 1, average log likelihood -223.922195
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 2, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 3, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 4, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 5, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 6, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 7, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 8, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 9, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: iteration 10, average log likelihood -3.768908
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71
┌ Info: EM with 120000 data points 10 iterations avll -3.768908
│ 24000.0 data points per parameter
└ @ GaussianMixtures /home/tgal/.julia/packages/GaussianMixtures/RGtTJ/src/gmms.jl:71

Looking at the μ parameter, both Gaussian components have the same value:

> gmm.μ
2×1 Array{Float64,2}:
 18.332104281524813
 18.332104281524813

The same holds if I try with more components…

So what is going wrong here?

Must be something in your setup; my output looks like this:
julia> result = em!(gmm, d₊)
[ Info: Running 10 iterations EM on diag cov GMM with 2 Gaussians in 1 dimensions
[ Info: iteration 1, average log likelihood -223.973397
[ Info: iteration 2, average log likelihood -3.768754
[ Info: iteration 3, average log likelihood -3.768754
[ Info: iteration 4, average log likelihood -3.768754
[ Info: iteration 5, average log likelihood -3.768754
[ Info: iteration 6, average log likelihood -3.768754
[ Info: iteration 7, average log likelihood -3.768754
[ Info: iteration 8, average log likelihood -3.768754
[ Info: iteration 9, average log likelihood -3.768754
[ Info: iteration 10, average log likelihood -3.768754
┌ Info: EM with 120000 data points 10 iterations avll -3.768754
└ 24000.0 data points per parameter
10-element Array{Float64,1}:
-223.97339691979332
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004
-3.768753993438004

Uhm, what do you mean? The output seems the same as mine… and I guess gmm.μ is also the same.

What I mean is that on my PC I don't get that log output. I thought that was your question / problem.

and yes
julia> gmm.μ
2×1 Array{Float64,2}:
18.335820572982005
18.335820572982005

Small deviations are fine :wink:

I am not sure what else I am doing wrong. I also tried more components with:

d₁ = randn(20000, 1) .- 5
d₂ = randn(100000, 1) .+ 23
d₃ = randn(10000, 1) .- 42
d₄ = randn(150000, 1) .+ 69
d₊ = vcat(d₁, d₂, d₃, d₄);

and get the same behaviour:

> gmm.μ
5×1 Array{Float64,2}:
 43.32143763742449
 43.32143763742449
 43.32143763742449
 43.32143763742449
 43.32143764486415

Anyways, it seems that I simply don’t understand the API.

This one works (if I do it in one-shot):

gmm = GMM(4, d₊)

Which yields

4×1 Array{Float64,2}:
  68.9987907346628 
  22.99872660258784
  -4.98850081738524
 -41.99187152150204

The problem is still that the sigmas and weights do not match…

This looks like a pretty significant bug in the Gaussian Mixtures package, but you should have an easy workaround.

From the Git page on how you initialized your model:

Initialize a GMM with n multivariate Gaussians of dimension d . The means are all set to 0 (the origin) and the variances to I , which is silly by itself.

This is problematic: during the E-step, since each component has the exact same mean/variance, every single row will have the exact same probability of coming from each component…which means during the M-step, it will have the exact same estimate for each component parameter…which means during the next E-step, every row will have the same probability of being from each component, etc.
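This symmetry can be demonstrated in a few lines of plain Julia (with a hand-written normal pdf and hypothetical data values, just for illustration):

```julia
# Tiny demonstration of the symmetry problem in EM with identical initialization.
φ(x, μ, σ) = exp(-(x - μ)^2 / (2σ^2)) / (σ * sqrt(2π))  # normal pdf

x = [-4.8, -5.2, 22.9, 23.1]                     # a few data points
μs, σs, w = [0.0, 0.0], [1.0, 1.0], [0.5, 0.5]   # identical initialization

# E-step: responsibility of component 1 for each point
num1 = w[1] .* φ.(x, μs[1], σs[1])
num2 = w[2] .* φ.(x, μs[2], σs[2])
γ1 = num1 ./ (num1 .+ num2)                      # exactly 0.5 for every point

# M-step: both component means therefore get the same update
μ1_new = sum(γ1 .* x) / sum(γ1)
μ2_new = sum((1 .- γ1) .* x) / sum(1 .- γ1)
@assert all(γ1 .≈ 0.5) && μ1_new ≈ μ2_new        # the symmetry never breaks
```

Since the responsibilities are identical, each M-step reproduces the same parameters for both components, and the iteration is a fixed point.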

Solution:

This has a very simple solution: just randomly jitter the mean parameters before you start fitting the model, and you will get separation of the data by component, and the EM algorithm should actually work.

This should be included in the initialization of the model, but until that’s fixed you can just do it manually. I have no idea how the structure of the package works, so I don’t know how to do this, but I assume it can’t be that hard.
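Something like the following might do it (a sketch, assuming the μ field is a plain mutable array, as the printed 2×1 Array{Float64,2} above suggests):

```julia
using GaussianMixtures

gmm = GMM(2, 1)                # means all 0, variances all 1 — the degenerate start
gmm.μ .+= randn(size(gmm.μ))   # jitter the means to break the symmetry
em!(gmm, d₊)                   # EM can now separate the components
```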

Makes perfect sense, thanks!

Do you happen to know how to get the expectation value for a given X? I only managed to draw random samples with rand(gmm, ...), but when converting to a MixtureModel with mm = MixtureModel(gmm) and then calling pdf(mm, x), I get values which are not well weighted. It seems that some binning/normalisation is lost during the conversion.

However, I should probably open an issue :wink:
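One possible explanation, assuming the mismatch is between raw counts and a normalized density: pdf integrates to 1, while the Histogram weights are counts, so an overlay needs scaling by (number of points × bin width). A sketch, reusing mm, h, and d₊ from above:

```julia
using Distributions, StatsBase

centers = midpoints(h.edges[1])          # bin centers of the data histogram
expected = length(d₊) * step(h.edges[1]) .* pdf.(mm, centers)
# `expected` should now be on the same scale as h.weights
```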

If you mean in general — yes, there is a closed-form solution for the mean of a Gaussian mixture model:

\sum_{i = 1}^k p_i \mu_i

If you mean how one extracts that from GaussianMixtures.jl, that I have no idea; I've never used it before.
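The formula above is a one-liner in Julia (with hypothetical weights and means for illustration; for a fitted model the weights and means would presumably come out of the gmm struct, e.g. its w and μ fields):

```julia
# Mixture mean: Σᵢ pᵢ μᵢ
p = [0.25, 0.75]            # mixture weights (must sum to 1)
μ = [-5.0, 23.0]            # component means
mixture_mean = sum(p .* μ)  # 16.0
```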

Yeah I meant the second one. Alright, time for an issue. Thanks!

Btw. do you know any other Julia package which can do GMM fitting?

You would need to work in a Bayesian framework but Turing could do that easily for you, see here a tutorial for that:
Turing GMM

Wow that looks quite nice. I’ll try the tutorial and translate it to my 1D problem…
In the end, all I need are the parameters of the multiple Gaussians that describe the distribution of the data.
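For a 1D problem, a Turing model might look roughly like this (a hypothetical sketch, not taken from the linked tutorial; priors, K, and the exact Turing API are assumptions and may need adjusting):

```julia
using Turing, Distributions

@model function gmm1d(x, K)
    μ ~ filldist(Normal(0, 50), K)                      # component means
    σ ~ filldist(truncated(Normal(0, 10); lower=0), K)  # component stddevs
    w ~ Dirichlet(K, 1.0)                               # mixture weights
    for i in eachindex(x)
        x[i] ~ MixtureModel(Normal.(μ, σ), w)           # marginalized mixture likelihood
    end
end

chain = sample(gmm1d(vec(d₊), 2), NUTS(), 1_000)
```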


For the sake of completeness, here is the fit and the comparison of the fit to the data using GaussianMixtures.jl:

using GaussianMixtures
using PGFPlotsX
using StatsBase
using Random
Random.seed!(23);

d₁ = randn(20000, 1) .- 5
d₂ = randn(100000, 1) .+ 23
d₊ = vcat(d₁, d₂);

n = length(d₊)

gmm = GMM(2, d₊)

n_bins = 100

h = fit(Histogram, d₊[:,1], nbins=n_bins)
h₀ = fit(Histogram, rand(gmm, length(d₊))[:], nbins=n_bins)

fig = @pgf Axis(
    Plot({"ybar interval"},
        Coordinates(h.edges[1][2:end], h.weights)
    ),
    Plot({color="red"},
        Coordinates(h₀.edges[1][2:end].+(h₀.edges[1].step.hi/2), h₀.weights)
    )
)

[plot: fitted mixture (red) overlaid on the data histogram]


@tamasgal were you able to try @pistacliffcho’s solution for fitting the GMM? Did it eventually work?

Agreed that it's bizarre that all parameters are initialized to be identical. Random initialization works; I've also seen means initialized to random samples from the training data, or initialized with k-means / Lloyd's algorithm.

Yes, you can see in my example just above your post how I randomly set the initial values, it works fine!

Edit: ah, my mistake, I did not include that in my example, but I remember that random starting values worked :wink:
