GaussianMixtures question

SergeantMike67 · May 20, 2021, 3:59pm

I have a dataset of CpG observed-over-expected ratios (an estimate of cytosine methylation in the genome), and I am trying to fit a two-component, 1D Gaussian mixture model to it using GaussianMixtures. Example input data below:

740736×2 Matrix{Any}:
"RDRX01000001" 0.251146
"RDRX01000002" 0.279088
"RDRX01000003" 0.428743
"RDRX01000004" 0.327771
"RDRX01000005" 0.350942
"RDRX01000006" 0.90064
"RDRX01000007" 0.267866
"RDRX01000008" 0.360507
"RDRX01000009" 0.600434
⋮
"RDRX01799874" 1.40488
"RDRX01799875" 1.65668
"RDRX01799876" 1.66154
"RDRX01799877" 1.04891
"RDRX01799878" 0.987179
"RDRX01799879" 1.29231
"RDRX01799880" 1.72998
"RDRX01799881" 1.08387

The histogram looks like:

I constructed the gmm with:
g = GMM(2,1,kind=:full)

and received this output, which I take as a success
GMM{Float64} with 2 components in 1 dimensions and full covariance
Mix 1: weight 0.500000
mean: [0.0]
covariance: 1×1 Matrix{Float64}:
1.0
Mix 2: weight 0.500000
mean: [0.0]
covariance: 1×1 Matrix{Float64}:
1.0

However, when I run the training function:
em!(g,x1)

where x1 is the matrix of the above data

I get the following error:
ERROR: Inconsistent size gmm and x
Stacktrace:
[1] error(s::String)
@ Base .\error.jl:33
[2] em!(gmm::GMM{Float64, Vector{LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}}}, x::Matrix{Any}; nIter::Int64, varfloor::Float64, sparse::Int64, debug::Int64)
@ GaussianMixtures ~.julia\packages\GaussianMixtures\1pQcF\src\train.jl:238
[3] em!(gmm::GMM{Float64, Vector{LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}}}, x::Matrix{Any})
@ GaussianMixtures ~.julia\packages\GaussianMixtures\1pQcF\src\train.jl:238
[4] top-level scope
@ REPL[16]:1

What am I missing here?

ElOceanografo · May 20, 2021, 4:28pm

It looks like your input matrix is n x 2, and the first column is strings. Try it with em!(g, x1[:, 2]).

As an aside, if your data x1 include both labels and continuous values, they should probably be stored in a DataFrame, not a matrix…

SergeantMike67 · May 20, 2021, 4:51pm

I didn’t think GaussianMixtures worked with a DataFrame. That is the original format of the data, but I converted it to a matrix to match the type shown on the GitHub page.

SergeantMike67 · May 20, 2021, 5:02pm

ElOceanografo,
Thank you for the response, but when I run
em!(g, x1[:, 2])
or
em!(g, graphdata[:, 2])
where graphdata is the original DataFrame,

I get this error:

em!(g, x1[:, 2])
ERROR: MethodError: no method matching em!(::GMM{Float64, Vector{LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}}}, ::Vector{Any})
Closest candidates are:
em!(::GMM, ::DataOrMatrix{T} where T; nIter, varfloor, sparse, debug) at C:\Users\Mikey.julia\packages\GaussianMixtures\1pQcF\src\train.jl:237
em!(::VGMM, ::Any; nIter) at C:\Users\Mikey.julia\packages\GaussianMixtures\1pQcF\src\bayes.jl:273
Stacktrace:
[1] top-level scope
@ REPL[34]:1

ElOceanografo · May 20, 2021, 10:16pm

I think this is because em! expects the data to be a 2D Matrix, even if it only has a single column. This works:

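using DataFrames, GaussianMixtures

# Synthetic data: 50 draws around 0 and 150 draws around 4, plus a label column
df = DataFrame(label = rand(["a", "b", "c", "d"], 200), x = [randn(50); randn(150) .+ 4])
g = GMM(2, 1, kind=:full)    # 2 components, 1 dimension
g.μ[1] = 1.0                 # move one mean so the two components start apart
em!(g, reshape(df.x, :, 1))  # pass the column as a 200×1 Matrix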

Note that I changed one of the means before fitting it on the data. Otherwise, both means seem to stay identical to each other and the EM algorithm never splits the data into two groups.
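
To illustrate why identical starting means cannot separate, here is a minimal hand-rolled EM sketch for a two-component 1D mixture in plain Julia. It is not the GaussianMixtures implementation, and the names normpdf, em_step and run_em are made up for this illustration. The idea: if both components start with the same parameters, every point gets the same responsibility from each, so the M-step hands both components the same update, and they stay tied together.

```julia
using Random, Statistics

# Normal density, written out so the sketch needs no packages (hypothetical helper).
normpdf(x, μ, σ) = exp(-0.5 * ((x - μ) / σ)^2) / (σ * sqrt(2π))

# One EM iteration for a two-component 1D Gaussian mixture (hypothetical helper).
function em_step(x, w, μ, σ)
    # E-step: responsibility of each component for each point (n × 2 matrix)
    r = [w[k] * normpdf(xi, μ[k], σ[k]) for xi in x, k in 1:2]
    r ./= sum(r, dims=2)
    # M-step: re-estimate weights, means and standard deviations
    nk = vec(sum(r, dims=1))
    w_new = nk ./ length(x)
    μ_new = [sum(r[:, k] .* x) / nk[k] for k in 1:2]
    σ_new = [sqrt(sum(r[:, k] .* (x .- μ_new[k]).^2) / nk[k]) for k in 1:2]
    return w_new, μ_new, σ_new
end

# Run EM from equal weights and unit standard deviations, with the given starting means.
function run_em(x, μ0; iters=50)
    w, μ, σ = [0.5, 0.5], copy(μ0), [1.0, 1.0]
    for _ in 1:iters
        w, μ, σ = em_step(x, w, μ, σ)
    end
    return w, μ, σ
end

Random.seed!(1)
x = [randn(50); randn(150) .+ 4]        # synthetic data: one group near 0, one near 4

_, μ_stuck, _ = run_em(x, [0.0, 0.0])   # identical starting means
_, μ_split, _ = run_em(x, [0.0, 1.0])   # one mean nudged away from the other

println("identical start: ", μ_stuck)
println("nudged start:    ", μ_split)
```

With the identical start, both means should end up at the overall sample mean; with the nudged start, they should move toward the two groups. Starting the means near where the peaks are expected is one simple way to break the tie.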

cchderrick · May 20, 2021, 10:28pm

Another way to get a matrix rather than a 1-D vector is to index with a range: x1[:, 2:2].
With a DataFrame, Matrix(df[!, 2:2]) also seems to work,
as does Matrix(df[!, [:x]]).
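
A small sketch of the difference, using only Base Julia. The three-row x1 here is a stand-in for the n×2 Matrix{Any} from the original post. Note that the element type matters as well as the shape: both errors above mention Any, so converting to Float64 is probably a good idea in addition to keeping the data two-dimensional.

```julia
# Stand-in for the n×2 Matrix{Any} from the original post (labels + values).
x1 = Any["RDRX01000001" 0.251146;
         "RDRX01000002" 0.279088;
         "RDRX01000003" 0.428743]

v = x1[:, 2]      # a scalar column index drops the dimension: Vector{Any}
m = x1[:, 2:2]    # a range index keeps it: 3×1 Matrix{Any}

# Convert the element type too, so the data is a concrete Float64 matrix.
x = Float64.(m)                      # 3×1 Matrix{Float64}
x_alt = reshape(Float64.(v), :, 1)   # same result, starting from the vector

println(typeof(v), "  ", size(m), "  ", typeof(x))
```

Either x or x_alt has the n×1 Matrix{Float64} shape that the reshape(df.x, :, 1) call earlier in the thread produces.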

SergeantMike67 · May 21, 2021, 2:02pm

Thank you ElOceanografo, that worked. The calculated means seem to be off, but that likely isn’t caused by the package.

ElOceanografo · May 21, 2021, 6:06pm

From the graph you posted, it looks like your mixture isn’t that Gaussian (i.e., values all > 0 and a fairly skewed histogram) so it isn’t surprising that the means are off from where they “should” be. If it’s worth the effort, you might be able to fit a custom non-Gaussian mixture model using Turing or something.

SergeantMike67 · May 21, 2021, 7:11pm

What I am actually looking for is two Gaussian peaks, one <=0.5 and another >1.0. This particular organism may have that, as it looks like there is a peak at around 0.3-0.4, but I certainly could be wrong in my visual assessment.
