GaussianMixtures question

I have a dataset that results from calculating the CpG observed over expected (an estimation of cytosine methylation in the genome) and I am trying to fit the data into a two mean, 1D Gaussian Mixture model using Gaussian Mixtures. Example input data below:

740736×2 Matrix{Any}:
“RDRX01000001” 0.251146
“RDRX01000002” 0.279088
“RDRX01000003” 0.428743
“RDRX01000004” 0.327771
“RDRX01000005” 0.350942
“RDRX01000006” 0.90064
“RDRX01000007” 0.267866
“RDRX01000008” 0.360507
“RDRX01000009” 0.600434

“RDRX01799874” 1.40488
“RDRX01799875” 1.65668
“RDRX01799876” 1.66154
“RDRX01799877” 1.04891
“RDRX01799878” 0.987179
“RDRX01799879” 1.29231
“RDRX01799880” 1.72998
“RDRX01799881” 1.08387

The histogram looks like:

I constructed the gmm with:
g = GMM(2,1,kind=:full)

and received this output, which I take as a success
GMM{Float64} with 2 components in 1 dimensions and full covariance
Mix 1: weight 0.500000
mean: [0.0]
covariance: 1×1 Matrix{Float64}:
Mix 2: weight 0.500000
mean: [0.0]
covariance: 1×1 Matrix{Float64}:

however when I run the training function:

where x1 is the matrix of the above data

I get the following error:
ERROR: Inconsistent size gmm and x
[1] error(s::String)
@ Base .\error.jl:33
[2] em!(gmm::GMM{Float64, Vector{LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}}}, x::Matrix{Any}; nIter::Int64, varfloor::Float64, sparse::Int64, debug::Int64)
@ GaussianMixtures ~.julia\packages\GaussianMixtures\1pQcF\src\train.jl:238
[3] em!(gmm::GMM{Float64, Vector{LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}}}, x::Matrix{Any})
@ GaussianMixtures ~.julia\packages\GaussianMixtures\1pQcF\src\train.jl:238
[4] top-level scope
@ REPL[16]:1

What am I missing here?

It looks like your input matrix is n x 2, and the first column is strings. Try it with em!(g, x1[:, 2]).

As an aside, if your data x1 include both labels and continuous values, they should probably be stored in a DataFrame, not a matrix…

I didn’t think GaussianMixtures worked with a dataframe. That is the original format of the data but I converted it to matrix format per the type shown on the Github page.

Thank you for the response but when I run the
em! (g, x1[:, 2])
em! (g, graphdata[:, 2])
where graphdata is the original DataFrame,

I get this error

em!(g, x1[:, 2])
ERROR: MethodError: no method matching em!(::GMM{Float64, Vector{LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}}}, ::Vector{Any})
Closest candidates are:
em!(::GMM, ::DataOrMatrix{T} where T; nIter, varfloor, sparse, debug) at C:\Users\Mikey.julia\packages\GaussianMixtures\1pQcF\src\train.jl:237
em!(::VGMM, ::Any; nIter) at C:\Users\Mikey.julia\packages\GaussianMixtures\1pQcF\src\bayes.jl:273
[1] top-level scope
@ REPL[34]:1

I think this is because em! expects the data to be a 2D Matrix, even if it only has a single column. This works:

using DataFrames, GaussianMixtures

df = DataFrame(label = rand(["a", "b", "c", "d"], 200),  x = [randn(50); randn(150).+4])
g = GMM(2,1,kind=:full)
g.μ[1] = 1.0
em!(g, reshape(df.x, :, 1))

Note that I changed one of the means before fitting it on the data–otherwise, both means seem to stay identical to each other and the EM algorithm never splits the data into two groups.

Another way to index a matrix out instead of a 1-D vector is graphdata[:, 2:2]
or if using DataFrame Matrix(df[!,2:2]) also seems to work.
or Matrix(df[!,[:x]])

Thank you ElOceanographo, that worked. The calculated means seem to be off, but I that isn’t likely from the package.

From the graph you posted, it looks like your mixture isn’t that Gaussian (i.e., values all > 0 and a fairly skewed histogram) so it isn’t surprising that the means are off from where they “should” be. If it’s worth the effort, you might be able to fit a custom non-Gaussian mixture model using Turing or something.

What I am actually looking for is two gaussian peaks, one <=0.5 and and another >1.0. This particular organism may have that as it looks like there is a peak at around 0.3-0.4 but I certainly could be wrong on my visual assessment.