I have a dataset that results from calculating the CpG observed over expected (an estimation of cytosine methylation in the genome) and I am trying to fit the data into a two mean, 1D Gaussian Mixture model using Gaussian Mixtures. Example input data below:
I constructed the gmm with: g = GMM(2,1,kind=:full)
and received this output, which I take as a success
GMM{Float64} with 2 components in 1 dimensions and full covariance
Mix 1: weight 0.500000
mean: [0.0]
covariance: 1×1 Matrix{Float64}:
1.0
Mix 2: weight 0.500000
mean: [0.0]
covariance: 1×1 Matrix{Float64}:
1.0
however when I run the training function: em!(g,x1)
where x1 is the matrix of the above data
I get the following error:
ERROR: Inconsistent size gmm and x
Stacktrace:
[1] error(s::String)
@ Base .\error.jl:33
[2] em!(gmm::GMM{Float64, Vector{LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}}}, x::Matrix{Any}; nIter::Int64, varfloor::Float64, sparse::Int64, debug::Int64)
@ GaussianMixtures ~.julia\packages\GaussianMixtures\1pQcF\src\train.jl:238
[3] em!(gmm::GMM{Float64, Vector{LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}}}, x::Matrix{Any})
@ GaussianMixtures ~.julia\packages\GaussianMixtures\1pQcF\src\train.jl:238
[4] top-level scope
@ REPL[16]:1
I didn’t think GaussianMixtures worked with a dataframe. That is the original format of the data but I converted it to matrix format per the type shown on the Github page.
I think this is because em! expects the data to be a 2D Matrix, even if it only has a single column. This works:
using DataFrames, GaussianMixtures
df = DataFrame(label = rand(["a", "b", "c", "d"], 200), x = [randn(50); randn(150).+4])
g = GMM(2,1,kind=:full)
g.μ[1] = 1.0
em!(g, reshape(df.x, :, 1))
Note that I changed one of the means before fitting it on the data–otherwise, both means seem to stay identical to each other and the EM algorithm never splits the data into two groups.
Another way to index a matrix out instead of a 1-D vector is graphdata[:, 2:2]
or if using DataFrame Matrix(df[!,2:2]) also seems to work.
or Matrix(df[!,[:x]])
From the graph you posted, it looks like your mixture isn’t that Gaussian (i.e., values all > 0 and a fairly skewed histogram) so it isn’t surprising that the means are off from where they “should” be. If it’s worth the effort, you might be able to fit a custom non-Gaussian mixture model using Turing or something.
What I am actually looking for is two gaussian peaks, one <=0.5 and and another >1.0. This particular organism may have that as it looks like there is a peak at around 0.3-0.4 but I certainly could be wrong on my visual assessment.