Problem fitting a distribution to density data

I have a dataset from Elaad, which looks like this:


The idea is to fit a given `Distribution` subtype `D` to the data. I wanted to try a couple of ideas, but the fit needs to handle both the bins and the probability data.
I've tried using `StatsBase.fit()` and `Distributions.fit()`, but they either don't handle `weights` or don't have the `suffstats` of any of the asymmetric distributions I've tried.
My minimal working example so far is:

using StatsBase, StatsPlots, Distributions, Plots, CSV, DataFrames, LaTeXStrings
arr_times_df = CSV.read("distribution-of-arrival-weekdays.csv", DataFrame);
# Normalize the arrival times to 1
for i in 2:size(arr_times_df, 2)
    arr_times_df[:, i] = arr_times_df[:, i]/sum(arr_times_df[:, i])
end
# add extra column with arrival times in secs Float64
arr_times_df[!, :arrival_times] = (0:0.25:23.75)*3600.0;
# Convert data to arrays
times = Vector(arr_times_df.arrival_times)
prob = Vector(arr_times_df.private)

# Fit a probability distribution to the data
dist = fit(TriangularDist, times, weights(prob))

which of course doesn't work.

You already have the density from the data, albeit in histogram format. If you are looking for a smooth curve, perhaps you can try the GaussianMixtures.jl package.
Looking at the shape, you could fit a mixture of 2 or 3 Gaussians: construct a `GMM` instance and fit it to the observed data, as sketched below?
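
Something along these lines might work (a rough sketch, not tested against your data; `samples` is a placeholder for pseudo-observations drawn according to your bin probabilities, and 3 components is just a guess from the shape):

using GaussianMixtures, Distributions

samples = rand(Normal(12.0, 3.0), 5_000)  # placeholder: replace with draws from your empirical density
gmm = GMM(3, samples)                     # EM fit of 3 one-dimensional diagonal-covariance Gaussians
mix = MixtureModel(gmm)                   # convert to a Distributions.MixtureModel for pdf/rand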

Well, it actually took me a lot of time to figure out GaussianMixtures.jl and the rest of the StatsBase.jl ecosystem, but this is what I came up with.

using StatsBase, StatsPlots, LinearAlgebra, IterTools, Distributions, Plots, CSV, DataFrames
using GaussianMixtures

# Turn a histogram into an empirical mixture: one Uniform per bin,
# weighted by the normalized bin counts
function make_mixture(h)
    MixtureModel([Uniform(e...) for e in partition(h.edges[1], 2, 1)],
                 normalize(h.weights, 1))
end

edges = collect(0.0:0.25:24)
weights = arr_times_df.private;
h = Histogram(edges, weights)
m = make_mixture(h)  # empirical mixture model built from the histogram bins
gmm = MixtureModel(GMM(2, rand(m, 5000)))  # GMM with 2 peaks, fitted to 5000 samples drawn from the empirical mixture
print(gmm)
K-means converged with 12 iterations (objv = 15000.248816129308)
┌ Info: Initializing GMM, 2 Gaussians diag covariance 1 dimensions using 5000 data points
└ @ GaussianMixtures C:\Users\daros\.julia\packages\GaussianMixtures\zDaBV\src\train.jl:79
┌ Info: K-means with 2000 data points using 12 iterations
│ 500.0 data points per parameter
└ @ GaussianMixtures C:\Users\daros\.julia\packages\GaussianMixtures\zDaBV\src\train.jl:141
MixtureModel{Normal{Float64}}(K = 2)
components[1] (prior = 0.8222): Normal{Float64}(μ=19.447283283332077, σ=2.401753639052661)
components[2] (prior = 0.1778): Normal{Float64}(μ=10.907372546632375, σ=5.855637979445984)

Sadly, I think the fit is not very good:

plot(h)
plot!(gmm, func=pdf, linestyles=[:dash], linecolor=:lightblue, label="Components")
density!(rand(gmm, 100_000), linecolor=:lightblue, linewidth=2, xlabel="Value", label="Mixture model")

If I normalize `h` with `normalize(h, mode=:density)`, then the fit looks better:


but then `sum(h.weights) == 4.0`, so…
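
That factor of 4 is expected, assuming the 0.25 h bins from above: `mode=:density` divides each weight by its bin width, so the heights sum to 1/0.25 = 4 while the integral stays 1, which is exactly what a pdf should match:

hd = normalize(h, mode=:density)  # heights become densities: weight / bin width
sum(hd.weights)                   # 4.0 == 1 / 0.25
sum(hd.weights .* 0.25)           # ≈ 1.0: the integral a pdf must match
plot(hd)                          # overlaying the GMM pdf on this is a fair comparison
plot!(gmm, func=pdf)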

Since you are working with time, and time cannot be negative, you should incorporate this information into the estimation somehow. A convenient way, if time is always strictly positive, is to work with `log(time)`.
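
For example, reusing the empirical mixture `m` from above (just a sketch; whether a 2-component Gaussian fit in log-space suits this data is an assumption):

log_samples = log.(filter(>(0), rand(m, 5000)))  # fit in log-space, dropping any exact zeros
gmm_log = MixtureModel(GMM(2, log_samples))
t = exp(rand(gmm_log))                           # back-transformed draws are always positive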


Well, my application is not super strict, so I can just sample from the model, check the value, and re-sample if it is outside the limits [0, 24], e.g.:
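
A simple rejection loop over the fitted mixture `gmm` from above:

function sample_arrival(gmm)
    t = rand(gmm)
    while !(0.0 <= t <= 24.0)  # reject draws outside the valid range of hours
        t = rand(gmm)
    end
    return t
end

arrivals = [sample_arrival(gmm) for _ in 1:1_000]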
Now I’m trying to make forecasts out of this data. I’m more used to forecasting time-series/state-space models, so any recommendation is more than welcome.