Problem fitting a distribution to density data

I have a dataset from Elaad, which looks like this:


The idea is to fit a given `Distribution` subtype `D` to the data. I wanted to try a couple of ideas, but the fit needs to handle both the bins and the probability data.
I've tried using `StatsBase.fit()` and `Distributions.fit()`, but they either don't handle `weights` or don't have the `suffstats` of any of the asymmetric distributions I've tried.
My minimal working example so far is:

using StatsBase, StatsPlots, Distributions, Plots, CSV, DataFrames, LaTeXStrings
arr_times_df = CSV.read("distribution-of-arrival-weekdays.csv", DataFrame);
# Normalize the arrival times to 1
for i in 2:size(arr_times_df, 2)
    arr_times_df[:, i] = arr_times_df[:, i]/sum(arr_times_df[:, i])
end
# add extra column with arrival times in secs Float64
arr_times_df[!, :arrival_times] = (0:0.25:23.75)*3600.0;
# Convert data to arrays
times = Vector(arr_times_df.arrival_times)
prob = Vector(arr_times_df.private)

# Fit a probability distribution to the data
dist = fit(TriangularDist, times, weights(prob))

which of course doesn't work.

You already have the density from the data, albeit in histogram format. If you are looking for a smooth curve, perhaps you can try the GaussianMixtures.jl package.
Looking at the shape, you could fit a mixture of 2 or 3 Gaussians: construct a `GMM` instance and fit it to the observed data, as sketched below?
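
Something along these lines might work (a rough sketch, not tested against your data; `samples` is a placeholder for pseudo-observations drawn according to your bin probabilities, and 3 components is just a guess from the shape):

using GaussianMixtures, Distributions

samples = rand(Normal(12.0, 3.0), 5_000)  # placeholder: replace with draws from your empirical density
gmm = GMM(3, samples)                     # EM fit of 3 one-dimensional diagonal-covariance Gaussians
mix = MixtureModel(gmm)                   # convert to a Distributions.MixtureModel for pdf/rand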

Well, it actually took me a lot of time to figure out GaussianMixtures.jl and the rest of the StatsBase.jl ecosystem, but this is what I came up with.

using StatsBase, StatsPlots, LinearAlgebra, IterTools, Distributions, Plots, CSV, DataFrames
using GaussianMixtures

# Turn a histogram into an empirical mixture: one Uniform per bin,
# weighted by the normalized bin counts
function make_mixture(h)
    MixtureModel([Uniform(e...) for e in partition(h.edges[1], 2, 1)],
                 normalize(h.weights, 1))
end

edges = collect(0.0:0.25:24)
weights = arr_times_df.private;
h = Histogram(edges, weights)
m = make_mixture(h)  # empirical mixture model built from the histogram bins
gmm = MixtureModel(GMM(2, rand(m, 5000)))  # GMM with 2 peaks, fitted to 5000 samples drawn from the empirical mixture
print(gmm)
K-means converged with 12 iterations (objv = 15000.248816129308)
┌ Info: Initializing GMM, 2 Gaussians diag covariance 1 dimensions using 5000 data points
└ @ GaussianMixtures C:\Users\daros\.julia\packages\GaussianMixtures\zDaBV\src\train.jl:79
┌ Info: K-means with 2000 data points using 12 iterations
│ 500.0 data points per parameter
└ @ GaussianMixtures C:\Users\daros\.julia\packages\GaussianMixtures\zDaBV\src\train.jl:141
MixtureModel{Normal{Float64}}(K = 2)
components[1] (prior = 0.8222): Normal{Float64}(μ=19.447283283332077, σ=2.401753639052661)
components[2] (prior = 0.1778): Normal{Float64}(μ=10.907372546632375, σ=5.855637979445984)

Sadly, I think the fit is not very good:

plot(h)
plot!(gmm, func=pdf, linestyles=[:dash], linecolor=:lightblue, label="Components")
density!(rand(gmm, 100_000), linecolor=:lightblue, linewidth=2, xlabel="Value", label="Mixture model")

If I normalize `h` with `normalize(h, mode=:density)`, then the fit looks better:


but then `sum(h.weights) == 4.0`, so…
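
That factor of 4 is expected, assuming the 0.25 h bins from above: `mode=:density` divides each weight by its bin width, so the heights sum to 1/0.25 = 4 while the integral stays 1, which is exactly what a pdf should match:

hd = normalize(h, mode=:density)  # heights become densities: weight / bin width
sum(hd.weights)                   # 4.0 == 1 / 0.25
sum(hd.weights .* 0.25)           # ≈ 1.0: the integral a pdf must match
plot(hd)                          # overlaying the GMM pdf on this is a fair comparison
plot!(gmm, func=pdf)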

Since you are working with time, and time cannot be negative, you should incorporate this information into the estimation somehow. A convenient way, if time is always strictly positive, is to work with `log(time)`.
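
For example, reusing the empirical mixture `m` from above (just a sketch; whether a 2-component Gaussian fit in log-space suits this data is an assumption):

log_samples = log.(filter(>(0), rand(m, 5000)))  # fit in log-space, dropping any exact zeros
gmm_log = MixtureModel(GMM(2, log_samples))
t = exp(rand(gmm_log))                           # back-transformed draws are always positive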


Well, my application is not super strict, so I can just sample from the model, check the value, and re-sample if it is outside the limits [0, 24], e.g.:
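
A simple rejection loop over the fitted mixture `gmm` from above:

function sample_arrival(gmm)
    t = rand(gmm)
    while !(0.0 <= t <= 24.0)  # reject draws outside the valid range of hours
        t = rand(gmm)
    end
    return t
end

arrivals = [sample_arrival(gmm) for _ in 1:1_000]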
Now I’m trying to make forecasts out of this data. I’m more used to forecasting time-series/state-space models, so any recommendation is more than welcome.