Fitting a distribution to a histogram: axis-limit issues and how to assess goodness of fit?

Hey Community,

I am facing the following problem:
I need to find the growth-rate distribution of a data set. To do this, I count the new objects created at each time step. I want to show these counts in a histogram, fit an appropriate distribution to them and, ideally, evaluate the goodness of the fit.

After computing the rates, which are stored in the array “differences”, I count how often each rate occurs and keep only the counts greater than 10:

using StatsBase
using Distributions
using StatsPlots

# count how often each rate occurs (rate => number of occurrences)
pop = countmap(differences)
# keep only the rates that occur more than 10 times
filtered_pop = filter(p -> last(p) > 10, collect(pop))

# each entry is a (rate => count) pair; keep only the counts
pop_vals = last.(filtered_pop)
sort!(pop_vals)
daily_rates = pop_vals

which gives me the following array:

[11, 11, 11, 12, 12, 12, 13, 14, 15, 15, 17, 18, 21, 24, 25, 25, 28, 38, 39, 43, 47, 54, 55, 64, 87, 99, 127, 187, 237, 306, 413, 611, 933, 1563, 3271]

I fit this array with the Distributions.jl package, trying a Pareto distribution estimated by maximum likelihood:

P = fit_mle(Pareto, daily_rates)
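
For reference, the maximum-likelihood estimate of a Pareto distribution has a closed form: the scale is the smallest observation and the shape is n divided by the sum of log(x / scale). The sketch below computes it by hand in plain Julia. `pareto_mle` is a hypothetical helper written for illustration, not part of Distributions.jl; as far as I can tell it should agree with what `fit_mle(Pareto, daily_rates)` returns, but treat that as an assumption to check.

```julia
# Hypothetical helper (not part of Distributions.jl): closed-form Pareto MLE.
function pareto_mle(x)
    θ = minimum(x)                       # scale: the smallest observation
    α = length(x) / sum(log.(x ./ θ))    # shape: n / Σ log(xᵢ / θ)
    return (α = α, θ = θ)
end

daily_rates = [11, 11, 11, 12, 12, 12, 13, 14, 15, 15, 17, 18, 21, 24, 25, 25,
               28, 38, 39, 43, 47, 54, 55, 64, 87, 99, 127, 187, 237, 306, 413,
               611, 933, 1563, 3271]

est = pareto_mle(daily_rates)
println("shape α = ", est.α, ", scale θ = ", est.θ)
```

Comparing `est.α` and `est.θ` with the parameters of `P` is a quick sanity check that the fit did what was intended.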

To plot the fitted distribution as a straight line, I also take the unique values of the array above as x-values:

x = unique(daily_rates);

Now I plot the histogram and the fit with StatsPlots.jl, using logarithmic binning:

# 100 logarithmically spaced bins from 10^0 up to the largest value
log_bins = 10 .^ range(0.0, length = 101, stop = log10(maximum(daily_rates)))
StatsPlots.histogram(daily_rates, bins = log_bins, fillalpha = 0.4, normalize = true,
    xaxis = :log, yaxis = :log, xlims = (10, maximum(daily_rates)), label = :data)
StatsPlots.plot!(P, x, xaxis = :log, yaxis = :log, label = :fit)

When I plot this, the histogram bars do not start at the zero line, which I don’t really understand.

[Figure: rate_distribution]

This is my first issue, and I would be grateful for any hint on how to change it.

Moreover, it would be very useful to get a statement about the goodness of the fit. I see two options:
Either I calculate, for example, Pearson’s chi-squared statistic manually with:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
But for this I would need the exact y-values of the histogram, and I didn’t find any function that returns them.
Or I use the HypothesisTests.jl package, which overwhelms me a bit; I haven’t figured out yet how to use it in my case, since I am quite inexperienced.

I don’t know if I am asking for too much help - but I would be very thankful for any reply.
Thanks in advance! :slight_smile:


The following gives the histogram’s y-values:

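using Distributions
using StatsBase
# fit a histogram with log-spaced bin edges 10^-1, 10^-0.5, 10^0, 10^0.5
f = fit(Histogram, [0.3, 0.5, 0.7], 10 .^ [-1, -0.5, 0, 0.5])
f.weights  # number of data points in each bin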

(I didn’t find a function to extract the weights, and accessing them like this feels dirty.)
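
To connect the bin counts to the chi-squared question, here is a minimal sketch of a manual Pearson chi-squared computation for the Pareto fit. It is written in plain Julia so that it runs without packages; the hand-written parts are stand-ins, and with StatsBase and Distributions.jl they would correspond to `fit(Histogram, daily_rates, edges).weights` and `cdf(P, x)`. The names `pareto_cdf`, `edges`, `observed` and `expected`, the choice of 5 log-spaced bins and the open-ended last bin are all assumptions made for illustration.

```julia
daily_rates = [11, 11, 11, 12, 12, 12, 13, 14, 15, 15, 17, 18, 21, 24, 25, 25,
               28, 38, 39, 43, 47, 54, 55, 64, 87, 99, 127, 187, 237, 306, 413,
               611, 933, 1563, 3271]
n = length(daily_rates)

# Closed-form Pareto maximum-likelihood estimate (scale θ, shape α).
θ = minimum(daily_rates)
α = n / sum(log.(daily_rates ./ θ))

# Pareto CDF written out by hand (stand-in for cdf(P, x) from Distributions.jl).
pareto_cdf(x) = x < θ ? 0.0 : 1 - (θ / x)^α

# Log-spaced bin edges; 5 bins is an arbitrary choice for this sketch.
edges = 10 .^ range(log10(θ), log10(maximum(daily_rates)), length = 6)
edges[1] = θ      # avoid floating-point round-off at the lower edge
edges[end] = Inf  # open-ended last bin, so the expected counts sum to n

nbins = length(edges) - 1
# Observed counts per left-closed bin [edges[i], edges[i+1]).
observed = [count(x -> edges[i] <= x < edges[i+1], daily_rates) for i in 1:nbins]
# Expected counts under the fitted Pareto distribution.
expected = [n * (pareto_cdf(edges[i+1]) - pareto_cdf(edges[i])) for i in 1:nbins]

# Pearson's chi-squared statistic: Σ (O - E)² / E.
chi_sq = sum((observed .- expected) .^ 2 ./ expected)
println("observed = ", observed)
println("expected = ", expected)
println("chi-squared = ", chi_sq)
```

To turn the statistic into a p-value, one would compare it with a chi-squared distribution, for example `ccdf(Chisq(dof), chi_sq)` from Distributions.jl, where `dof` is usually taken as the number of bins minus one minus the number of fitted parameters (here 5 - 1 - 2 = 2). With only 35 data points some expected counts may be small, so the result should be read with caution.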