Hey Community,
I am facing the following “problem”:
I am supposed to find the growth rate distribution in a certain data set. For this I am looking at the number of new objects created at each time step. I want to illustrate these counts with a histogram, fit them with an appropriate distribution, and ideally also evaluate the goodness of the fit.
After getting the different rates, which are stored in the array “differences”, I count how often each one occurs and keep only the counts greater than 10:

```julia
using StatsBase
using Distributions
using StatsPlots

# Count how often each rate occurs
pop = countmap(differences)
# Keep only the entries whose count is greater than 10
filtered_pop = filter(pair -> last(pair) > 10, collect(pop))
# Each entry is a (rate => count) pair; we only want the counts
pop_vals = [last(pair) for pair in filtered_pop]
sort!(pop_vals)
daily_rates = pop_vals
```

which gives me the following array:
[11, 11, 11, 12, 12, 12, 13, 14, 15, 15, 17, 18, 21, 24, 25, 25, 28, 38, 39, 43, 47, 54, 55, 64, 87, 99, 127, 187, 237, 306, 413, 611, 933, 1563, 3271]
I fit this array with the Distributions.jl package, where I tried the Pareto distribution with the maximum likelihood method:

```julia
P = fit_mle(Pareto, daily_rates)
```

To plot the fitted distribution as a straight line, I also take the unique values of the array above to have some x-values:

```julia
x = unique(daily_rates);
```

Now I plot the histogram and the fit with StatsPlots.jl, using logarithmic binning:

```julia
# Logarithmically spaced bin edges from 10^0 up to the maximum
log_bins = 10 .^ range(0.0, stop = log10(maximum(daily_rates)), length = 101)
StatsPlots.histogram(daily_rates, bins = log_bins, fillalpha = 0.4, normalize = true,
                     xaxis = :log, yaxis = :log,
                     xlims = (10, maximum(daily_rates)), label = :data)
StatsPlots.plot!(P, x, xaxis = :log, yaxis = :log, label = :fit)
```

When I plot this, the histogram bars do not start at the zero line, which I don’t really understand.
This is my first issue, and I would be happy for any hint on how to change it.
Moreover, it would be very useful to get a statement about the goodness of the fit. I see two options:
Either I calculate, for example, Pearson's chi-squared statistic manually.
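For reference, this is the statistic I have in mind, in its usual textbook form. The symbols are my own notation: $O_i$ is the observed count in bin $i$, $E_i$ is the count expected in that bin under the fitted distribution, and $k$ is the number of bins.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```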
But for this I would need the exact y-values (bin heights) of the histogram, and I have not found a function that returns them.
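One possible way to get at those values, as a minimal sketch: StatsBase can build the histogram as an object with `fit(Histogram, data, edges)`, whose `weights` field holds the per-bin counts, and `normalize(h; mode = :pdf)` rescales them to densities. The choice of 20 bins, the `+ 1` on the upper edge, and the geometric bin centers are assumptions for illustration, not the binning used in the plot above.

```julia
using StatsBase       # Histogram, fit
using LinearAlgebra   # normalize, which StatsBase extends for Histogram

daily_rates = [11, 11, 11, 12, 12, 12, 13, 14, 15, 15, 17, 18, 21, 24, 25, 25,
               28, 38, 39, 43, 47, 54, 55, 64, 87, 99, 127, 187, 237, 306, 413,
               611, 933, 1563, 3271]

# Logarithmically spaced bin edges from 10 to just above the maximum.
# Bins are left-closed by default, so the last edge must exceed the maximum.
edges = 10 .^ range(1.0, stop = log10(maximum(daily_rates) + 1), length = 21)

# Raw counts per bin
h = fit(Histogram, daily_rates, edges)
counts = h.weights

# Densities per bin (counts divided by sample size and bin width)
hn = normalize(h; mode = :pdf)
densities = hn.weights

# Geometric bin centers, convenient as x-values on a log axis
centers = sqrt.(edges[1:end-1] .* edges[2:end])
```

The `counts` vector would play the role of the observed values in a manual chi-squared calculation, while `densities` can be compared with `pdf.(P, centers)` for a fitted distribution `P`.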
Or I use the HypothesisTests.jl package, which overwhelms me a bit; I have not yet figured out how to use it in my case, since I am quite inexperienced.
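From what I can tell, the simplest entry point in HypothesisTests.jl might be a one-sample Kolmogorov-Smirnov test, which takes the data and a distribution object directly. The sketch below is only that, a sketch: choosing the KS test is my assumption, the test may warn about ties because the data contains repeated values, and the p-value is likely too optimistic because the Pareto parameters were estimated from the same data.

```julia
using Distributions
using HypothesisTests

daily_rates = Float64[11, 11, 11, 12, 12, 12, 13, 14, 15, 15, 17, 18, 21, 24,
                      25, 25, 28, 38, 39, 43, 47, 54, 55, 64, 87, 99, 127, 187,
                      237, 306, 413, 611, 933, 1563, 3271]

# Maximum likelihood fit of a Pareto distribution, as in the post
P = fit_mle(Pareto, daily_rates)

# One-sample KS test of the data against the fitted distribution
ks = ExactOneSampleKSTest(daily_rates, P)

# A small p-value would speak against the Pareto hypothesis
p = pvalue(ks)
println("Fitted parameters (shape, scale): ", params(P))
println("KS p-value: ", p)
```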
I don’t know if I am asking for too much help - but I would be very thankful for any reply.
Thanks in advance!