How do I fit a distribution to these data sets?!

mthelm85 · August 31, 2019, 5:00pm

Let me start by saying that I’m very inexperienced in this area, so go easy on me @Tamas_Papp. I have two datasets that are similar in shape and I’m trying to fit a distribution to each of them with the goal of being able to make probability statements about the processes that generated the data. For example, I want to be able to say that, “Under process A, the probability of a measurement being > $10,000 is 0.1 while under process B, the probability of a measurement being > $10,000 is 0.3 (or something like that).”

The problem is that there are a lot of zero values in the data and the data are bunched up around the lower end of the spectrum but the range is very wide, so I’m not sure what kind of distribution is appropriate. I’ve tried to deal with the zeros by doing log transformations (log.(data .+ 10), for example), taking the square root, etc., but I’m not having any luck. I’m using the Distributions.jl package.

For one of the data sets, the summary stats look like this:

Summary Stats:
Length: 30239
Missing Count: 0
Mean: 9011.465678
Minimum: 0.000000
1st Quartile: 0.000000
Median: 250.400000
3rd Quartile: 4129.670000
Maximum: 3607200.690000
Type: Float64

The 99th percentile is 138,688. I decided to lop the top 1% off with the hope that it would be easier to fit a distribution, but I’m still coming up short. Without the top 1% (so only including values <= 138,688) a histogram of the data looks like this:

What I’ve tried is to fit basically every distribution possible from the Distributions.jl package, via the fit function, and then do a qqplot (from StatsPlots), but the data never fit the distribution, no matter what I do.
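As a complement to eyeballing QQ plots, candidate families can also be compared numerically. The following is a minimal sketch, not the poster's workflow: it swaps in a log-likelihood comparison, uses synthetic positive data (the real data are not available here), and the candidate list is an arbitrary assumption.

```julia
using Distributions, Random

Random.seed!(1)
# Synthetic stand-in for the positive part of the data (NOT the poster's data)
x = rand(LogNormal(6.0, 1.5), 5_000)

# Arbitrary shortlist of positive-support families that `fit` supports
candidates = [Exponential, Gamma, LogNormal]
fits = Dict(D => fit(D, x) for D in candidates)

# Higher log-likelihood means a better in-sample fit
logliks = [loglikelihood(fits[D], x) for D in candidates]
for (D, ll) in zip(candidates, logliks)
    println(rpad(string(D), 12), " loglik = ", round(ll; digits = 1))
end

best = candidates[argmax(logliks)]
```

Because the families have different numbers of parameters, a raw log-likelihood comparison favors the more flexible ones slightly; a penalized criterion such as AIC would be a natural refinement.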

My questions are:

1: Is there a distribution that would be a good natural fit for this kind of dataset?
2: Should I exclude the zero-value data points?
3: Should I exclude more of the upper-end values (maybe only keep the top 90% or 95%)?
4: Should I explore more ways to transform the data?
5: The Distributions.jl docs talk about creating your own distribution. Should I go down that road?

Any feedback is very much appreciated!

Tamas_Papp · August 31, 2019, 5:46pm

Sorry, I have no idea why you pinged me above, except possibly if you want to fit a distribution using Bayesian methods. This would be able to answer all your questions, but requires some prior expertise. For this distribution in particular, a standard recommendation would be trying overdispersed families.

Regarding QQ plots: they are not necessarily good diagnostics for the very edge of tails, especially if you don’t have a lot of values there.

Yifan_Liu · August 31, 2019, 6:47pm

These data seem to be truncated at 0. You can take the natural log of the data and then fit a distribution to the log-transformed values.
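A minimal sketch of the log-scale idea, assuming synthetic data and assuming the zeros are set aside first (log(0) is -Inf, so they cannot be transformed). Fitting a Normal to the logged values is the same model as fitting a LogNormal on the original scale; the threshold 10_000 comes from the original question.

```julia
using Distributions, Random

Random.seed!(2)
# Synthetic stand-in: 30% exact zeros plus a skewed positive part
y = [zeros(300); rand(LogNormal(5.5, 1.8), 700)]

ypos = y[y .> 0]                   # zeros cannot be log-transformed

d_log = fit(Normal, log.(ypos))    # Normal fitted on the log scale
d_pos = fit(LogNormal, ypos)       # same model on the original scale

# P(Y > 10_000 | Y > 0) under the fitted model
p_tail = ccdf(d_pos, 10_000)
```

Note that this probability is conditional on the value being positive; the share of zeros has to be accounted for separately.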

MatFi · August 31, 2019, 8:34pm

If I got you correctly, you just need a fitting distribution and there is no need that the parameters of the distribution have any meaning. So you are actually free to define your own distribution to fit the data.
What speaks against a PDF of
f(x) = \frac{1}{n} \sum_{i=1}^n \lambda_i \exp(-\lambda_i x)
or similar, and then doing the rest (the CDF) numerically?
And for this, the suggestion from @Yifan_Liu to do the fit on a log scale is a good idea, btw.

EDIT: forgot the normalization.
The CDF should be something like
f_{CDF}(x) = \frac{1}{n} \sum_{i=1}^n (1 - \exp(-\lambda_i x))
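An equal-weight mixture of exponentials like this can be written with MixtureModel from Distributions.jl, which then provides the CDF directly. This is only a sketch: the rates below are hypothetical placeholders, and estimating the \lambda_i from data is not shown.

```julia
using Distributions

λ = [1e-2, 1e-3, 1e-4]   # hypothetical rates, NOT fitted to any data

# Distributions.jl parameterizes Exponential by its scale θ = 1/λ.
# MixtureModel with only a component vector uses equal weights 1/n.
mix = MixtureModel([Exponential(1 / λi) for λi in λ])

x = 10_000.0
cdf_formula = sum(1 .- exp.(-λ .* x)) / length(λ)  # the CDF formula above
cdf_mix = cdf(mix, x)                              # same value from the mixture

p_exceed = 1 - cdf_mix   # P(X > 10_000) under this mixture
```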

jkbest2 · August 31, 2019, 9:30pm

Without more information about how these data were generated and your end goals, it’s going to be pretty difficult for anyone to help. That said, you probably want to look into zero-inflated models. One approach is to fit separate processes for the zeros and the positives. Here you’d have something like

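using Distributions

# Success = "the value is positive", so a draw of 1 keeps the positive part
p_pos = fit(Bernoulli, y1 .> 0)
pos = fit(Exponential, y1[y1 .> 0])

You can then e.g. generate new data points as

rand(p_pos, 100) .* rand(pos, 100)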
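With a two-part model like this, the kind of probability statement asked for in the original post follows from P(Y > t) = P(Y > 0) * P(Y > t | Y > 0) for t >= 0. Here is a minimal sketch on synthetic data; the names p_zero and exceed are illustrative, and Exponential is just the placeholder family from above, not a claim that it fits the real data.

```julia
using Distributions, Random, Statistics

Random.seed!(3)
# Synthetic stand-in: 40% exact zeros plus an exponential positive part
y = [zeros(400); rand(Exponential(5_000), 600)]

p_zero = mean(y .== 0)                   # estimated share of zeros
pos = fit(Exponential, y[y .> 0])        # model for the positive part

# For t >= 0: P(Y > t) = P(Y > 0) * P(Y > t | Y > 0)
exceed(t) = (1 - p_zero) * ccdf(pos, t)

p_10k = exceed(10_000)
```

Doing this for each of the two data sets gives directly comparable exceedance probabilities for processes A and B.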

MatFi · September 1, 2019, 5:59am

I just learned from this thread that there is EmpiricalCDFs, which should serve your purpose as well.
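The empirical-CDF route needs no parametric family at all. The sketch below uses the ecdf function from StatsBase rather than the EmpiricalCDFs package (a substitution made here because its API is the one I can vouch for), again on synthetic data.

```julia
using StatsBase, Random

Random.seed!(4)
# Synthetic stand-in: 40% exact zeros plus a log-normal-like positive part
y = [zeros(400); exp.(6 .+ 2 .* randn(600))]

F = ecdf(y)                  # callable empirical CDF: F(t) = share of y <= t

p_exceed = 1 - F(10_000)     # share of observations above 10_000
```

The estimate is only as good as the sample in the region of interest, so it becomes noisy for thresholds far out in the tail.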
