How do I fit a distribution to these data sets?!

Let me start by saying that I’m very inexperienced in this area, so go easy on me @Tamas_Papp :wink:. I have two datasets that are similar in shape and I’m trying to fit a distribution to each of them with the goal of being able to make probability statements about the processes that generated the data. For example, I want to be able to say that, “Under process A, the probability of a measurement being > $10,000 is 0.1 while under process B, the probability of a measurement being > $10,000 is 0.3 (or something like that).”

The problem is that there are a lot of zero values in the data, and the data are bunched up around the lower end of the spectrum, but the range is very wide, so I’m not sure what kind of distribution is appropriate. I’ve tried to deal with the zeros by doing log transformations (log.(data .+ 10), for example), taking the square root, etc., but I’m not having any luck. I’m using the Distributions.jl package.

For one of the data sets, the summary stats look like this:

Summary Stats:
Length: 30239
Missing Count: 0
Mean: 9011.465678
Minimum: 0.000000
1st Quartile: 0.000000
Median: 250.400000
3rd Quartile: 4129.670000
Maximum: 3607200.690000
Type: Float64

The 99th percentile is 138,688. I decided to lop the top 1% off with the hope that it would be easier to fit a distribution, but I’m still coming up short. Without the top 1% (so only including values <= 138,688) a histogram of the data looks like this:

[histogram of the data with the top 1% removed]

What I’ve tried is to fit basically every distribution possible from the Distributions.jl package via the fit function, and then I do a qqplot (from StatsPlots), but the data never fit the distribution, no matter what I do.

My questions are:

1: Is there a distribution that would be a good natural fit for this kind of dataset?
2: Should I exclude the zero-value data points?
3: Should I exclude more of the upper-end values (maybe only keep the top 90% or 95%)?
4: Should I explore more ways to transform the data?
5: The Distributions.jl docs talk about creating your own distribution; should I go down that road?

Any feedback is very much appreciated!

Sorry, I have no idea why you pinged me above, except possibly if you want to fit a distribution using Bayesian methods. That would be able to answer all your questions, but requires some prior expertise. For this kind of data in particular, a standard recommendation would be trying overdispersed families.
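As a quick screen along those lines, you can fit a few right-skewed candidate families to the positive values and compare their in-sample log-likelihoods. A minimal sketch with made-up data (the vector `ypos` is hypothetical; zeros would be handled separately, e.g. with a zero-inflated model as suggested below in the thread):

```julia
using Distributions

# Hypothetical positive measurements standing in for the real data.
ypos = [12.5, 120.0, 250.4, 900.0, 4129.67, 36000.0, 150000.0]

# Fit candidate families and compare in-sample log-likelihoods.
# This is a quick screen, not a substitute for proper model checking.
for D in (Exponential, Gamma, LogNormal)
    d = fit(D, ypos)
    println(D, ": loglik = ", loglikelihood(d, ypos))
end
```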

Regarding QQ plots: they are not necessarily good diagnostics for the very edge of tails, especially if you don’t have a lot of values there.

This data seems to be truncated at 0; you could take the natural log of the positive values and then fit the log-transformed data to certain distributions.
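A minimal sketch of that idea with Distributions.jl (the vector `y` here is made up): fitting a Normal to the logs of the positive values is equivalent to fitting a LogNormal to those values directly.

```julia
using Distributions

# Hypothetical data vector; zeros are dropped before taking logs,
# since log(0) is -Inf.
y = [0.0, 0.0, 12.5, 250.4, 980.0, 4129.67, 36000.0]
ypos = y[y .> 0]

# Fit a Normal to the logged data ...
nfit = fit(Normal, log.(ypos))

# ... which is equivalent to fitting a LogNormal to the raw positives:
lnfit = fit(LogNormal, ypos)

# nfit.μ ≈ lnfit.μ and nfit.σ ≈ lnfit.σ
```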


If I got you correctly, you just need a distribution that fits, and there is no need for the parameters of the distribution to have any meaning. So you are actually free to define your own distribution to fit the data.
What speaks against a PDF of
f(x) = \frac{1}{n} \sum_{i=1}^n \lambda_i e^{-\lambda_i x}
or similar? Then do the rest (the CDF) numerically.
And for this, the suggestion from @Yifan_Liu of doing the fit on a log scale is a good idea, btw.

EDIT: forgot the norm.
The CDF should be something like
F(x) = \frac{1}{n} \sum_{i=1}^n \left(1 - e^{-\lambda_i x}\right)
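For what it’s worth, Distributions.jl can already represent this equal-weight exponential mixture via MixtureModel, which gives you the pdf and cdf without hand-coding them. The λ values below are made up; in practice you’d estimate them from the data (e.g. by maximum likelihood or EM):

```julia
using Distributions

# Hypothetical rates λᵢ; note Distributions.jl parameterizes
# Exponential by the scale θ = 1/λ.
λ = [1/100, 1/2000, 1/50000]

# Equal-weight mixture: pdf(x) = (1/n) Σᵢ λᵢ exp(-λᵢ x).
# MixtureModel defaults to uniform component weights.
mix = MixtureModel([Exponential(1/λi) for λi in λ])

pdf(mix, 1000.0)        # mixture density at x = 1000
1 - cdf(mix, 10000.0)   # P(X > 10,000) under the mixture
```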


Without more information about how these data were generated and your end goals, it’s going to be pretty difficult for anyone to help. That said, you probably want to look into zero-inflated models. One approach is to fit separate processes for the zeros and the positives. Here you’d have something like

p_pos = fit(Bernoulli, y1 .> 0)        # probability of a nonzero value
pos   = fit(Exponential, y1[y1 .> 0])  # distribution of the positives

You can then e.g. generate new data points as

rand(p_pos, 100) .* rand(pos, 100)
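A zero-inflated fit like this also answers the original tail-probability question in closed form, since P(Y > t) = P(Y > 0) · P(Y > t | Y > 0). A self-contained sketch with made-up data:

```julia
using Distributions

# Hypothetical data: many zeros plus a long-tailed positive part.
y1 = [0.0, 0.0, 0.0, 120.0, 250.4, 900.0, 4129.67, 60000.0]

p_pos = fit(Bernoulli, y1 .> 0)        # probability of a nonzero value
pos   = fit(Exponential, y1[y1 .> 0])  # distribution of the positives

# P(Y > t) = P(Y > 0) * P(Y > t | Y > 0)
t = 10000.0
tail = succprob(p_pos) * ccdf(pos, t)
```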

I just learned from this thread that there is EmpiricalCDFs.jl, which should fulfill your purpose as well.
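If you’d rather not pull in another package, StatsBase’s ecdf does the same job: it builds the empirical CDF with no distributional assumptions at all, which is enough for probability statements like the one in the original post. A sketch with made-up data:

```julia
using StatsBase

# Hypothetical measurements.
y = [0.0, 0.0, 250.4, 980.0, 4129.67, 36000.0, 150000.0]

F = ecdf(y)       # empirical CDF as a callable function
1 - F(10000.0)    # empirical estimate of P(Y > 10,000)
```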
