Let me start by saying that I’m very inexperienced in this area, so go easy on me @Tamas_Papp . I have two datasets that are similar in shape and I’m trying to fit a distribution to each of them with the goal of being able to make probability statements about the processes that generated the data. For example, I want to be able to say that, “Under process A, the probability of a measurement being > $10,000 is 0.1 while under process B, the probability of a measurement being > $10,000 is 0.3 (or something like that).”
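To make the goal concrete, here is a minimal sketch of the kind of statement I'm after. It uses synthetic data rather than my real datasets, and the choice of LogNormal is only an assumption for illustration, not a claim that it fits. The names a, b, da, db, pa and pb are made up for this example.

```julia
using Distributions, Random

Random.seed!(1)

# Synthetic stand-ins for the two datasets (assumption: NOT the real data)
a = rand(LogNormal(7.0, 1.5), 1_000)   # "process A"
b = rand(LogNormal(8.0, 1.5), 1_000)   # "process B"

# Maximum-likelihood fit of an assumed family to each sample
da = fit(LogNormal, a)
db = fit(LogNormal, b)

# P(X > 10_000) under each fitted model; ccdf(d, x) is 1 - cdf(d, x)
pa = ccdf(da, 10_000)
pb = ccdf(db, 10_000)

println("P(A > 10_000) = ", round(pa, digits = 3))
println("P(B > 10_000) = ", round(pb, digits = 3))
```

Once a distribution that actually fits is found, the probability statement itself is just a ccdf call on the fitted object.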
The problem is that there are a lot of zero values in the data, and the data are bunched up around the lower end of the spectrum while the range is very wide, so I’m not sure what kind of distribution is appropriate. I’ve tried to deal with the zeros by doing log transformations (log.(data .+ 10), for example), taking the square root, etc., but I’m not having any luck. I’m using the Distributions.jl package.
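A small sketch of the transforms mentioned above, on a made-up vector (the values are illustrative, not my real data). It shows one likely reason the transforms don't help: every exact zero maps to the same single value, so the pile-up at the low end survives the transform.

```julia
# Made-up sample with several exact zeros (assumption: stand-in for the real data)
data = [0.0, 0.0, 0.0, 0.0, 12.5, 380.0, 4129.67, 52_000.0]

shifted = log.(data .+ 10)   # shifted log transform
roots   = sqrt.(data)        # square-root transform

# Count how many observations are tied at the value the zeros map to
n_zero      = count(iszero, data)
n_tied_log  = count(==(log(10.0)), shifted)   # zeros all become log(10)
n_tied_sqrt = count(iszero, roots)            # zeros stay at 0

println("zeros in data:         ", n_zero)
println("tied after log shift:  ", n_tied_log)
println("tied after square root: ", n_tied_sqrt)
```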
For one of the data sets, the summary stats look like this:
Missing Count: 0
1st Quartile: 0.000000
3rd Quartile: 4129.670000
The 99th percentile is 138,688. I decided to lop the top 1% off with the hope that it would be easier to fit a distribution, but I’m still coming up short. Without the top 1% (so only including values <= 138,688) a histogram of the data looks like this:
What I’ve tried is to fit basically every distribution possible from the Distributions.jl package via the fit function, and then I do a qqplot (from StatsPlots), and the data never fit the distribution, no matter what I do.
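Roughly, the workflow looks like the sketch below. It runs on synthetic, strictly positive data (an assumption, since fitting something like LogNormal to data with exact zeros involves log(0)), and the three candidate families are an arbitrary selection. Instead of drawing the plot, it compares model quantiles with sample quantiles numerically, which is the same comparison a QQ plot shows. The helper max_quantile_gap is a hypothetical name invented for this example.

```julia
using Distributions, Statistics, Random

Random.seed!(42)

# Synthetic, strictly positive sample (assumption: NOT the real data)
x = rand(Gamma(0.5, 8_000.0), 2_000)

# Hypothetical helper: fit family D to x and return the largest relative gap
# between fitted-model quantiles and sample quantiles over the given probabilities
function max_quantile_gap(D, x, probs)
    d = fit(D, x)                    # maximum-likelihood fit
    model_q  = quantile.(d, probs)   # quantiles of the fitted distribution
    sample_q = quantile(x, probs)    # empirical quantiles
    return maximum(abs.(model_q .- sample_q) ./ sample_q)
end

probs = 0.05:0.05:0.95
candidates = [Exponential, Gamma, LogNormal]   # arbitrary families for illustration

gaps = Dict(string(D) => max_quantile_gap(D, x, probs) for D in candidates)

for (name, gap) in gaps
    println(name, ": max relative quantile gap = ", round(gap, digits = 3))
end
```

A small gap means the fitted quantiles track the sample quantiles, which corresponds to points lying near the diagonal in the QQ plot.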
My questions are:
1: Is there a distribution that would be a good natural fit for this kind of dataset?
2: Should I exclude the zero-value data points?
3: Should I exclude more of the upper-end values (maybe only keep the lowest 90% or 95%)?
4: Should I explore more ways to transform the data?
5: The Distributions.jl docs talk about creating your own distribution. Should I go down that road?
Any feedback is very much appreciated!