Good $(period_of_day) to all,
Let me preface this question by saying that there is a good chance (p > 0.8) that I don’t know what I’m talking about, so feel free to redirect me. I’ll explain what I have and what I need to do, and you can tell me what keywords I should search on.
What I have:
I have a DataFrame (although it could easily be an Array{Int64, 2}) that holds a data distribution. Column 1 is called :bucket and contains the upper edges of 200 fixed-width buckets. Column 2 is called :count and contains the population of each bucket.
In most cases, the distribution is close to, but not exactly, a log-normal distribution. There are times when it is double-humped, but I’ve narrowed this down to it being a combination of two log-normal-like distributions (in reality it is almost always a combination of multiple distributions, but typically one of them vastly outnumbers the others).
I can also determine the geometric mean and geometric standard deviation of the overall distribution.
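For reference, here is a minimal sketch of how those two statistics can be computed from the binned data. It assumes each bucket is represented by its midpoint and that all midpoints are positive; the function name `geometric_stats` and the toy data are hypothetical, not part of my actual code.

```julia
# Weighted geometric mean and geometric standard deviation from binned data.
# Assumption: each bucket is represented by its midpoint, and all midpoints are > 0.
function geometric_stats(upper_edges::AbstractVector{<:Real}, counts::AbstractVector{<:Real})
    width = upper_edges[2] - upper_edges[1]      # buckets are fixed width
    mids = upper_edges .- width / 2              # midpoint of each bucket
    n = sum(counts)
    mu = sum(counts .* log.(mids)) / n           # weighted mean of log(x)
    sigma = sqrt(sum(counts .* (log.(mids) .- mu) .^ 2) / n)  # weighted std of log(x)
    return exp(mu), exp(sigma)                   # (geometric mean, geometric stddev)
end

# Hypothetical toy data: 4 buckets with upper edges 2, 4, 6, 8
gm, gsd = geometric_stats([2, 4, 6, 8], [10, 30, 20, 5])
```

With the real data this would be called as `geometric_stats(df.bucket, df.count)`.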
What I’m trying to do:
I’m trying to identify the components of the curve, i.e., the most significant log-normal distributions, via their geometric means and geometric standard deviations.
What I’ve tried:
I’ve tried generating a random Log-normal distribution using something like this:
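using StatsBase  # provides fit(Histogram, ...)
n = sum(df.count)  # df is the DataFrame from above
# log-normal sample = exp(mu + sigma * z), with mu = log(geometric mean), sigma = log(geometric stddev)
nos_n = log(geometric_mean) .+ log(geometric_stddev) .* randn(n)
nos_ln = exp.(nos_n)
# :bucket holds upper edges, so prepend the lower edge of the first bucket
edges = vcat(2 * df.bucket[1] - df.bucket[2], df.bucket)
dist_ln = fit(Histogram, nos_ln, edges)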
This does give me a log-normal distribution, but it doesn’t match the distribution I have in the DataFrame. So I basically keep retrying with smaller sample sizes until I get a distribution that fits inside the original one, then I subtract it from the original and try again with the leftover.
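To make the subtract-and-repeat step concrete, it can be sketched roughly as below. This is only an illustration under stated assumptions: `peel` is a hypothetical helper name, the count vectors are made up, and rather than regenerating smaller samples it simply scales the candidate component down by the largest factor that keeps it at or below the observed count in every bucket.

```julia
# One "peel" step: scale a candidate component so it fits under the observed
# counts in every bucket, then subtract it.
# Assumption: `observed` and `component` are count vectors over the same buckets.
function peel(observed::AbstractVector{<:Real}, component::AbstractVector{<:Real})
    mask = component .> 0                         # buckets where the component has mass
    any(mask) || return 0.0, float.(observed)     # nothing to subtract
    scale = minimum(observed[mask] ./ component[mask])  # largest factor that never overshoots
    scale = min(scale, 1.0)                       # never grow the component
    leftover = observed .- scale .* component
    return scale, max.(leftover, 0.0)             # clamp tiny negative rounding errors
end

observed  = [5, 40, 60, 30, 10]    # hypothetical observed counts
component = [0, 50, 50, 20, 0]     # hypothetical candidate component
scale, leftover = peel(observed, component)
```

One weakness of this step is that the scale factor is set by the single tightest bucket, so a noisy, sparsely populated tail bucket can shrink the whole component.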
My questions:
- Is there a better way to do this in Julia?
- Is there a standard name for what I’m doing?
Thanks for reading this far.
Philip