Fitting a Distribution to existing data


#1

Good $(period_of_day) to all,

Let me preface this question by saying that there is a good chance (p > 0.8) that I don’t know what I’m talking about, so feel free to redirect me. I’ll explain what I have and what I need to do, and you can tell me what keywords I should search on.

What I have:

I have a DataFrame (although it could easily be an Array{Int64, 2}) that holds a data distribution. Column 1 is called :bucket and contains the upper edges of 200 fixed width buckets. Column 2 is called :count and contains the population of each bucket.

In most cases, the distribution is close to, but not exactly a Log-normal distribution. There are times when it is double humped, but I’ve narrowed this down to it being a combination of two Log-normal like distributions (in reality it is almost always a combination of multiple distributions, but typically one of them way outnumbers the others).

I can also determine the geometric mean and geometric standard deviation of this overall distribution

What I’m trying to do:

I’m trying to identify the components of the curve, ie, the most significant Log-normal distributions via their geometric means & geometric standard deviations.

What I’ve tried:

I’ve tried generating a random Log-normal distribution using something like this:

nos_n = randn(sum(df[:count]))    # df is the dataframe from above
nos_n += log(geometric_mean)
nos_n *= log(geometric_stddev)
nos_ln = Float64[1.5 ^ k for k in nos_n]

dist_ln = hist(nos_ln, df[:buckets])

This does give me a log-normal distribution, but it doesn’t match the distribution I have in the dataframe, so I basically keep trying this with smaller datasets until I get a distribution that fits inside the original distribution, then I subtract that from the original, and try again with the left-over.

My questions:

  1. Is there a better way to do this in Julia?
  2. Is there a standard name for what I’m doing?

Thanks for reading this far.

Philip


#2

I believe the name for this sort of thing is a “mixture model” and the Distributions package has utilities for simulating them. It looks like it also might be able to fit a mixture model, but I’m not sure.

EDIT: Fixed link; sorry about that.


#3

Thank you. I’ll look into the Distributions package.


#4

The Distributions package looks promising, but it doesn’t appear to have a Log-Normal distribution. Would you recommend I use the Normal distribution with some modifications?


#5

Is this it?


#6

Yes. Apparently I need to give my eyes a break right about now.


#7

Ok, so I’ve tried a bunch of things, but perhaps I’m not doing it correctly. This is what I have:

julia> using Distributions

julia> d = fit(LogNormal, df[:count])
Distributions.LogNormal{Float64}(μ=9.458033679099898, σ=1.2790898894673974)

This gets me a distribution, but it’s only taking into account the population of each bucket and not the actual bucket value, so the distribution range is 0.0:Inf, which is not actually true (my data runs from 0:20000.

I tried wrapping this in a Truncated(), but then it gives me the original distribution, just truncated rather than scaled to fit within my range.

The question is, how do I tell fit to use my range?

Note that I’ve also tried this:

julia> d = fit(LogNormal, df[:bucket], df[:count])

Since I figured I should use the x value and use count as a weight, but I get this error:

ERROR: suffstats is not implemented for (Distributions.LogNormal{T<:Real},DataArrays.DataArray{Int64,1},DataArrays.DataArray{Int64,1}).
 in error at julia/lib/julia/sys.dylib
 in fit at .julia/v0.4/Distributions/src/genericfit.jl:15

I get the same if I wrap the dataarrays in Array{Float64}()

Does anyone have any ideas on what I’m doing wrong?

Also, once I do get the fit correctly, how do I determine how close it is to my actual distribution? For now I’m using rand(d, n) to generate n random numbers following the distribution, then using hist on the resulting array, and comparing it to my starting distribution. The results are not very close.


#8

I don’t mean to be unhelpful, but it seems like your major problems aren’t related to Julia per se, but are broad confusions about statistical theory. Although I totally understand why you may not see it as a serious option, I’d encourage you to hold off on doing this kind of work until you (a) feel comfortable with concepts like maximum likelihood estimation and mixture models (and understand why MLE’s for mixture models often behave really badly) or (b) find a collaborator who understands those topics in some depth.

As is, I’m pretty worried that you’re essentially reinventing modern statistical theory by playing around with tools that have major, but subtle, weaknesses – but you won’t ever discover those weaknesses by re-analyzing your current dataset. Right now you’re learning a lot because the errors you’re getting are so extreme that they’re obvious to you immediately, but there’s a lot of non-obvious problems that might come up as well and I worry that you won’t discover them by writing Julia code.


#9

Seems both harsh and unhelpful?

I think if in most cases the distributions are close to log-normal, you should just stick to that, unless you have strong pre-hoc reasons (based on your specific knowledge of the problem in question). Otherwise you might be over-fitting.


#10

That’s why I’m on this forum :slight_smile:

Half the statistics I know I learnt from Wikipedia and the other half from reading other people’s source code.

As for the mixture model, at the moment I’m using a pure Log-normal distribution, but limited in range. It appears that the fit functions expect a vector of samples rather than a pre-computed set of bins, and perhaps that’s where I’m getting stuck.


#11

Thanks. In my first example (since I’m trying to figure out how this works), I’m using an actual log-normal distribution, however the two things that are tripping me up are:

  1. My data is restricted in range to 1-20000 rather than 0-Inf and
  2. I don’t have a vector of samples, I have a list of 200 precomputed bins, each of width 100, and from what I can tell, the fit functions expect samples rather than bins, though I could be reading it wrong.

#12

Is the range restriction an artifact of the binning/sampling process or an actual theoretical limit? Does it significantly truncate the tail in the data?

I poked around a bit and there’s an R package you might want to take a look at.

https://cran.r-project.org/web/packages/bda/bda.pdf.


#13

Hi bluesmoon,
if you already have binned data, you could simply fit the histogram data (i.e. bin count vs. bin center) to the supposed lognormal function. curve_fit from LsqFit.jl can be used for that. However, whether that approach fulfills your definition of statistical rigor, depends on your field – mathematicians will probably cringe…


#14

it’s an actual limit, the data starts tapering off around the 15000 mark, most bins after that have just 1 or 2 data points.

Thanks for the pointer to BDA, I’ll look into it, though I’ll have to weigh that against having to learn R to do this (assuming that’s necessary).


#15

Thank you. This looks promising. I’ve managed to get it working with artificial data, ie, this works:

using LsqFit

# Log-normal distribution
model(x, p) = 1 ./ (x * p[2] * √2π) .* exp( - (log(x) - p[1]).^2 / 2p[2]^2 )

params = [8.19265, 2.15105]
xdata = 1:100:10000001

ydata = model(xdata, params)

fit = curve_fit(model, xdata, ydata, [5., 1.])

fit.param
2-element Array{Float64,1}:
 8.19265
 2.15105

However, sum(ydata) gives me 0.009887721132178428 while I would expect the sum of the PDF to tend to 1. Am I doing something wrong or is my assumption incorrect?

To give you an indication of where I’m heading, my bins contain a count of items, so they range from 0 items to about 100,000 items. I need to normalize this in order to fit the curve, so my first attempt was to divide the counts by the total, however this doesn’t give me the parameters I expect.


#16

@bluesmoon: the sum of your ydata is close to 0.01 because you chose bins of size 100 when you built xdata. If you pick smaller bins of width 1, you get what you expect:

julia> xdata = 1:1:10000001;

julia> ydata = model(xdata, params);

julia> sum(ydata)
0.9998728357968255

For a histogram with bins of width 100, you can normalize it to an area of 1 by dividing the frequencies (bin counts) by the total count * 100.


#17

Thanks, this makes sense.