# Need advice when data almost (but not quite) fit log normal dist

This isn’t really a question about Julia so maybe this should be tagged as Offtopic, but I’m working with a data set that has several series of data that all closely follow a log normal distribution, but not quite. The plot below shows three of these series: the dots are a plot of an empirical CDF of each series (`Distributions.ecdf`) while the lines are the plotted CDFs from fitted log normal distributions (`Distributions.fit(LogNormal, x)`).

When sampling from the fitted distributions, the mean/median values of the samples are consistently greater than they should be because of the consistent error in the fit (as seen above).
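
For anyone who wants to reproduce this kind of check, here is a minimal sketch. It assumes a synthetic stand-in series `x` (the real data are private), so the parameters `5.0` and `1.5` are made up. It compares the mean and median of the data against those of the fitted distribution and of samples drawn from it.

```julia
using Distributions, Statistics, Random

Random.seed!(1)
# Hypothetical stand-in for one series; the real data are private.
x = rand(LogNormal(5.0, 1.5), 10_000)

d = fit(LogNormal, x)      # maximum-likelihood fit, as in the post
sim = rand(d, 100_000)     # samples drawn from the fitted distribution

# Compare location summaries of the data, the fitted model, and the samples.
println("data:   mean = ", mean(x),   ", median = ", median(x))
println("fitted: mean = ", mean(d),   ", median = ", median(d))
println("sim:    mean = ", mean(sim), ", median = ", median(sim))
```

With truly log-normal input the three rows should roughly agree. A consistent gap on real data points to a systematic misfit rather than sampling noise.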

My questions are:

1. Is it better/easier to attempt to transform the data so that they better fit the distribution, or is it better to try to tweak the cumulative distribution function so that it fits the data?

2. If implementing a custom distribution is the recommended approach, can anyone provide some guidance as to how I would go about tweaking the log normal CDF? I’m at the boundaries of my stats/math knowledge here and I don’t know where to start really. I have more series than the 3 shown above and the error in the fit looks the same for all of them - the slope of the curve needs to be a bit flatter at lower values and then there is a point at which the slope of the tails needs to be steeper.

The reason that I want to do this is that I have additional series of similar data that don’t have nearly as many observations so I’d like to be able to run simulations/make predictions about those data sets knowing that they will follow this same shape as more observations become available (basically, I need to be able to predict what the future observations might look like).

Are you positive that the data come from a theoretical log-normal? Have you considered the literature on extreme value theory? See JuliaEarth/ExtremeStats.jl on GitHub (extreme value statistics in Julia).

No, not at all - it was just the closest fit I could find. I’m not familiar with extreme value theory so I will check this out - thank you very much!!

In keeping with the extreme value literature suggestion, you can find more material at Researchers.One (Nassim Taleb, Technical Incerto).

1. In general, if you do have a log-normal distribution then you need a lot of example points in order to find a correctly parameterised fit.
2. I would expect your sample mean from the fitted distro to be greater than the value in the original observations - based on your description of the fits.

@daveh19 thanks! You are correct, they are consistently greater (fixed in the OP).

Then that’s not such a bad situation. For a log-normal (or extremal) distro even a single extremal observation will move the mean by a lot, so it’s not so bad to be overshooting what your current data shows you…
This depends on your use case however.
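
A rough illustration of that sensitivity, assuming a synthetic log-normal sample and one made-up large observation (neither comes from the original data):

```julia
using Distributions, Statistics, Random

Random.seed!(2)
# Hypothetical sample of 500 "typical" amounts.
x = rand(LogNormal(5.0, 1.5), 500)

# Append one hypothetical multi-million transaction.
x_big = vcat(x, 2_000_000.0)

println("mean before:   ", mean(x),   "  after: ", mean(x_big))
println("median before: ", median(x), "  after: ", median(x_big))
```

The mean jumps by a large factor while the median hardly moves, which is why a fitted heavy-tailed model can sit above the sample mean without necessarily being wrong.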

Not sure whether this is an option but could you test which of your (empirical) distributions the new data comes from, and then just sample from the ECDF that most closely resembles the data you have up to that point?
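
One way this could look in practice is sketched below. It assumes hypothetical reference series named `"A"` and `"B"` plus a short `new_series`, and it uses the two-sample Kolmogorov-Smirnov distance as the similarity measure, which is my choice rather than anything prescribed in the thread.

```julia
using StatsBase, Distributions, Random

Random.seed!(3)
# Hypothetical long reference series and one short new series.
reference = Dict(
    "A" => rand(LogNormal(4.0, 1.0), 5_000),
    "B" => rand(LogNormal(6.0, 1.5), 5_000),
)
new_series = rand(LogNormal(6.0, 1.5), 40)

# Two-sample KS distance: the largest vertical gap between the two ECDFs.
# Both are step functions, so checking the pooled data points is sufficient.
function ks_distance(x, y)
    F, G = ecdf(x), ecdf(y)
    pts = vcat(x, y)
    return maximum(abs.(F.(pts) .- G.(pts)))
end

labels = collect(keys(reference))
dists = [ks_distance(new_series, reference[l]) for l in labels]
best = labels[argmin(dists)]

# Sampling from an ECDF amounts to resampling the observations with replacement.
simulated = rand(reference[best], 1_000)
println("closest reference series: ", best)
```

Note that resampling can never produce a value larger than the largest one already observed, which matters if the tail is the main concern.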

That would work for some of the series because there are a couple that, when plotting their ECDF as I show above, lie pretty much on the same line…for most of them though, the ECDF is distinct as the three shown above are.

I like the extreme value theory tips - this seems really promising based on what I’ve read so far. These data series have a lot of zeros, then the bulk of the data lie within a pretty narrow range, but then you have these insanely long tails that cannot be ignored. I’ve been struggling to figure out what to use as a cutoff threshold for outliers and, after reading up on EVT, I now know why : )

The data are actually monetary values. It’s private data so I can’t disclose much about it, but you can almost think of them as purchase amounts from different stores selling different categories of goods/services (for the different series). Lots of people don’t buy anything, so there are lots of \$0.00. Most people who do buy something don’t spend very much (maybe between \$20 and a few thousand), but then there are always a fair number of really large transactions that span anywhere from a few tens of thousands to hundreds of thousands to values in the millions.
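
Given that structure, one possible starting point is a two-part ("hurdle") model: estimate the share of zeros separately and fit the log-normal to the positive amounts only. This is a sketch on synthetic data, with the 30% zero share and the log-normal parameters invented for illustration. It handles the zeros only and does not by itself fix the tail misfit described earlier.

```julia
using Distributions, Statistics, Random

Random.seed!(4)
# Hypothetical series: about 30% zeros, the rest log-normal amounts.
x = [rand() < 0.3 ? 0.0 : rand(LogNormal(5.0, 1.5)) for _ in 1:10_000]

p0 = mean(x .== 0)                  # estimated probability of no purchase
d_pos = fit(LogNormal, x[x .> 0])   # fit only the positive amounts

# Simulate: zero with probability p0, otherwise a draw from the positive part.
simulate(n) = [rand() < p0 ? 0.0 : rand(d_pos) for _ in 1:n]
sim = simulate(100_000)

println("share of zeros: data = ", p0, ", sim = ", mean(sim .== 0))
println("median of positives: data = ", median(x[x .> 0]),
        ", sim = ", median(sim[sim .> 0]))
```

The zeros have to be separated out in any case, because the logarithm of zero is not finite and a log-normal cannot be fitted to data that include it.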

I have no idea but I bet jewelry store transactions look a lot like this data set