I have discrete data from a continuous field.
The probability distribution of the data is not known, but it has only one peak, which I guess is called the “mode” of the pdf. The normalized histograms of the data look like skew normal distributions.
Is there any package that I could use that would straightforwardly give me an estimation for the most likely value of the pdf?
The mode function from the Distributions module doesn’t seem to work for my purpose: it outputs the most repeated value of my set, which most often has no repeated values at all (the data are floats).
So far I’ve been fitting a histogram with StatsBase and manually tinkering with the number of bins and the edges until the bins are thin enough for me to “manually” read off a plausible value for the mode, with the bin width small relative to the value itself.
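For reference, this is roughly what that looks like (a rough sketch; the data and bin count below are made up):

```julia
using StatsBase

# placeholder sample standing in for my float-valued data
x = rand(5_000)

# fit a histogram and tweak nbins/edges by hand
h = fit(Histogram, x; nbins = 200)

# bin with the largest count, and its midpoint as a crude mode estimate
i = argmax(h.weights)
edges = h.edges[1]
(edges[i] + edges[i+1]) / 2
```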
I was wondering if there is a more straightforward way to do that using any of the available packages out there.
You could try fitting a parametric family (eg a skew normal, or an overdispersed skew normal), and then obtain the mode either in closed form or by maximizing the PDF. The fitting can be done using maximum likelihood, or maximum a posteriori (which would provide some regularization).
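Something along these lines, for instance (the LogNormal family here is just a stand-in assumption; pick whatever matches your data):

```julia
using Distributions

# placeholder for a positive, right-skewed sample
x = rand(LogNormal(1.0, 0.5), 10_000)

# maximum-likelihood fit of a parametric family, then read off its mode
d = fit_mle(LogNormal, x)   # Gamma, Weibull, ... work the same way
mode(d)                     # closed form, exp(μ - σ²) for a LogNormal
```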
As a shortcut, you can just transform the data (eg take logs, sqrt, or even Box-Cox), and then fit a normal using ML (which I believe is built into Distributions.jl for Normal).
Note, however, that the mode is not invariant under bijections, and thus may not be a robust statistic (but this depends on context).
Is your data 1D? From your description of the problem, the most natural solution would be to fit a “skewed normal” distribution with Distributions.jl and then access its mode, which in some cases has a closed form. If these distributions are skewed because of extreme values, consider the ExtremeStats.jl package. I didn’t implement the most stable fitting methods (e.g. method of moments), but the implemented maximum likelihood may work.
Thank you all for the suggestions. The data is 1D.
It does seem the most straightforward solution would be to fit a skewed normal distribution, but it doesn’t seem to be implemented in Distributions.jl?
Would this distribution have a different name that I am not aware of?
Is it implemented in a different package?
@yakir12, that is a witty, hacky way to do it. I like it… hahaha. I’ll give it a try and see how it works out.
@Tamas_Papp, I’m calculating one of those quick and dirty coefficients for engineering applications, so I don’t think I have to worry much about its robustness.
Then, to reiterate, simply “unskewing” the distribution and fitting a normal using Distributions.jl could serve you well. Eg x \mapsto \log(x) (see the sketch below), or variations on this theme, eg x \mapsto \log(x + a) for some a (hard to recommend anything precise without the data). Another popular transformation is the square root.
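A minimal sketch of that, assuming positive data (the Gamma sample only stands in for the real thing):

```julia
using Distributions

# made-up skewed data standing in for the real sample
x = rand(Gamma(2.0, 3.0), 5_000)

y = log.(x)              # "unskew"; log.(x .+ a) or sqrt.(x) are the variations above
d = fit_mle(Normal, y)   # ML fit of a Normal is built into Distributions.jl
mode(d)                  # mode (= mean) in the transformed space
exp(mode(d))             # naive back-transform; see the caveat about bijections above
```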
Just a comment: estimating a mode for a continuous distribution is generally much, much more complicated than it seems on first pass.
Some of the solutions described here involve a transformation of the data, then taking the mode in the transformed space, and presumably transforming this back to the original space. Note that in general, there’s no reason to believe this will actually find the mode in the original space! As a simple example, consider the exponential distribution. This is a non-negative distribution with a mode at 0, yet if you apply a normalizing transformation, take the mode and then transform back, you will get a very different mode, even when using the exact distribution of the data.
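To make that exponential example concrete, here is a quick numerical check (using Optim.jl for the 1D maximization is my choice, just for illustration):

```julia
using Distributions, Optim

d = Exponential(1.0)                      # the mode of X is exactly 0

# density of Y = log(X) by change of variables: log f_Y(y) = logpdf(d, exp(y)) + y
neg_logdens(y) = -(logpdf(d, exp(y)) + y)

res = optimize(neg_logdens, -10.0, 10.0)  # maximize the transformed density
y_mode = Optim.minimizer(res)             # ≈ 0.0, the mode in log space
exp(y_mode)                               # ≈ 1.0, not 0: the back-transformed "mode" is wrong
```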
Now add noise from only seeing a random sample and things get much, much worse. Basically, with something like a non-parametric approach, the estimated mode will vary wildly not only with small changes in the data but also with the smoothing parameters selected.
If there’s not a very strong reason for using the mode, I’d actually strongly suggest trying to find a more stable metric, such as the median. For many distributions, the median and mode will be close. For the distributions in which they are very different (such as exponential), I have a lot of trouble seeing a strong argument for using the mode over the median.
Another thing to consider: if the median is meaningfully different than the mode, then at least 50% of the data is meaningfully different than the mode. So it becomes very questionable to use the mode as a representation of what will typically be observed.