Statistic for differentiating two distributions?

I need some advice on choosing a proper statistical test for a project I will be starting soon. My expectation is that the data will consist of two distinct datasets with similar (normal?) distributions, but different means and standard deviations. Each dataset will also have <100 samples. I need a way to test/show that they are different, so my question is: what is the best test (or tests) for a problem like this? The only one I came across that seemed reasonable is a two-sided K-S test, though I am not certain it will tell me what I want. I expect my data to look similar to the plot below, only with far fewer samples. Thank you in advance for any guidance you provide!

[image: normal_plot]
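
For reference, this is roughly how I understand the two-sample K-S test would be run with HypothesisTests.jl (just a sketch; the data below are made up to look like what I expect, with fewer than 100 samples per group):

```julia
using HypothesisTests

# Placeholder data standing in for the two datasets I expect
x = 10 .+ 2 .* randn(60)   # first dataset: mean ≈ 10, sd ≈ 2
y = 12 .+ 3 .* randn(60)   # second dataset: mean ≈ 12, sd ≈ 3

# Two-sided, two-sample Kolmogorov-Smirnov test
ks = ApproximateTwoSampleKSTest(x, y)
pvalue(ks)   # small p-value => evidence the samples come from different distributions
```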

If you assume they are normal, then a two-sample t-test.
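
Something like this, for example (a rough sketch with placeholder data; Welch's version since the standard deviations are expected to differ):

```julia
using HypothesisTests

x = 10 .+ 2 .* randn(50)   # placeholder data for the first sample
y = 12 .+ 3 .* randn(50)   # placeholder data for the second sample

# Welch's two-sample t-test: does not assume equal variances;
# EqualVarianceTTest is the alternative if the variances are comparable.
t = UnequalVarianceTTest(x, y)
pvalue(t)
```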

There are various other tests, e.g. Anderson-Darling (also implemented in HypothesisTests.jl). Each of these tests is sensitive to different aspects of the distributions; you can find power comparisons of them, e.g.

https://www.researchgate.net/publication/267205556_Power_Comparisons_of_Shapiro-Wilk_Kolmogorov-Smirnov_Lilliefors_and_Anderson-Darling_Tests
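
For example, the k-sample Anderson-Darling test is essentially a one-liner (a minimal sketch; the data below are just placeholders):

```julia
using HypothesisTests

x = 10 .+ 2 .* randn(60)   # placeholder samples
y = 12 .+ 3 .* randn(60)

# k-sample Anderson-Darling test (here k = 2); more sensitive to
# differences in the tails than the K-S test
ad = KSampleADTest(x, y)
pvalue(ad)
```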

You just have to pick the aspects you care most about; there is no automatic solution. If you can simulate/bootstrap, it is also quite easy (though computationally more intensive) to come up with custom tests.
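
To give an idea of what such a custom test might look like, here is a sketch of a simple permutation test; the statistic (absolute difference in means) and the placeholder data are just for illustration, swap in whatever aspect you care about:

```julia
using Random, Statistics

# Permutation test: compare the observed statistic to its distribution
# under random relabelling of the pooled samples.
function permutation_pvalue(x, y; nperm = 10_000,
                            stat = (a, b) -> abs(mean(a) - mean(b)))
    observed = stat(x, y)
    pooled   = vcat(x, y)
    n        = length(x)
    count(1:nperm) do _
        shuffled = shuffle(pooled)
        stat(view(shuffled, 1:n), view(shuffled, n+1:length(pooled))) >= observed
    end / nperm
end

x = 10 .+ 2 .* randn(50)   # placeholder data
y = 12 .+ 3 .* randn(50)
permutation_pvalue(x, y)
```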

As usual, the more you know about the distributions, the more power the test will have. E.g. if you can robustly transform to normality (e.g. Box-Cox), then @xiaodai’s suggestion should be best.
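
If it helps, here is a rough sketch of a Box-Cox-style transformation, choosing λ by profile likelihood on a grid (made-up skewed data, arbitrary grid, positive values assumed):

```julia
using Distributions, Statistics

# Box-Cox transform: log for λ = 0, (x^λ - 1)/λ otherwise
boxcox(x, λ) = λ == 0 ? log.(x) : (x .^ λ .- 1) ./ λ

# Profile log-likelihood of the normal model for the transformed data,
# including the Jacobian term (λ - 1) * Σ log(xᵢ)
function boxcox_loglik(x, λ)
    y = boxcox(x, λ)
    -length(x)/2 * log(var(y; corrected = false)) + (λ - 1) * sum(log.(x))
end

x  = rand(LogNormal(0, 0.5), 80)                  # placeholder skewed data
λs = -2:0.05:2
λ̂  = λs[argmax([boxcox_loglik(x, λ) for λ in λs])]
y  = boxcox(x, λ̂)                                 # approximately normal, hopefully
```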

My expectation going in is that the dominant error will be randomly/normally distributed, but there may be some skew (not just in the strictly defined sense) in the data that I do not foresee. I will look into the two-sample t-test if the data appear to be normally distributed, but I think I need an option that does not require that assumption.

That paper looks great, thanks! I was going to try to search for a comparison like that, but I had no idea where to begin. As you say, I will probably need to wait for the actual data to come in before I decide for sure on a plan; I was just hoping to get a jump on things. For the future, my goal was to be able to take a single new data point, compare it to the two distributions (assuming they are in fact different), and decide which one it belongs to.

If that’s your end goal, it should be a simple exercise in Bayesian inference, possibly without a need to actually answer the question whether the two distributions are different.

Specifically, suppose you have a lot of data points and can reliably (i.e. with small posterior uncertainty) estimate two parametric distributions f(x; \theta) and g(x; \phi). The update to the odds ratio of a new point x' coming from f is just f(x'; \theta) / g(x'; \phi), if we pretend that we have so much data that it is not going to update \theta or \phi significantly. Fully taking posterior uncertainty into account complicates this, but only technically.
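
A minimal sketch of that plug-in version with Distributions.jl (placeholder data and names; normal fits just as an example of the parametric families f and g):

```julia
using Distributions

# Placeholder data for the two groups
a = 10 .+ 2 .* randn(80)
b = 12 .+ 3 .* randn(80)

# Fit a parametric model to each group (the plug-in approximation:
# pretend the fitted parameters are known exactly)
f = fit(Normal, a)   # stands in for f(x; θ)
g = fit(Normal, b)   # stands in for g(x; ϕ)

# Likelihood ratio for a new point x′ coming from f rather than g
x′ = 11.0
lr = pdf(f, x′) / pdf(g, x′)

# Multiply by the prior odds (here taken as 1) to get posterior odds
posterior_odds = 1.0 * lr
```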

That is an interesting idea; I was wondering if I could use some kind of modeling instead of, or more likely in addition to, the idea I presented earlier. I will have well-defined errors on each point that I assume I will need to take into account, but I can cross that bridge when I come to it.