Statistic for differentiating two distributions?

I need some advice on choosing a proper statistical test for a project I will be starting soon. I expect the data to consist of two distinct datasets with similar (normal?) distributions but different means and standard deviations. Each dataset will have fewer than 100 samples. I need a way to test/show that they are different, so my question is: what is (are) the best test(s) for a problem like this? The only one I came across that seemed reasonable is a two-sided K-S test, though I am not certain it will tell me what I want. I expect my data to look similar to the plot below, only with far fewer samples. Thank you in advance for any guidance you provide!


If you assume they are both normal, then use a two-sample t-test (Welch's version, since you expect the standard deviations to differ).
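As a rough illustration of what that test computes, here is a minimal sketch in plain Python (standard library only). The function name `welch_t` is made up for this example. It returns only the t statistic and the Welch-Satterthwaite degrees of freedom; the p-value would then come from a t distribution with that many degrees of freedom, which a statistics package (for example `scipy.stats.ttest_ind` with `equal_var=False`) handles for you.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Hypothetical helper for illustration; assumes each sample has at
    least two points and nonzero variance.
    """
    na, nb = len(a), len(b)
    # Sample variances (n - 1 in the denominator).
    va, vb = statistics.variance(a), statistics.variance(b)
    # Squared standard error of the difference in means.
    se2 = va / na + vb / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

Note that this only tests for a difference in means; two samples with equal means but different spreads would not be flagged.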

There are various other tests, e.g. Anderson-Darling (also implemented in HypothesisTests.jl). Each of these tests compares different aspects of the distributions; you can find discussions of this, eg

You just have to pick the aspects you care most about; there is no automatic solution. If you can simulate/bootstrap, it is also quite easy (though computationally more intensive) to come up with custom tests.
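To make the "custom test" idea concrete, here is a minimal sketch of a permutation test in plain Python. The names `permutation_test` and `abs_sd_diff` are made up for this example, and the statistic (absolute difference in standard deviations) is just one possible choice; you would plug in whatever aspect you care about, as long as larger values mean "more different".

```python
import random
import statistics


def permutation_test(a, b, statistic, n_perm=2000, seed=0):
    """Approximate p-value for H0: both samples come from the same distribution.

    `statistic(a, b)` must return a number that grows as the samples differ.
    """
    rng = random.Random(seed)
    observed = statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        if statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never 0.
    return (hits + 1) / (n_perm + 1)


def abs_sd_diff(a, b):
    """Example statistic: absolute difference in sample standard deviations."""
    return abs(statistics.stdev(a) - statistics.stdev(b))
```

Usage would look like `permutation_test(sample1, sample2, abs_sd_diff)`. With fewer than 100 points per sample, a few thousand permutations should run quickly.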

As usual, the more you know about the distributions, the more power the test will have. E.g., if you can robustly transform to a normal (e.g. with Box-Cox), then @xiaodai’s suggestion should be best.

My expectation going in is that the dominant error will be randomly/normally distributed, but there may be some skew (not just in the strictly defined sense) in the data that I do not foresee. I will look into the two-sample t-test if the data appear to be normally distributed, but I think I need an option that does not require that assumption.

That paper looks great, thanks! I was going to try to search for a comparison like that, but I had no idea where to begin. As you say, I will probably need to wait for the actual data to come in before I decide for sure on a plan. I was just hoping to get a jump on things. For the future, my goal is to be able to take a single new data point, compare it to the two distributions (assuming they are in fact different), and decide which one it belongs to.

If that’s your end goal, it should be a simple exercise in Bayesian inference, possibly without a need to actually answer the question whether the two distributions are different.

Specifically, suppose you have a lot of data points and can reliably (i.e. with small posterior uncertainty) estimate two parametric distributions f(x; \theta) and g(x; \phi). The update to the odds of a new point x' coming from f rather than g is then just the likelihood ratio f(x'; \theta)/g(x'; \phi), if we pretend that we have so much data that x' is not going to update \theta or \phi significantly. Fully taking posterior uncertainty into account complicates this, but only technically.
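A minimal sketch of that plug-in calculation in plain Python, assuming (only for illustration) that both f and g are normal distributions fitted to the two datasets. The function name `posterior_prob_f` and the `prior_f` parameter are made up for this example; `NormalDist` is from the Python standard library.

```python
from statistics import NormalDist


def posterior_prob_f(x_new, f, g, prior_f=0.5):
    """Probability that x_new came from f rather than g.

    Plug-in version: f and g are treated as known exactly, so the
    uncertainty in their fitted parameters is ignored.
    """
    # Likelihood ratio f(x'; theta) / g(x'; phi).
    # Note: far out in the tails g.pdf can underflow to 0.0; working
    # with log-densities would be more robust there.
    likelihood_ratio = f.pdf(x_new) / g.pdf(x_new)
    prior_odds = prior_f / (1 - prior_f)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# f and g could be fitted from the two datasets, e.g.
#   f = NormalDist.from_samples(data_f)
#   g = NormalDist.from_samples(data_g)
```

With equal prior odds, a point exactly halfway between two equal-width normals gets probability 0.5, and the probability moves toward 0 or 1 as the point approaches one of the means.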

That is an interesting idea. I was wondering if I could use some kind of modeling instead of, or more likely in addition to, the idea I presented earlier. I will have well-defined errors on each point that I assume I will need to take into account, but I can cross that bridge when I come to it.