Why is statistics so difficult?

Statistics is hard because it’s not mathematics, but it looks like it.

Statistics is theory + data, using mathematics. So as soon as you take the standard model of particles and their interactions and add some observations of collider events, you’re doing physics + data = statistics. As soon as you take a theory about gravitational waves and add LIGO data, you’re doing physics + data = statistics. As soon as you think of a way you expect people to behave when trades are made available in a market and you add some observed trades, you’re doing economics + data = statistics. And as soon as you’ve got a theory about how chemicals signal in cells and you add some images of fluorescence, you’re doing biology + data = statistics. Basically, as soon as you’re not theorizing alone, you’re doing statistics. So statistics is all of science, and it would be surprising for all of science to be an easy thing to do.

5 Likes

I like their “Regression and Other Stories”. Rather accessible to non-experts, such as myself.

5 Likes

This is super important. Once you realize that the reason you are trying so hard to make a statistic pivotal is that people only had tables for a few key distributions, a lot of the annoying algebra in many problem sets makes more sense.

Another anachronism that makes statistics hard, imo, is the reliance on linear algebra. I can’t prove this, but I would bet that the reason so much of econometrics is framed in terms of linear algebra rather than just simple optimization is that, early on, computers were really good at matrix math. When all you have is a hammer…

There was a Twitter thread recently about how CS grad students do not have anywhere near enough linear algebra training to really understand graduate-level coursework. This is true in stats and econometrics as well, which probably makes those subjects harder than they should be.

3 Likes

One of my major problems is that I did not get any training in numerical/computational linear algebra. So, e.g., when I first encountered negative eigenvalues associated with a covariance matrix, my training just told me that this is not possible. I also had no special training on algorithms for linear algebra, but this is mostly solved by Julia now.
However, I am still happy about my linear algebra theory training. (I studied econ with a math major.)
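To make the numerical issue concrete, here is a minimal sketch (with made-up, nearly rank-deficient data) of how a computed covariance matrix can fail to be positive semidefinite in floating point, and one common repair:

```julia
using LinearAlgebra, Statistics

# Nearly collinear data: the true covariance is rank-deficient, so
# floating-point round-off can push the smallest computed eigenvalues
# slightly below zero, even though a covariance matrix is positive
# semidefinite in exact arithmetic.
n, p = 200, 50
X = randn(n, 3) * randn(3, p)   # rank-3 data embedded in 50 dimensions
S = cov(X)

λ = eigvals(Symmetric(S))
println("smallest eigenvalue: ", minimum(λ))  # often a tiny negative number

# One common repair: clamp the negative eigenvalues at zero and reassemble.
F = eigen(Symmetric(S))
S_psd = F.vectors * Diagonal(max.(F.values, 0.0)) * F.vectors'
```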

I think the usual argument for linearity is a first-order expansion. I can’t resist the urge to link another classic:

4 Likes

That’s an excellent short article. Thanks for that. Going to show my students that one.

1 Like

Delightful! What mastery of language.

In my opinion and experience, statistics (and econometrics) is often taught and learned in a very mechanical way. Statistics seems hard and unintuitive not because of a lack of math, but because it is taught using only math. There is a huge emphasis on the “how” and very little on the “why”. It seems like that because statistics is indeed a branch of mathematics, and almost only its mechanics are taught. Many texts and online tutorials read as a long sequence of theorem/proof/theorem/proof, etc.

It’s hard to be a good statistician without understanding the philosophical difference between the Bayesian and frequentist approaches. You should understand what testing means and what can go wrong with it. What is significance? Why do we insist on consistency? What are risk or loss functions? What does efficiency mean? Not just their definitions, but why we define them the way we do.
One needs to understand what misspecification means, what overfitting is, and what in-sample/out-of-sample prediction is. Again, not the definitions, but why we define things that way and what will or can go wrong with another definition or approach.
Working through proofs and understanding where in the proof each and every assumption is used is important. Understanding why a proof breaks if we lift, for example, compactness, continuity, uniformity, or any other assumption is important. Seeing counterexamples is a great learning tool.
In my opinion, it is hard to get the “philosophy”, the “big picture” view, context, comparisons, and such from a single book or paper. There is no replacement for a teacher who is a researcher in the field and who knows the why, not only the how: someone with a certain conviction who wants to convince you of what they believe, whom you can then watch argue with someone from the other camp or with a different point of view.
Don’t get me wrong, the “how” is SUPER important. Statistics really is a field of math. But math is also very philosophical. You don’t really get it if you don’t understand the why. There is a reason our degree is called a Ph.D., a doctorate of philosophy in the field of X.
Larry Wasserman is an amazing teacher, and I had the pleasure of hearing him lecture. CMU should have his videotaped lectures, I think.

2 Likes

Perhaps an anachronism, but I once browsed through Harald Cramér’s book from 1946 (Mathematical Methods of Statistics, Princeton Mathematical Series No. 9), and that really made me appreciate linear algebra and the matrix description :-).

1 Like

One can say that Cramer rules…

4 Likes

I am a user of statistical tools (in behavioral neuroscience research), not a developer of new techniques or packages. I like to have an intuition for the tools I use, but at the end of the day, I put my trust in libraries/packages that are well regarded (e.g. lme4, mc-stan, brms, etc.).
It’s essential to be clear about what you are trying to achieve when you consider a statistical (or analytic) tool. I really like Andrew Gelman’s blog for getting a dose of reality when it comes to stats. If your primary goal is hypothesis testing (e.g. to publish a paper in a scientific journal) and you want to make sure you are doing “the right thing”, the truth is that if your effect is robust and large, it doesn’t matter much which tool you use. If your effect is small and you need to do a lot of work to find some test that gives you “significance”, then, unfortunately, you are doing several bad things (e.g. p-hacking). That said, I think if you follow these rules, you will avoid a lot of pain (in no particular order).

  1. Use permutation tests when possible. Don’t trust the p-values from an ANOVA or GLM. Shuffle the label of interest in your data 1000 times and use that as your null distribution; with fast computers, there is no reason to assume that your data come from a specific analytic distribution. (A minimal sketch follows the list.)
  2. Use (generalized) linear mixed models (like lme4, MixedModels.jl, brms, or rstanarm). Most “classic” statistical tests can be described as a linear (or generalized linear) model; see “Common statistical tests are linear models (or: how to teach stats)”. But combine these with permutation tests or nested models.
  3. Use synthetic data. Create data where you know the ground truth and make sure you can recover the generative parameters (see the second sketch after the list).
  4. Do things multiple ways. Is the conclusion of your paper the same regardless of the statistical tool you used? Then you probably don’t have to worry too much about your choices.
  5. Use cross-validation when possible. This is a huge topic, and there are some cases where cross-validation can be imperfect, but it’s generally better than AIC or BIC and easier to understand than things like MDL.
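Here is a minimal sketch of rule 1, assuming two hypothetical groups `a` and `b`; the observed difference in means is compared against a null distribution built by shuffling the group labels:

```julia
using Random, Statistics

Random.seed!(42)
a = randn(30) .+ 0.5   # made-up "treatment" measurements
b = randn(30)          # made-up "control" measurements

observed = mean(a) - mean(b)
pooled = vcat(a, b)
n_a = length(a)

# Null distribution: differences in means after shuffling the group labels.
null = map(1:10_000) do _
    shuffled = shuffle(pooled)
    mean(shuffled[1:n_a]) - mean(shuffled[n_a+1:end])
end

# Two-sided p-value: how often a label-shuffled difference is at least as
# extreme as the observed one.
p = mean(abs.(null) .>= abs(observed))
println("observed difference = ", round(observed; digits = 3), ", p ≈ ", p)
```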
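And a minimal sketch of rule 3, with a made-up linear model whose ground-truth coefficients we try to recover by least squares:

```julia
using Random

Random.seed!(1)
n = 1_000
β_true = [2.0, -0.5]             # ground-truth coefficients
X = [ones(n) randn(n)]           # intercept plus one predictor
y = X * β_true + 0.3 * randn(n)  # noisy synthetic observations

β_hat = X \ y                    # ordinary least-squares fit
println("true: ", β_true, "  estimated: ", round.(β_hat; digits = 3))
```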

If you want to use Bayesian stats, Michael Betancourt’s writing is excellent.

12 Likes

My personal experience of that struggle is that, fundamentally, stats is an applied math topic that:

  • is taught from a young age, when one doesn’t yet have the theoretical math toolkit;
  • is therefore almost always taught using combinatorics to illustrate the concepts;
  • is never taught alongside an epistemology class.

Statistics is (to me) an intellectual workflow where it is easy to get lost between theoretical assumptions (e.g. the probability distribution of an event), theoretical facts (e.g. the central limit theorem), practical assumptions/decisions (how to design an experiment), and practical facts. I would have loved to have even just a few hours of teaching on that intellectual workflow.

Another problem with stats is similar to the problem with computer science: most scientists specialise in a domain (which is neither of the two) that absorbs most of their time, yet they have to use stats and write code, which they will never be able to do to the level they would love to.

PS: As a side note, I am becoming more and more fascinated by Bayes factors, in particular when expressed in logarithms, because of the connection to information theory. Approaching stats as gathering limited but quantifiable information content from observations/experiments in order to make certainty/uncertainty statements sounds like a deep way to express stats. If anybody has pointers on that, I would be grateful.

5 Likes

I always find it useful to remind myself that the name “statistics” is itself a source of confusion. A statistic is a quantity calculated from experiments: if I measure body heights, I can calculate their average. Body heights are not statistics; they are measurements, whereas the average is a statistic.

Historically at least, statistics (with an s) can be seen as studying the properties of a statistic (no s) when making some assumptions about the measurements. E.g., if heights follow a normal distribution, what is the distribution of their average? Conversely, if we observe a certain behaviour of the statistic (no s), what can we state about the behaviour of the source of the measurements? Classic example: the t-stat is a statistic; Student’s t-test considers the behaviour of the t-stat assuming that the measurements follow a normal distribution.
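A minimal illustration with made-up numbers (the measurements are data; the t value computed from them is the statistic):

```julia
using Statistics

heights = [172.0, 165.3, 180.1, 168.7, 175.2]  # hypothetical heights in cm
μ0 = 170.0                                     # hypothesized mean
n = length(heights)
t = (mean(heights) - μ0) / (std(heights) / sqrt(n))
# If the heights are i.i.d. normal, then under the null hypothesis this t
# follows a Student t distribution with n - 1 degrees of freedom.
```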

Approaching stats as gathering limited but quantifiable information content from observations/experiments to make certainty/uncertainty statements sounds like a deep way to express stats. If anybody has pointers on that, I would be grateful.

From the way you express this PS, I’m guessing you are aware of Jaynes’ book; if not, I guess you will like it.

2 Likes

@arzwa, the linked webpage contains a PDF version of a copyrighted book. I am not sure whether it was created with the permission of the copyright holder (I would guess it was), but in case of doubt, it might be safer to just reference the book?

Whoops, I think you’re right; I was convinced the book (Jaynes’ Probability Theory: The Logic of Science) was freely available. I’ll link to this page instead for further references (where it is noted that ‘The publisher, Cambridge, requested the version of the book that was online be removed to avoid copyright problems’).

1 Like

I wasn’t aware of this work. I’ll have a look, if only to see how different it is from fuzzy logic. A casual glance gives me the impression that it is different from what I have in mind.
I have been thinking about how Bayes’ rule becomes much easier to apprehend when expressed using odds instead of probabilities, and even clearer when using log-odds. The log Bayes factor can be read as the useful information content of a new piece of evidence.
Given that this way of presenting Bayesian probabilities is more expressive, I have been lazily looking for more references: what would it mean and look like to re-express traditional tools such as probability distributions in a log-odds format? I imagine someone has already looked at it.
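For concreteness, the identity I have in mind is the log-odds form of Bayes’ rule, with H a hypothesis and D the observed data:

$$
\log \frac{P(H \mid D)}{P(\lnot H \mid D)} = \log \frac{P(H)}{P(\lnot H)} + \log \frac{P(D \mid H)}{P(D \mid \lnot H)}
$$

The posterior log-odds are the prior log-odds plus the log Bayes factor, so each observation simply adds its log Bayes factor to the running total, which is what gives it the information-accumulation flavour.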

I don’t think the logic aspect of Jaynes’ book is that interesting (it’s definitely not the strongest aspect of the book; see e.g. this essay for some reasons why). It is, however, an enjoyable, original, and often polemical work on probability theory and statistics. What you write here seems to be more or less what Jaynes starts to treat in chapter 4 of the book.

3 Likes

You may also want to look at Kevin Van Horn’s response to the David Chapman essay cited above.

3 Likes

On a related note, another classic (on causality) is

http://bayes.cs.ucla.edu/BOOK-2K/

I also recommend Judea Pearl’s other books.