# Why is statistics so difficult?

Hi all! I see a few very interesting discussions going on in this forum, and would like to ask for your opinions on the difficulty of statistics. I see many people hanging around here with a lot of experience in stats, and a clear way of explaining things.

The reason I ask this is because statistics keeps baffling me. A few weeks ago, I finally had the idea that I understood some very basics of causality, and Bayesian and Frequentist stats. Recently, I've read some more and am completely at a loss. This time, the authors were arguing about the merits of Bayes factors, and I just couldn't get it to click with my previous knowledge. It seems like each and every author speaks their own language. I've had this happen time and again. As another example, I find deep learning papers pretty easy to read, but don't see the connection with statistics as used in the empirical sciences (for example, p-values, Bayes factors, ANOVAs, and multilevel models).

So, to make my question more explicit: I wonder what are your thoughts on

1. why it is difficult to combine knowledge of different branches of statistics,
2. how to best approach learning statistics with the help of Julia (avoiding R and Python), and
3. your experiences with learning it; would you have approached it differently in hindsight?
12 Likes

If you haven't, I would start with Wasserman's All of Statistics. If you're not ready for that book, I would learn the required amounts of calculus and linear algebra first and then come back.

Statistics is so hard to learn because it's a branch of mathematics that people pretend isn't a branch of mathematics and so they end up teaching it very poorly; for example, people try to teach you intuitions instead of teaching you theorems, but the intuitions aren't precise enough to prevent you from misunderstanding what the theorems really say. Stuff like "for all distributions there exists an N such that […]" versus "there exists an N such that for all distributions […]" is vitally important to correctly representing the claims of statistical theory, but those kinds of distinctions are exactly what tends to get lost in non-mathematical treatments of the field.
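To spell out the quantifier-order point, here is a schematic rendering of the two statements (a sketch, not any specific theorem); the first is a far stronger claim than the second:

```latex
% "Uniform" version: a single N works for every distribution F
\exists N \;\, \forall F : \quad \text{the approximation holds for } F \text{ whenever } n \ge N

% "Pointwise" version: each distribution F gets its own N
\forall F \;\, \exists N : \quad \text{the approximation holds for } F \text{ whenever } n \ge N
```

Uniform guarantees typically require extra conditions on the class of distributions (moment bounds, for instance), which is exactly the kind of fine print that informal treatments drop.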

22 Likes

I think your overall question (in the topic title) is a great one, albeit you're probably also right in making it more concrete, as I'm not sure it would have an actual answer otherwise.

• Statistics is a vast field, which you hint at in your question by mentioning causality, frequentist, and Bayesian reasoning. Causality as found in the work of e.g. Judea Pearl (graphical causal models and related techniques) is somewhat out of scope of "core" statistics (disclaimer - IANAS and so have little insight into modern Statistics curricula, but have taken this from Pearl himself and from working through "standard" Stats textbooks), and at a minimum quite different from the statistical reasoning required to perform statistical inference, which is what your frequentist/Bayesian split might be concerned with. That split in itself can clearly lead you down a massive rabbit hole which might end in places that could be considered closer to a philosophy department than anything else. I suppose this point goes some way to addressing your question (1) - Statistics as a field covers a lot of ground, which means that sub-fields are large enough to develop (and probably warrant) their own way of talking about the bit of Statistics they are concerned with day-to-day, which might limit "interoperability".

• I would also say that Statistics as a subject matter is for the most part just hard, at least where it concerns inference and related bits built on probability theory. Even abstracting from any Bayesian/frequentist arguments about the "right" (or best) way of approaching things, the underlying concepts are challenging and building intuition about them is hard, or even impossible (echoing John's point above - sometimes an "intuitive" approach just doesn't cut it). One case in point is the evidence around widespread misinterpretation of confidence intervals even by statistics educators (see e.g. here). While one might be tempted to put this down to the specific challenges posed by the frequentist framework and say the solution is to go Bayesian (and I'd personally probably agree with this to some extent, at least on the specific question of confidence intervals), it still appears that even with more interpretable CIs humans tend to make errors in basing judgement on them (Andrew Gelman and co-authors have written quite a bit on this; see e.g. here a post from Jessica Hullman on his blog discussing the issue in the context of Bayesian models of the recent US election).

• If you want to learn Statistics with Julia you might be interested in the book Statistics with Julia - that said, there is also a great port of Statistical Rethinking in `StatisticalRethinking.jl`, so that's an option if you can live with the fact that the text itself uses R.

• I'm not sure there's a good answer to your question (3) - ultimately this depends heavily on the personal circumstances in which you need to use Statistics. I had a very lacking statistical education in my undergrad and learned most of what I know now during my economics Master's and PhD degrees. That means things for me were heavily tilted towards a frequentist approach focused around linear models, issues around what economists call "identification" (basically correctly estimating causal effects from observational, mainly panel data sets) and related issues of correctly estimating standard errors. I guess in hindsight of course it would have been great to have a more principled and rigorous approach as outlined by @johnmyleswhite above, but clearly there are opportunity costs to this and given that I'm an economist, not a statistician, I can live with the fact that I can't prove the CLT off the top of my head, even if it means that I probably have to go back and consult All of Statistics (or my personal favourite, Casella and Berger) more frequently than I would have to if I had built a much more solid foundation during my university years.

10 Likes

I find this text very enlightening regarding Bayesian statistics:

https://www.nature.com/articles/nbt0904-1177

5 Likes

If you have the time, there is an excellent MOOC-based MicroMasters program in Statistics and Data Science at MITx…
Really great, from zero to "advanced basic"; mostly theory, but when they get to computational aspects they use R/Python…

1 Like

Another great book that provides a tour from early statistics to modern methods is Computer Age Statistical Inference by Efron and Hastie.

8 Likes

I agree with many of the points other posters have stated. In addition to those points, another source of difficulty might be the lack of agreement among statisticians. Oddly enough, the field of statistics has a long history of contention not only between frequentists and Bayesians, but also between statisticians within each framework. For example, Neyman and Fisher proposed different and incompatible approaches to frequentism. Unfortunately, many books that target non-statisticians present an incoherent amalgamation of Neyman's and Fisher's approaches.

In my experience, it can be a confusing process. What I found to be helpful was understanding limitations and assumptions with different approaches and reading a wide range of sources. Out of curiosity, what made you question your understanding?

3 Likes

I've always thought it is rather telling that the fundamentals of calculus were all pretty much worked out in the late 17th century, while the fundamentals of probability didn't really get sorted until the early 20th century.

PS if we're making textbook recommendations, I'd go with William Feller's textbooks for the fundamentals of probability theory, and James Davidson for econometrics/time series/stochastic processes.

1 Like

Statistics is so hard to learn because it's a branch of mathematics that people pretend isn't a branch of mathematics and so they end up teaching it very poorly

@johnmyleswhite I have started to read All of Statistics and also Introduction to Mathematical Statistics by Hogg and McKean. Although I find both books very soothing, and love working through those exercises/puzzles, at the end of the day I have a hard time applying it to my empirical research questions. I see you have done an empirical PhD as well. Do you have tips on how to apply the mathematical knowledge to, let's say, hypothesis testing? (Ignoring the many papers arguing against hypothesis testing for now.)

Statistics as a field covers a lot of ground, which means that sub-fields are large enough to develop (and probably warrant) their own way of talking about the bit of Statistics they are concerned with day-to-day, which might limit "interoperability".

@nilshg Good point. Probably, I should just appreciate the differences a bit more. Someone can be an expert in C kernel development without being able to understand Python code.

@Tamas_Papp I've heard Judea Pearl say that learning math should occur in a chronological fashion, so I'm gonna read your suggestion today! It sounds like a good foundation.

Good question. I think mainly because of two things:

1. I often find myself skimming over parts of methodology/statistics papers that I don't understand.
2. Sometimes, when trying to reproduce the numbers reported in a paper, I get different numbers. At that point, I have no idea how to fix my analysis.

Yeah. Resolving discrepancies between your analysis and the analysis in the paper might be difficult or impossible. Some details of the analysis might be omitted due to page limitations, and sometimes seemingly small details can make a difference. A recent example that comes to mind is a person who had trouble translating a tutorial model in PyMC3 to Turing. It turns out that the issue was differences in the way that Python and Julia parameterize Gamma distributions.
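For anyone who hits the same snag: PyMC3's Gamma takes a shape and a rate, while Julia's Distributions.jl `Gamma(α, θ)` takes a shape and a scale (the reciprocal of the rate). A minimal sketch (in plain Python, purely for illustration) of how far apart the two readings of the same pair of numbers end up:

```python
# Two readings of "Gamma(2, 4)": shape/rate (PyMC3-style) vs.
# shape/scale (Distributions.jl-style). Same numbers, different
# distributions -- the means differ by a factor of beta**2.
alpha, beta = 2.0, 4.0

mean_rate = alpha / beta     # shape/rate reading:  mean = 0.5
mean_scale = alpha * beta    # shape/scale reading: mean = 8.0

# Converting between the conventions is trivial once you know it's needed:
scale = 1.0 / beta           # rate 4 corresponds to scale 0.25
assert alpha * scale == mean_rate
```

The fix in translations like this is simply to pass `1/beta` as the scale (or vice versa) so both libraries describe the same distribution.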

2 Likes

The extent of it differs by field, but a lot of fields suffer from a reproducibility crisis, which may be part of the problem. E.g., a fun read for economists:

4 Likes

On second thought, would a mathematical basis really be the best way to start with statistics nowadays? If the goal is to learn statistics for its own sake, then probably yes. In most cases, however, and for most people struggling with statistics, the goal isn't to learn statistics but to apply it. Then, I would argue that the beauty of software is that you can abstract away many of the underlying details.

For example,

1. Linus Torvalds and Guido van Rossum are both excellent programmers, and I'm sure that they both know a lot of low-level details. But do they know the required electronics, physics, mathematics and mechanical engineering to build a chip? Probably not.
2. Many approaches in machine learning have not been mathematically proven. However, they are useful. (This topic is also touched upon in Tamas_Papp's book suggestion; it discusses a distinction between algorithms and inference, and how algorithms are produced first and inference techniques for those algorithms after.)
3. When I connect a back end to a database, I do not know all the underlying details of the database. Only if I notice that things do not work out do I dive deeper into the inner workings of the database system.

Based on these examples, I find it hard to believe that knowing the mathematical details is a requirement for doing sound statistical inference. Yes, I do think that it can help in avoiding mistakes, and that a lack of understanding of the mathematics has led me to the original question of this topic, but for day-to-day applications it's not a necessity.

2 Likes

What is the specific challenge you're having with applying the ideas in the hypothesis testing section of Wasserman? Hypothesis testing requires that you (a) define a hypothesis, (b) define a test statistic, (c) calculate the quantiles of the test statistic's distribution under the hypotheses and (d) report the quantiles as p-values. Which is the part you're struggling with?
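Those four steps fit in a few lines of code. A toy one-sample z-test with made-up data (a sketch in plain Python; the data and the hypothesized parameters are purely illustrative):

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function (no external libraries).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# (a) Hypothesis: the data are N(mu0, sigma^2) with mu0 = 0 and sigma = 1.
# (b) Test statistic: z = sqrt(n) * (sample mean - mu0) / sigma.
data = [0.3, -0.1, 0.8, 0.5, 0.2, 0.6, -0.2, 0.4]   # made-up observations
n = len(data)
xbar = sum(data) / n
z = math.sqrt(n) * (xbar - 0.0) / 1.0

# (c)/(d) Two-sided p-value: probability, under the hypothesis, of a
# statistic at least as extreme as the one observed.
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
```

The hard part, as the rest of this thread argues, is not computing `p_value` but knowing whether the hypothesis, statistic, and distributional assumptions were the right ones in the first place.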

I think you've successfully gotten to the heart of the matter, but you've ended up with a fairly unsafe conclusion.

Doing statistics correctly is much messier than querying databases or programming in C. A database lets you issue simple queries that deterministically produce simple results back, which means that you can immediately learn how variations in the queries you author lead to variations in the outputs; that makes it relatively easy to master the cause-and-effect relationship between your actions and the outcomes you get. But statistics is mostly not like that: if you produce a wrong p-value, how will you know? What's going to be the feedback cycle connecting your trials and your errors to enable trial-and-error learning? All of the things you're describing produce fairly easily detectable errors and do so deterministically, which means that learning to do them well is substantially easier than learning to do statistics well. And all of that is in addition to the fact that C and databases are relatively well-defined abstractions that mostly let you ignore the lower-level implementation, but this isn't true of statistical software. Learning how to call an existing t-test function lets you avoid understanding how to write that function, but does not let you avoid understanding how to interpret the output. The functions abstract over the software details, but not over the mathematical details.

10 Likes

I think that I had Bayesian model fitting and comparison like in Statistical Rethinking in my mind, and said hypothesis testing. Anyway, you're right. Hypothesis testing is explained by Wasserman.

Just came here to write that - the problem with incorrect application of statistics is that there's no way of knowing that you're doing it wrong, unless you are only dealing in prediction tasks, which relates back to your point about "many ML techniques have not been proven".

The only reason this works in ML is that you can easily verify that your MSE or equivalent metric of choice is lower than that produced by some other method. That's just not possible for computing a confidence interval: no confidence interval is numerically "better" than any other; it's either derived correctly or incorrectly.
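To make that contrast concrete: essentially the only way to check a confidence-interval procedure is to simulate data where the truth is known and count how often the interval covers it, which is exactly the feedback loop real data never gives you. A minimal sketch in Python (the standard 95% z-interval for a normal mean with known sigma; all parameter values are illustrative):

```python
import math
import random

random.seed(1)

def empirical_coverage(n_sims=2000, n=30, mu=1.0, sigma=2.0):
    # Fraction of simulated datasets whose 95% z-interval covers the true
    # mean. This check is only possible because we simulated the data
    # ourselves and therefore know `mu`.
    half_width = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(n_sims):
        xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / n_sims

coverage = empirical_coverage()   # should land near 0.95
```

With real data there is no `mu` to compare against, so a wrongly derived interval produces no error message at all.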

3 Likes

First of all, I appreciate the discussion here a lot. Thanks

I think common sense, domain knowledge and skepticism can mitigate statistical errors just as well as a mathematical understanding. Even someone who knows all there is to know about statistical mathematics could make a mistake when doing a statistical analysis. For example, that person could lack the domain knowledge to recognize that the predictor variable has been reversed by accident. Things like this actually happen; I've seen a paper retracted for exactly this reason.

Not all software bugs cause "fairly easily detectable errors". For example, nondeterministic errors occur in software when applying multi-threading or distributed systems. To put it differently, there are situations where you have "no way of knowing that you're doing it wrong", like when working on large distributed systems. Some bugs will only bring systems down at real-world production scale. That doesn't mean these bugs can't be found and avoided in the first place: they can be avoided by using intuition, common sense and tools. Good visualizations are one such tool, and they are also a great way to know that "you're doing it wrong" in statistics.

Yes, but also not perfectly. There are situations where bugs are caused by changes in details, like the timings of abstractions, which cause other systems to fail. Then you notice that the system fails and can investigate the cause further. Similarly, I could graphically estimate means with a kernel density estimator and then detect that a t-value is wrong, while not knowing the details of the t-test function. So, without knowing the mathematical details, mistakes can be spotted by being skeptical and doing multiple complementary analyses.

With these things in mind, I would say that mathematics is a tool to avoid mistakes in analyses. It probably is an excellent tool, but not the only one.

The main reason, I think, that I am arguing against mathematics (don't get me wrong, I love mathematics) is that it's just hard to ever really get started with statistics if you require yourself to fully understand all the mathematical details behind it. Maybe at one point you finally understand all there is to understand about frequentist statistics, but then someone comes along and tells you that you are better off applying Bayesian statistics or whatever other technique is in favor that day. Should this person then read an introductory book on the topic to understand its details? I think this person would be able to avoid 99% of the pitfalls after reading a few Wikipedia pages on the topic.

EDIT: This discussion also has a lot of parallels with the discussions around formally proving programs. If it were up to Dijkstra, we would be formally proving many more aspects of our programs. At the end of the day, however, we need to get systems going, and better a system that works most of the time than a formally proven and understood system that is never finished.

this question feels really relatable! earlier this year, i made a career transition from software developer to machine learning engineer, and i've had to learn quite a bit of statistics to catch up.

so, i can't speak as an expert in the field – but, as someone else also in the process of learning, here's a few experiences that seem relevant to your questions:

• there's a noticeable divide in the literature, between sources founded on "traditional" 20th-century statistical methods ( e.g. "Applied Linear Statistical Models", by Kutner, Nachtsheim, Neter, and Li ) and newer sources aligned with the emergence of "data science" and "machine learning" ( e.g. "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman ). the latter definitely build upon the earlier – but it's not always obvious what is relevant, or how. for example, traditional texts seem to spend a lot of time on methods for quantifying uncertainty, while more modern sources often approach this problem from the angle of model selection. certainly, there's overlap between those approaches, but i know that i have personally struggled to prioritize topics in terms of how much insight they'll provide into my own work.

• some folks above have already pointed this out, but it's worth echoing: the statistical literature does not seem to be terribly consistent in its adherence to mathematical rigor. this is complicated by the fact that, like many academic and advanced technical domains, authors are often writing with the presumption that the audience is fluent in a specific and fairly substantial set of concepts – which aren't always identified clearly, and without which many claims will seem opaque. it's not always easy to tell which principles are being offered without rigorous justification, and which have a rigorous justification that is worked out elsewhere. that kind of second-guessing really slows down study.

• if i were to go back and start differently, i would:
– get guidance from an actual expert on where to start
– narrow my focus to one very elementary foundational topic at a time ( e.g. "what is a probability distribution?", or, "what is conditional probability?" )
– complement the study with some kind of concrete implementation project

as it happens, that's what i've ended up doing, although not before wasting several months drifting from one topic to the next, trying to figure out what mattered. i'm fortunate enough to work closely with an authentic statistical wizard, and the points above came together from his advice to develop intuition by writing my own implementation of sampling a random variable from a given parametric distribution, which really did develop a lot of insight.

to that end: Julia seems like a great language to do it in!
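for anyone who wants to try that same exercise, the simplest entry point is inverse-transform sampling. a minimal sketch (shown in plain Python for illustration; the same few lines translate directly to Julia):

```python
import math
import random

random.seed(42)

def sample_exponential(rate):
    # Inverse-transform sampling: if U ~ Uniform(0, 1), then
    # -log(1 - U) / rate follows an Exponential(rate) distribution,
    # because that expression is the inverse of the Exponential CDF
    # F(x) = 1 - exp(-rate * x).
    u = random.random()
    return -math.log(1.0 - u) / rate

# Sanity-check the sampler against the known mean, 1/rate.
draws = [sample_exponential(2.0) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)   # should be close to 0.5
```

checking the sample mean (or a histogram) against the known theoretical value is exactly the kind of concrete feedback loop that's otherwise missing when learning statistics.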

5 Likes

While I love math, it is important to keep in mind that for most of statistics, it is a tool, not an end in itself.

For example, the question asked by Bayesian statistics is really simple: having seen the data, what do I learn about the parameters? You can demonstrate the basic principle with simple examples like the physicist's twins. It is only practical inference for large-dimensional models, with its computational challenges, that gets very technical. There are excellent Bayesian books, like Gelman and Hill (2006), without a single mathematical theorem or proof in them.
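That basic principle fits in a few lines. For instance, the textbook conjugate Beta-Binomial update for a coin's heads-probability (a generic sketch in Python, not taken from any of the books mentioned):

```python
# Prior: Beta(a, b) on the heads-probability of a coin. After observing
# h heads and t tails, the posterior is Beta(a + h, b + t) -- "having
# seen the data", the prior parameters simply shift by the counts.
a, b = 1.0, 1.0                                # uniform prior
h, t = 7, 3                                    # observed: 7 heads, 3 tails
a_post, b_post = a + h, b + t                  # posterior: Beta(8, 4)
posterior_mean = a_post / (a_post + b_post)    # 8/12 = 2/3
```

Everything beyond this in practical Bayesian work (MCMC, variational inference, and so on) is machinery for doing the same update when no closed form exists.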

The question asked by hypothesis testing is equally simple, conceptually. But the tradition that evolved around answering it invested a lot of machinery in giving asymptotic/approximate answers that eventually just reduce to a lookup in the tabulation of a few canonical distributions – this probably happened because that school of thought predates the computer era.

I would recommend that you familiarize yourself with the principles of various approaches first (the Efron-Hastie book is great for that), and then once you find one you are interested in, think about investing in the related technical/mathematical toolkit.

14 Likes