Why are missing values not ignored by default?

alfaromartino · November 26, 2023, 1:45am

I love Julia. The only feature that I don’t like is the treatment of missing values. It really gets in the way when you’re doing data analysis.

Comparison operators like > and ==, and aggregate functions like sum and mean, will return missing if there is at least one missing value in the data. So you need to write mean(skipmissing(x)) rather than mean(x), or isequal.(x, y) rather than x .== y.
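A minimal sketch of the behavior described above, using only Base and the Statistics standard library. The vector `x` is made up for illustration.

```julia
using Statistics  # standard library; provides mean

x = [1.0, 2.0, missing]

mean(x)                    # missing: the reduction propagates the missing value
mean(skipmissing(x))       # 1.5: computed over the two observed values only

1 == missing               # missing, not false
isequal(1, missing)        # false
isequal(missing, missing)  # true
x .== 2.0                  # elementwise: [false, true, missing]
```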

I understand that this approach was intended to ensure code safety. In fact, I think returning NaN when there is at least one NaN value in x is useful in general, as it potentially points out an unintended result. However, missing values only arise when you work in data analysis, and you always know whether there are missing values in your dataset. For me, it’s never been true that mean(x) returning missing is pointing out any unintended result.

Overall, I would have preferred skipmissing to be the default approach. Ultimately, it’s a matter of preference. However, if you’re like me, handling missing values feels like a drag after two hours of serious work, constantly diverting your attention from the actual analysis.

adienes · November 26, 2023, 1:56am

I agree 100% with every point you have made

beyond mean(skipmissing(x)) it gets even more painful with cor, when one must write skipmissing twice…
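To illustrate why cor is more awkward: skipping missings in each vector independently only keeps the pairs aligned when both vectors are missing at the same positions. A minimal Base-only sketch, with made-up data, that keeps only the positions observed in both vectors:

```julia
using Statistics  # standard library; provides cor

x = [1.0, 2.0, missing, 4.0, 5.0]
y = [2.0, missing, 6.0, 8.0, 10.0]

# Keep only the positions where both x and y are observed.
keep = .!(ismissing.(x) .| ismissing.(y))

# Convert to plain Float64 vectors; safe because no missings remain.
xs = Float64.(x[keep])
ys = Float64.(y[keep])

cor(xs, ys)  # correlation over the complete pairs (1,2), (4,8), (5,10)

# Note: collect(skipmissing(x)) and collect(skipmissing(y)) both have
# length 4 here, but their elements no longer refer to the same rows.
```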

I cannot help but feel that the missing semantics were designed by those who (while well-intentioned) do not actually do data science work themselves

liuyxpp · November 26, 2023, 2:06am

I am kind of curious how to compute a mean when there are missing values. Say x has N elements, M of which are missing, and y is the subset of x that contains no missing values. Which mean do you prefer: sum(y) / N or sum(y) / (N-M)?
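The two candidate definitions can be compared on a small made-up vector. This sketch also shows which of the two mean(skipmissing(x)) corresponds to:

```julia
using Statistics  # standard library; provides mean

x = [2.0, 4.0, missing, 6.0]
y = collect(skipmissing(x))  # [2.0, 4.0, 6.0]
N = length(x)                # 4 elements in total
M = count(ismissing, x)      # 1 of them missing

sum(y) / N            # 3.0: divides by the full length
sum(y) / (N - M)      # 4.0: divides by the number of observed values
mean(skipmissing(x))  # 4.0: same as sum(y) / (N - M)
```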

adienes · November 26, 2023, 2:18am

this one of course

alfaromartino · November 26, 2023, 2:18am

It’s not only mean, but any function that reduces a vector to a scalar.

Example:

x = [0, 1, missing]
sum(x)              # this gives you `missing`
sum(skipmissing(x))

Similarly, some functions throw an error if there’s a missing value. Say you have a dataset in which numbers were loaded as strings. To convert them to numbers, you have to use passmissing; otherwise the call errors if there’s a missing value.

using Missings                     # provides passmissing
x = ["0", "1", missing]
tryparse.(Int, x)                  # this errors: no tryparse method accepts missing
passmissing(tryparse).(Int, x)     # [0, 1, missing]

To avoid these problems, you also end up writing a lot of code like this for a dataframe df, so that you work with a subset of values that are not missing:

temp = view(df, .!ismissing.(df.var1), :)
# or
temp = dropmissing(df, :var1, view=true)

simsurace · November 26, 2023, 2:41pm

I understand that writing this out all the time is frustrating. But if you know what semantics you need, you can convert your data to use a custom type right after loading it, and then gradually build a package with the dispatches you need as you hit those MethodErrors.
Libraries for loading data could be made to accept a custom type for missing data.
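One possible reading of this idea, as a rough sketch only: instead of replacing `missing` itself, wrap the loaded column in a small type and define just the reductions you need on it. The type name `SkipMissingData` and its field are hypothetical, not part of any package.

```julia
using Statistics  # standard library; provides mean

# Hypothetical wrapper: reductions on it skip missing values by default.
struct SkipMissingData{V<:AbstractVector}
    values::V
end

# Add methods as the need arises; anything not defined here
# will raise a MethodError, which tells you what to add next.
Base.sum(d::SkipMissingData) = sum(skipmissing(d.values))
Statistics.mean(d::SkipMissingData) = mean(skipmissing(d.values))

col = SkipMissingData([1.0, missing, 3.0])
sum(col)   # 4.0
mean(col)  # 2.0
```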

adienes · November 26, 2023, 3:54pm

“sure, Julia’s behavior for basic statistical operations is atypical and clunky, but you can always rewrite your own libraries!”

I’m not trying to be too too snarky, but just imagine reading that rebuttal as a Python-wielding tourist to this thread. I’m not sure it would inspire much confidence in Julia’s commitment to make data analysis feel ergonomic.

Eben60 · November 26, 2023, 4:42pm

What, in your opinion, should the proper behavior of the sum function be in this case? Silently skip the missings?

mpeters2 · November 26, 2023, 4:44pm

How are these missing values coded? Is there a package that allows arrays with missing values? I’m curious, as I’ve always handcoded this: one array with actual data, and another coding whether an element in the first array should be used or not.

Eben60 · November 26, 2023, 5:01pm

And especially in this connection may I strongly disagree with your following sweeping statement?

adienes · November 26, 2023, 5:06pm

yup

Eben60 · November 26, 2023, 5:17pm

No, not for me!

Especially if you sum positive values (a common case), you will silently get not only a definitely wrong result, but even a wrong estimate. For an estimate I’d replace each missing by the mean or median of the available values.

But first of all that should be my responsibility, and the responsibility of the software is to compute correctly. Which it does now.
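A small sketch of the difference between skipping and imputing, on made-up positive data. Filling with the mean of the observed values is just one of the two options mentioned; median would work the same way.

```julia
using Statistics  # standard library; provides mean

x = [10.0, 20.0, missing, 30.0]

sum(x)               # missing: the current default, the total is unknown
sum(skipmissing(x))  # 60.0: total of the observed values only

m = mean(skipmissing(x))  # 20.0: mean of the observed values
filled = coalesce.(x, m)  # [10.0, 20.0, 20.0, 30.0]
sum(filled)               # 80.0: estimate with the gap imputed
```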

mpeters2 · November 26, 2023, 5:17pm

Thanks for the info: I didn’t know that ‘missing’ was part of the language.

I agree that having to use skipmissing() all the time would get annoying. Knowing myself, I’d probably write my own library for mean, stdev, etc. that hides all of the skipmissing() verbosity and ugliness.
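Such a helper library could start out as small as this. The names `ssum`, `smean` and `sstd` are hypothetical, chosen here only for illustration.

```julia
using Statistics  # standard library; provides mean and std

# Hypothetical convenience wrappers that always skip missing values.
ssum(x)  = sum(skipmissing(x))
smean(x) = mean(skipmissing(x))
sstd(x)  = std(skipmissing(x))

x = [1.0, missing, 3.0]
ssum(x)   # 4.0
smean(x)  # 2.0
sstd(x)   # sample standard deviation of [1.0, 3.0]
```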

mbauman · November 26, 2023, 8:48pm

8 posts were split to a new topic: Off topic asides and snark on missing’s docs

adienes · November 26, 2023, 5:22pm

it’s not “wrong” it’s just a design choice

sometimes the user might want behavior different from the default design choice, in which case one can coalesce missings to a different value

that is indeed the user’s responsibility and default behavior should be whatever is most standard and ergonomic, and that is to skip missings
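For reference, a short sketch of what coalescing to a fixed value does, on a made-up vector. Replacing with zero leaves the sum equal to the skipmissing sum but changes the mean, since the replaced entries still count toward the length.

```julia
using Statistics  # standard library; provides mean

x = [1.0, missing, 3.0]
z = coalesce.(x, 0.0)  # [1.0, 0.0, 3.0]

sum(z)                 # 4.0, same as sum(skipmissing(x))
mean(skipmissing(x))   # 2.0
mean(z)                # 4/3, because the zero is averaged in
```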

Eben60 · November 26, 2023, 5:30pm

Well, those are your priorities, your personal opinion about what is most standard, and your expectation of how the sum function must behave. For me, if some summands are unknown, then the sum is simply unknown as well.

adienes · November 26, 2023, 5:34pm

agreed

nonetheless, I would bet that these opinions and priorities are shared by a sizeable majority of those doing data science work

especially for stuff like cor it quickly becomes a nightmare

note how describe has the nice skipmissing behavior

Eben60 · November 26, 2023, 5:46pm

I doubt that, but it could be. For me it would be another confirmation that data science is not science. In any case I would bet that these opinions and priorities are NOT shared by a sizeable majority of those doing (proper) science and engineering work.

See also the poll results in “The Zen of Missing in Julia”

adienes · November 26, 2023, 5:48pm

this feels pretty condescending and dismissive of a ton of work by a ton of well-educated and trained professionals. not sure if you intended it to come off this way?

dlakelan · November 26, 2023, 5:56pm

The solution here is pretty simple, write a macro to insert the skipmissings.
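A rough sketch of what such a macro might look like. The name `@skipm` is hypothetical, and this version only handles a plain function call with positional arguments (no keyword arguments, no nested rewriting).

```julia
using Statistics  # standard library; provides mean

# Hypothetical macro: rewrites f(a, b, ...) into
# f(skipmissing(a), skipmissing(b), ...).
macro skipm(call)
    (call isa Expr && call.head == :call) ||
        error("@skipm expects a function call")
    f = esc(call.args[1])
    args = [:(skipmissing($(esc(a)))) for a in call.args[2:end]]
    return Expr(:call, f, args...)
end

x = [1.0, missing, 3.0]
@skipm mean(x)  # expands to mean(skipmissing(x)), giving 2.0
@skipm sum(x)   # expands to sum(skipmissing(x)), giving 4.0
```

Because each argument is wrapped independently, a macro like this is only appropriate for single-vector reductions; functions of several vectors need the missing positions handled jointly.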
