Why are missing values not ignored by default?

I love Julia. The only feature that I don’t like is the treatment of missing values. It really gets in the way when you do data analysis.

Comparison operators like > and ==, and aggregate functions like sum and mean will return missing if there is at least one missing value in the dataset. So you need to write mean(skipmissing(x)) rather than mean(x), or isequal(x,y) rather than x .== y.
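For reference, a minimal sketch of the behavior described here, using only Base and the Statistics standard library. The vectors `x` and `y` are made up for illustration.

```julia
using Statistics

x = [1.0, 2.0, missing]
y = [1.0, 2.0, missing]

mean(x)                 # missing: one unknown element makes the mean unknown
mean(skipmissing(x))    # 1.5: the mean of the observed values only

x == y                  # missing: comparing missing with missing is undecided
isequal(x, y)           # true: isequal treats missing as equal to missing
x .== y                 # elementwise result: [true, true, missing]
```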

I understand that this approach was intended to ensure code safety. In fact, I think returning NaN when there is at least one NaN value in x is useful in general, as it potentially points out an unintended result. However, missing values only arise when you work in data analysis, and you always know whether there are missing values in your dataset. For me, it’s never been true that mean(x) returning missing is pointing out any unintended result.

Overall, I would have preferred skipmissing to be the default approach. Ultimately, it’s a matter of preference. However, if you’re like me, handling missing values feels like a drag after two hours of serious work, constantly diverting your attention from the actual analysis.

5 Likes

I agree 100% with every point you have made

beyond mean(skipmissing(x)) it gets even more painful with cor, when one must write skipmissing twice…
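One possible way to handle the two-vector case is a small helper that keeps only the pairs where both values are observed. This is a sketch, not part of Statistics: the name `pairwise_cor` and the sample data are hypothetical.

```julia
using Statistics

# Hypothetical helper: correlation over the pairs where both values are observed.
function pairwise_cor(x, y)
    # true at the positions where neither x nor y is missing
    keep = [!ismissing(a) && !ismissing(b) for (a, b) in zip(x, y)]
    xs = Float64[v for v in x[keep]]
    ys = Float64[v for v in y[keep]]
    return cor(xs, ys)
end

x = [1.0, 2.0, missing, 4.0, 5.0]
y = [2.0, missing, 6.0, 8.0, 11.0]

pairwise_cor(x, y)   # uses the pairs (1, 2), (4, 8) and (5, 11)
```

Filtering on a shared mask matters here: skipping the missing entries of each vector independently would pair up values from different rows whenever the gaps sit at different positions.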

I cannot help but feel that the missing semantics were designed by those who (while well-intentioned) do not actually do data science work themselves

5 Likes

I am kind of curious how to compute a mean when there are missing values. Say x has N elements, M of which are missing, and y is the subset of x with no missing values. Which mean do you prefer: sum(y) / N or sum(y) / (N-M)?
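To make the two candidates concrete, here is a small worked example with made-up numbers:

```julia
using Statistics

x = [2, 4, missing, 6]          # N = 4 elements, M = 1 missing
y = collect(skipmissing(x))     # observed values: [2, 4, 6]

N = length(x)
M = count(ismissing, x)

sum(y) / N              # 3.0: implicitly treats the missing entry as zero
sum(y) / (N - M)        # 4.0: averages only the observed values
mean(skipmissing(x))    # 4.0: matches the second definition
```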

1 Like

this one of course

1 Like

It’s not only mean, but any function that reduces a vector to a scalar.

Example:

x = [0, 1, missing]
sum(x)              # this gives you `missing`
sum(skipmissing(x)) # this gives you `1`

Similarly, some functions throw an error if there’s a missing value. Say you have a dataset in which numbers were loaded as strings. To convert them to numbers, you have to use passmissing; otherwise the call errors if there’s a missing value:

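using Missings                     # provides passmissing

x = ["0", "1", missing]
tryparse.(Int, x)                  # this errors: no tryparse method accepts missing
passmissing(tryparse).(Int, x)     # this gives you [0, 1, missing]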

To avoid these problems, you also end up writing a lot of code like this for a dataframe df, so that you work with a subset of values that are not missing:

temp = view(df, (!ismissing).(df.var1), :)
# or
temp = dropmissing(df, :var1, view=true)

I understand that writing this out all the time is frustrating. But if you know what semantics you need, you can convert your data to use a custom type right after loading it, and then gradually build a package with the dispatches you need as you hit those MethodErrors.
Libraries for loading data could be made to accept a custom type for missing data.
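A minimal sketch of what that could look like, assuming the semantics you want are "skip silently". The type `Absent`, the constant `absent`, and the helper `to_absent` are hypothetical names, and only addition is covered.

```julia
# Hypothetical sentinel type for values that should be skipped silently.
struct Absent end
const absent = Absent()

# Only the dispatches needed so far: addition ignores the sentinel.
Base.:+(::Absent, x::Number) = x
Base.:+(x::Number, ::Absent) = x
Base.:+(::Absent, ::Absent) = absent

# Convert right after loading: swap every missing for the sentinel.
to_absent(v) = [ismissing(e) ? absent : e for e in v]

raw = [1, missing, 3]
data = to_absent(raw)

sum(data)   # 4: the sentinel contributes nothing to the sum
```

Anything not yet covered (for example a mean that divides by the number of observed values) would raise a MethodError or need its own method, which is the gradual build-up described above.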

3 Likes

“sure, Julia’s behavior for basic statistical operations is atypical and clunky, but you can always rewrite your own libraries!”

I’m not trying to be too too snarky, but just imagine reading that rebuttal as a Python-wielding tourist to this thread. I’m not sure it would inspire much confidence in Julia’s commitment to make data analysis feel ergonomic.

7 Likes

What then should in your opinion be the proper behavior of the sum function in this case? Silently skip the missings?

1 Like

How are these missing values coded? Is there a package that allows arrays with missing values? I’m curious, as I’ve always handcoded this: one array with actual data, and another coding whether an element in the first array should be used or not.
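For context, no extra package is needed for the basics: `missing` is a value defined in Base, and an array holding it gets a `Union` element type. A short sketch with a made-up vector:

```julia
x = [1, missing, 3]

typeof(missing)    # Missing, a singleton type defined in Base
eltype(x)          # Union{Missing, Int}
ismissing.(x)      # [false, true, false]: the mask you would otherwise keep by hand
```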

And especially in this connection may I strongly disagree with your following sweeping statement?

9 Likes

yup

1 Like

No, not for me!

Especially if you sum positive values (a common case), you will silently get not only a definitely wrong result, but even a wrong estimate. For an estimate, I’d replace each missing with the mean or median of the available values.

But first of all that should be my responsibility, and the responsibility of the software is to compute correctly. Which it does now.
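The estimate described above can be sketched with `coalesce`, here filling with the mean of the observed values. The data is made up, and the median would work the same way.

```julia
using Statistics

x = [1.0, 3.0, missing, 5.0]

fill_value = mean(skipmissing(x))      # 3.0; median(skipmissing(x)) is the alternative
x_filled = coalesce.(x, fill_value)    # [1.0, 3.0, 3.0, 5.0]

sum(skipmissing(x))    # 9.0: sum of the observed values only
sum(x_filled)          # 12.0: estimate of the full sum after filling
```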

15 Likes

Thanks for the info: I didn’t know that ‘missing’ was part of the language.

I agree that having to use skipmissing() all the time would get annoying. Knowing myself, I’d probably write my own library for mean, stdev, etc. that hides all of the skipmissing() verbosity and ugliness.
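Such a personal library can start as a handful of one-line wrappers. A sketch, where the names `smean`, `sstd`, and `ssum` are hypothetical:

```julia
using Statistics

# Hypothetical wrappers that always skip missing values.
smean(x) = mean(skipmissing(x))
sstd(x)  = std(skipmissing(x))
ssum(x)  = sum(skipmissing(x))

x = [1.0, missing, 3.0]

smean(x)   # 2.0
ssum(x)    # 4.0
sstd(x)    # standard deviation of [1.0, 3.0]
```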

8 posts were split to a new topic: Off topic asides and snark on missing’s docs

it’s not “wrong” it’s just a design choice

sometimes the user might want to use behavior different than default design choice, in which case one can coalesce missings to a different value

that is indeed the user’s responsibility and default behavior should be whatever is most standard and ergonomic, and that is to skip missings

Well, that’s your priorities, your personal opinion about what is most standard, and your expectation of how the sum function must behave. For me, if some summands are unknown, then the sum is simply unknown as well.
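This is also how Base treats `missing` at the scalar level: arithmetic and comparisons propagate it, while the logical operators follow three-valued logic and return a definite answer when the unknown operand cannot change it. A short sketch:

```julia
1 + missing        # missing: an unknown summand makes the sum unknown
missing > 1        # missing: the comparison cannot be decided
true | missing     # true: the result is true whatever the unknown value is
false & missing    # false: the result is false whatever the unknown value is
```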

27 Likes

agreed

nonetheless, I would bet that these opinions and priorities are shared by sizeable majority of those doing data science work

especially for stuff like cor it quickly becomes a nightmare

note how describe has the nice skipmissing behavior

2 Likes

I doubt that, but could be. For me it would be another confirmation that data science is not science :stuck_out_tongue:. In any case I would bet that these opinions and priorities are NOT shared by sizeable majority of those doing (proper) science and engineering work.

See also the poll results in “The Zen of Missing in Julia”.

9 Likes

this feels pretty condescending and dismissive of a ton of work by a ton of well-educated and trained professionals. not sure if you intended it to come off this way?

4 Likes

The solution here is pretty simple: write a macro to insert the skipmissings.
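A rough sketch of such a macro, under stated assumptions: the name `@skipm` is hypothetical, it handles a single function call, and it wraps every positional argument in `skipmissing`. That makes it unsuitable for calls with keyword arguments or non-collection arguments such as `quantile(x, 0.5)`.

```julia
using Statistics

# Hypothetical macro: rewrites f(a, b, ...) into
# f(skipmissing(a), skipmissing(b), ...).
macro skipm(call)
    call isa Expr && call.head == :call ||
        error("@skipm expects a function call such as `mean(x)`")
    f = call.args[1]
    args = [:(skipmissing($(esc(a)))) for a in call.args[2:end]]
    return Expr(:call, esc(f), args...)
end

x = [1.0, missing, 3.0]

@skipm mean(x)   # expands to mean(skipmissing(x))
@skipm sum(x)    # expands to sum(skipmissing(x))
```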