I love Julia. The only feature I don’t like is the treatment of missing values. It really slows you down when doing data analysis.
Comparison operators like > and ==, and aggregate functions like sum and mean, return missing if there is even a single missing value in the dataset. So you need to write mean(skipmissing(x)) rather than mean(x), or isequal.(x, y) rather than x .== y.
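For example (using the Statistics standard library):

```julia
using Statistics

x = [1, 2, missing]
y = [1, 2, missing]

mean(x)               # missing
mean(skipmissing(x))  # 1.5

x .== y               # Union{Missing, Bool}[true, true, missing]
isequal.(x, y)        # [true, true, true] -- isequal treats missing as equal to missing
```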
I understand that this approach was intended to ensure code safety. In fact, I think returning NaN when there is at least one NaN value in x is useful in general, as it potentially points out an unintended result. However, missing values only arise when you work in data analysis, and there you always know whether your dataset contains them. For me, mean(x) returning missing has never pointed out any unintended result.
Overall, I would have preferred skipmissing to be the default approach. Ultimately, it’s a matter of preference. But if you’re like me, handling missing values feels like a drag after two hours of serious work, constantly diverting your attention from the actual analysis.
I am kind of curious how to compute a mean when there are missing values. Say x has N elements, M of which are missing, and y is the subset of x that contains no missing values. Which mean do you prefer: sum(y) / N or sum(y) / (N - M)?
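The two conventions give different numbers; a small sketch:

```julia
x = [1.0, 2.0, missing, missing]            # N = 4 elements, M = 2 missing
y = collect(skipmissing(x))                 # [1.0, 2.0], the non-missing subset

sum(y) / length(x)                          # 0.75 -- divide by N
sum(y) / (length(x) - count(ismissing, x))  # 1.5  -- divide by N - M
```

The second convention is what mean(skipmissing(x)) computes.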
It’s not only mean, but any function that reduces a vector to a scalar.
x = [0, 1, missing]
sum(x) # this gives you `missing`
Similarly, some functions raise an error if there’s a missing value. Say you have a dataset that loaded numbers as strings. To convert them to numbers, you have to use passmissing, otherwise the conversion errors on the first missing value:
x = ["0", "1", missing]
tryparse.(Int, x) # this errors
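passmissing comes from the Missings.jl package; wrapping tryparse with it propagates missing instead of erroring:

```julia
using Missings  # provides passmissing

x = ["0", "1", missing]
passmissing(tryparse).(Int, x)  # Union{Missing, Int64}[0, 1, missing]
```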
To avoid these problems, you also end up writing a lot of code like this for a dataframe df, so that you work with a subset of values that are not missing:
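A typical pattern might look like this (a sketch assuming DataFrames.jl; the column names are illustrative):

```julia
using DataFrames

df = DataFrame(x = [1, 2, missing], y = [10, missing, 30])

dropmissing(df)            # drop every row containing any missing value
dropmissing(df, :x)        # drop only rows where x is missing
df[.!ismissing.(df.x), :]  # the same subset on x, written by hand
```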
I understand that writing this out all the time is frustrating. But if you know what semantics you need, you can convert your data to use a custom type right after loading it, and then gradually build a package with the dispatches you need as you hit those MethodErrors.
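A minimal sketch of that idea, using a hypothetical sentinel type NA (all names here are illustrative, not an existing package):

```julia
using Statistics

# Hypothetical sentinel type standing in for missing data.
struct NA end
const na = NA()

# Skip-by-default semantics: reductions simply ignore NA values.
skipna(xs) = (x for x in xs if !(x isa NA))

namean(xs) = mean(skipna(xs))
nasum(xs)  = sum(skipna(xs))
```

With this, namean([1, 2, na]) returns 1.5 instead of propagating a missing value.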
Libraries for loading data could be made to accept a custom type for missing data.
“sure, Julia’s behavior for basic statistical operations is atypical and clunky, but you can always rewrite your own libraries!”
I’m not trying to be too snarky, but just imagine reading that rebuttal as a Python-wielding tourist to this thread. I’m not sure it would inspire much confidence in Julia’s commitment to making data analysis feel ergonomic.
How are these missing values coded? Is there a package that allows arrays with missing values? I’m curious, as I’ve always handcoded this: one array with actual data, and another coding whether an element in the first array should be used or not.
Especially if you sum positive values (a common case), you would silently get not just a definitively wrong result, but a biased estimate. For an estimate, I’d replace each missing value with the mean or median of the available values.
But first of all, that should be my responsibility; the responsibility of the software is to compute correctly, which it does now.
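For what it’s worth, the mean imputation described above is a one-liner in Julia with coalesce:

```julia
using Statistics

x = [1.0, 2.0, missing, 4.0]
m = mean(skipmissing(x))  # mean of the available values
coalesce.(x, m)           # replace each missing with that mean
```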
Thanks for the info: I didn’t know that ‘missing’ was part of the language.
I agree that having to use skipmissing() all the time would get annoying. Knowing myself, I’d probably write my own library for mean, stdev, etc. that hides all of the skipmissing() verbosity and ugliness.
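Such a wrapper layer is only a few lines (the names here are hypothetical, not an existing package):

```julia
using Statistics

# Hypothetical skip-by-default wrappers over the standard reductions.
smean(x) = mean(skipmissing(x))
ssum(x)  = sum(skipmissing(x))
sstd(x)  = std(skipmissing(x))
```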
Well, those are your priorities, your personal opinion about what is most standard, and your expectation of how the sum function must behave. For me, if some summands are unknown, then the sum is simply unknown as well.
I doubt that, but it could be. For me it would be another confirmation that data science is not science. In any case, I would bet that these opinions and priorities are NOT shared by a sizeable majority of those doing (proper) science and engineering work.