This just doesn’t work when you want to use multiple packages and each package maintainer has a different preference, right?
I think what @alfaromartino wants here are predicates for filtering that also remove the missings, to enable things like x[ishigher.(x, b)] or filter(a -> ishigher(a, b), x). That way you no longer need to nest with skipmissing, as you would today with skipmissing(x[x .> b]) or filter(>(b), skipmissing(x)).
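As a rough sketch of what such a predicate could look like (ishigher is just the hypothetical name used above, not an existing function):

ishigher(a, b) = !ismissing(a) && a > b  # false for missing, so indexing/filtering drops those elements

x = [1, missing, 3, missing, 5]
x[ishigher.(x, 2)]              # [3, 5], no skipmissing needed
filter(a -> ishigher(a, 2), x)  # same elements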
That’s the thing, those scalar operations don’t skip missings because they propagate one argument’s missings and can’t remove elements like skipmissing does, which matters for operations that depend on the collection’s length: e.g. replacing missings with falses instead of skipping will affect mean. It seems like disparate actions are being muddled here just because they all handle missings.
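A toy example of that last point (just for illustration):

using Statistics
x = [true, missing, true]
mean(skipmissing(x))       # 1.0, averaged over the 2 observed values
mean(coalesce.(x, false))  # ≈ 0.67, the replaced false counts as a third observation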
Can anyone point to a publicly available (maybe govt produced) moderate size dataset with missing values and some suggested analyses that could be done with them so we can have a kind of “playground”?
I could probably dig some up with a bit of research, but maybe someone has a really good example along the lines of a manageable version of @pdeffebach’s comment:
Maybe not 1000+ variables, but something with a few tens of thousands of rows, a hundred-ish columns, and liberal use of missing?
julia> using DataFrames
julia> using ShiftedArrays: lag
julia> df = DataFrame(rand(100, 2), :auto);
julia> df.x1_lag1 = copy(lag(df.x1));
julia> df.x2_lag2 = copy(lag(df.x2, 2));
Now compute a cor of x1_lag1 and x2_lag2 without pulling your hair out, and for a subset of lags restricted to positive values!
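(For reference, the “by hand” version today looks roughly like the sketch below: build a mask of rows where both lagged columns are present, add the positive-value restriction, then index both columns with it; disallowmissing comes from Missings.jl.)

using Statistics, Missings
mask = .!ismissing.(df.x1_lag1) .& .!ismissing.(df.x2_lag2) .&
       coalesce.(df.x1_lag1 .> 0, false)
cor(disallowmissing(df.x1_lag1[mask]), disallowmissing(df.x2_lag2[mask]))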
Shouldn’t you provide a different default value for lag, then?
no?
sorry if that’s curt, but I just don’t really see how that helps. a default value would not be appropriate
Yeah, it’s just not clear to me what the desired outcome is from manually introducing different numbers of missings into different columns and performing an operation over them.
to get a correlation of one feature lagged to another
this is super-duper-giga-extremely common in building signals on financial data
I’m not a data scientist, but is there more to it than this?
julia> nomissingdf = dropmissing(df);
julia> using Statistics
julia> cor(nomissingdf.x1_lag1, nomissingdf.x2_lag2)
0.13004606550437073
For other filtering steps (e.g., positive values only), apply them to nomissingdf as you would otherwise?
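Something like this, I’d guess (just a sketch reusing the columns from above):

pos = nomissingdf[(nomissingdf.x1_lag1 .> 0) .& (nomissingdf.x2_lag2 .> 0), :]
cor(pos.x1_lag1, pos.x2_lag2)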
I’m aware of what correlations are, but I definitely don’t work with financial data. I just don’t understand why you’re introducing missings into it. Is there no standard practice for shifted data, like zero-padding or excluding non-overlapping ends?
now imagine you have a bunch more features and a bunch more lags, you do a diagonal join, etc. etc.
if you dropmissing(df) you will have 0 rows. cor should only drop the rows used for that calculation, not every single row in the entire table that contains any missing.
@Benny this basically is “excluding non-overlapping ends,” I just don’t want to do that by hand over and over and over
Just in case, the matter is not whether you can perform operations like this. You can. In fact, you could simply do:
df2 = dropmissing(df, [:x1_lag1, :x2_lag2], view=true)
cor(df2.x1_lag1, df2.x2_lag2)
Note that writing dropmissing(df) drops more rows than you need. And if you also want to modify the original df, you need to add view=true.
But every two lines of code you write, you have to think about missings for some types of work where missings are uninformative. It gets extremely tiring after a few hours.
Julia has always been about making code as simple as possible, showing how you can implement in 10 lines what other languages need 100 lines for. So this is a case where the opposite happens: code that should be 500 lines ends up being 1000 lines full of skipmissing, passmissing, dropmissing, your own functions like ishigher instead of .>, etc.
But, again, I’m not sure there’s an easy solution.
I think this is what is hard to convey. I can understand why some people might feel unsympathetic when there are so many ~clear and easy~ solutions to this problem. but after the 1000th time the language says “neener neener, didja think about missing???” and having to reply “yes, I promise, it’s ok” with 11 unnecessary characters, it gets exhausting
I mean, you have a specific use case where you are going to repeatedly do a similar thing over and over, handling missing in a way that’s consistent for you… So write a function that does that? smcor() or something.
This doesn’t seem different to me than having a particular kind of plot you want to do over and over and writing a particular plotting function to do it.
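For example, a minimal sketch of such a helper (smcor is just the hypothetical name above; it keeps pairwise-complete observations and then calls cor, with disallowmissing coming from Missings.jl):

using Statistics, Missings
function smcor(x::AbstractVector, y::AbstractVector)
    keep = .!(ismissing.(x) .| ismissing.(y))  # pairwise-complete observations
    return cor(disallowmissing(x[keep]), disallowmissing(y[keep]))
end

smcor(df.x1_lag1, df.x2_lag2)  # same idea as the dropmissing version above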
right, and now we’re back to “if you like polars’s missing semantics so much why don’t you just go rewrite your own statistical libraries”
I’m not saying you’re entirely wrong, but this is definitely not a particularly user-friendly response to a touring data scientist who just wants to get some work done
Anyone who wants scor in Missings.jl should chime in here:
I really doubt it, too. Sure it’s annoying to assemble building blocks so often, and it’s also annoying to make custom functions to do the repeated patterns quicker. But it doesn’t make sense to distribute many versions of the same thing with different defaults and processing.
About the cor example, it makes sense to drop the non-overlapping missing ends to compute a correlation. It could also make sense in some contexts to make the defaults zeros in the first place. If the data had a bunch of missings in the middle, it could make sense to not drop those and propagate the missing to indicate that the data is too poorly recorded to be useful. Do they all deserve their own version of cor or their own version of Statistics? We can only introduce so many new things before it becomes bloat, and even then it won’t cover much.
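To make that concrete with made-up toy values (not from the thread), the three policies really do give three different answers:

using Statistics, Missings
x = [missing, 1.0, 2.0, 3.0]  # missing at the start, like a lag
y = [2.0, 4.0, 6.0, missing]  # missing at the end
keep = .!(ismissing.(x) .| ismissing.(y))
cor(disallowmissing(x[keep]), disallowmissing(y[keep]))  # drop the ends: 1.0 (only two points overlap)
cor(coalesce.(x, 0.0), coalesce.(y, 0.0))                # fill with zeros: -0.2
cor(x, y)                                                # propagate: missing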
Ironically, I think this is one place where all the code sharing in Julia is detrimental. numpy.mean and polars.mean have different behaviors, and each makes total sense within its own context.