Why are missing values not ignored by default?

This just doesn’t work when you want to use multiple packages and each package maintainer has a different preference, right?

I think what @alfaromartino wants here are predicates for filtering that also remove the missings, to enable things like x[ishigher.(x, b)] or filter(a -> ishigher(a, b), x). That way you no longer need the workarounds you’d use today, like x[coalesce.(x .> b, false)] or filter(>(b), skipmissing(x))
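A minimal sketch of what such a predicate could look like (ishigher is hypothetical, not an existing function): it always returns a plain Bool, treating missing as false, so it works directly for indexing and filtering:

julia> ishigher(a, b) = !ismissing(a) && a > b;  # hypothetical: missing counts as "not higher"

julia> x = [1, missing, 3];

julia> x[ishigher.(x, 2)]
1-element Vector{Union{Missing, Int64}}:
 3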

That’s the thing: those scalar operations don’t skip missings, they propagate one argument’s missings, and they can’t remove elements the way skipmissing does. That matters for operations that depend on the collection’s length, e.g. replacing missings with falses instead of skipping them will change a mean. It seems like disparate actions are being muddled here just because they all handle missings.
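A small sketch of how the two treatments disagree on anything length-dependent:

julia> using Statistics

julia> x = [true, missing, true];

julia> mean(skipmissing(x))       # 2 trues over the 2 kept elements
1.0

julia> mean(coalesce.(x, false))  # 2 trues over all 3 elements
0.6666666666666666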

Can anyone point to a publicly available (maybe govt produced) moderate size dataset with missing values and some suggested analyses that could be done with them so we can have a kind of “playground”?

I could probably come up with some after a bit of research, but maybe someone has a really good example along the lines of a manageable version of @pdeffebach’s comment:

Maybe not 1000+ variables, but something with a few tens of thousands of rows, a hundred-ish columns, and liberal use of missing?

julia> using DataFrames

julia> using ShiftedArrays: lag

julia> df = DataFrame(rand(100, 2), :auto);  # columns :x1 and :x2, no missings yet

julia> df.x1_lag1 = copy(lag(df.x1));     # lag by 1: one leading missing

julia> df.x2_lag2 = copy(lag(df.x2, 2));  # lag by 2: two leading missings

now compute a cor of x1_lag1 and x2_lag2 without pulling your hair out
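For reference, the naive call just propagates, which is where the hair-pulling starts:

julia> using Statistics

julia> cor(df.x1_lag1, df.x2_lag2)
missing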


and for a subset of lags restricted to positive values!

Shouldn’t you provide a different default value for lag then?

no?

Sorry if that’s curt, but I just don’t really see how that helps. A default value would not be appropriate.


Yeah it’s just not clear to me what the desired outcome is from manually introducing different numbers of missings into different columns and performing an operation over them.

to get a correlation of one feature lagged to another

this is super-duper-giga-extremely common in building signals on financial data


I’m not a data scientist, but is there more to it than this?

julia> nomissingdf = dropmissing(df);  # drops every row that has a missing anywhere

julia> using Statistics

julia> cor(nomissingdf.x1_lag1, nomissingdf.x2_lag2)
0.13004606550437073

For other filtering steps (e.g., positive values only), apply them to nomissingdf as you would otherwise?
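E.g. a sketch for the positive-values case, using the same column names as above:

julia> pos = filter(r -> r.x1_lag1 > 0 && r.x2_lag2 > 0, nomissingdf);

julia> cor(pos.x1_lag1, pos.x2_lag2);  # value depends on the random df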


I’m aware of what correlations are, but I definitely don’t work with financial data. I just don’t understand why you’re introducing missings into it. Is there no standard practice for shifted data, like zero-padding or excluding non-overlapping ends?

Now imagine you have a bunch more features and a bunch more lags, you do a diagonal join, etc. etc.
If you dropmissing(df) you will have 0 rows. cor should only drop the rows used for that calculation, not every single row in the entire table that contains any missing.

@Benny this basically is “excluding non-overlapping ends,” I just don’t want to do that by hand over and over and over
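For what it’s worth, StatsBase’s pairwise has a skipmissing keyword that, if I remember right, does exactly this per-pair dropping:

julia> using StatsBase, Statistics

julia> C = pairwise(cor, eachcol(df); skipmissing=:pairwise);  # 4×4 matrix; each cell drops only its own pair's missings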

Just in case, the matter is not whether you can perform operations like this. You can. In fact, you could simply do:

df2 = dropmissing(df, [:x1_lag1, :x2_lag2], view=true)

cor(df2.x1_lag1, df2.x2_lag2)

Note that dropmissing(df) drops more missings than you need. And if you also want modifications of the subset to carry over to the original df, you need to add view = true, since a view writes through to its parent.
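A quick sketch of that write-through behavior, using the df from above:

julia> df2 = dropmissing(df, [:x1_lag1, :x2_lag2]; view=true);

julia> df2.x1_lag1[1] = 0.0;  # the view shares memory, so this also mutates df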

But for the kinds of work where missings are uninformative, every two lines of code you write, you have to think about missings. It gets extremely tiring after a few hours.

Julia has always been about making code as simple as possible: you can implement in 10 lines what other languages need 100 for. This is a case where the opposite happens: code that should be 500 lines ends up being 1000 lines full of skipmissing, passmissing, dropmissing, self-defined functions like ishigher instead of plain .>, etc.

But, again, I’m not sure there’s an easy solution.

1 Like

I think this is what is hard to convey. I can understand why some people might feel unsympathetic when there are so many ~clear and easy~ solutions to this problem. but after the 1000th time the language says “neener neener, didja think about missing???” and having to reply “yes, I promise, it’s ok” with 11 unnecessary characters, it gets exhausting


I mean, you have a specific use case where you are going to repeatedly do a similar thing over and over handling missing in a way that’s consistent for you… So write a function that does that?

smcor() or something

This doesn’t seem different to me than having a particular kind of plot you want to do over and over and writing a particular plotting function to do it.
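For what it’s worth, a minimal sketch of that idea (smcor is just the name floated above, nothing standard): drop the rows where either input is missing, then call cor as usual:

julia> using Statistics

julia> using Missings: disallowmissing

julia> function smcor(x, y)
           keep = .!ismissing.(x) .& .!ismissing.(y)  # rows where both are present
           cor(disallowmissing(x[keep]), disallowmissing(y[keep]))
       end;

julia> smcor(df.x1_lag1, df.x2_lag2)  # same rows kept here, so same value as above
0.13004606550437073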


right, and now we’re back to “if you like polars’s missing semantics so much, why don’t you just go rewrite your own statistical libraries”

I’m not saying you’re entirely wrong, but this is definitely not a particularly user-friendly response to a touring data scientist who just wants to get some work done

Anyone who wants scor in Missings.jl should chime in here:


I really doubt it, too. Sure it’s annoying to assemble building blocks so often, and it’s also annoying to make custom functions to do the repeated patterns quicker. But it doesn’t make sense to distribute many versions of the same thing with different defaults and processing.

About the cor example, it makes sense to drop the non-overlapping missing ends to compute a correlation. It could also make sense in some contexts to make the defaults zeros in the first place. If the data had a bunch of missing in the middle, it could make sense to not drop those and propagate the missing to indicate that the data is too poorly recorded to be useful. Do they all deserve their own version of cor or their own version of Statistics? We can only introduce so many new things before it becomes bloat, and even then it won’t cover much.
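To make the menu concrete, rough sketches of the zero-fill and propagate policies (the names are made up; the drop policy is the smcor above):

julia> using Statistics

julia> zerocor(x, y) = cor(coalesce.(x, 0.0), coalesce.(y, 0.0));  # fill missings with 0.0 first

julia> propcor(x, y) = any(ismissing, x) || any(ismissing, y) ? missing : cor(x, y);  # any missing poisons the result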


Ironically, I think this is one place where all the code sharing in Julia is detrimental. numpy.mean and polars.mean have different behaviors, and each makes total sense within its own context.