This just doesn’t work when you want to use multiple packages and each package maintainer has a different preference, right?
I think what @alfaromartino wants here are predicates for filtering that also remove the missings, to enable things like x[ishigher.(x, b)] or filter(a -> ishigher(a, b), x). That way you no longer need to nest with skipmissing, as you would today with skipmissing(x[x .> b]) or filter(>(b), skipmissing(x)).
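As a rough sketch of what such a predicate could look like (ishigher is just the hypothetical name used above, not an existing function):

ishigher(a, b) = !ismissing(a) && a > b  # false for missing, so indexing/filtering drops those elements

x = [1, missing, 3, missing, 5]
x[ishigher.(x, 2)]              # [3, 5], no skipmissing needed
filter(a -> ishigher(a, 2), x)  # same elements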
That’s the thing, those scalar operations don’t skip missings because they propagate one argument’s missings and can’t remove elements like skipmissing does, which matters for operations that depend on the collection’s length: e.g. replacing missings with falses instead of skipping will affect mean. It seems like disparate actions are being muddled here just because they all handle missings.
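A toy example of that last point (just for illustration):

using Statistics
x = [true, missing, true]
mean(skipmissing(x))       # 1.0, averaged over the 2 observed values
mean(coalesce.(x, false))  # ≈ 0.67, the replaced false counts as a third observation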
Can anyone point to a publicly available (maybe govt produced) moderate size dataset with missing values and some suggested analyses that could be done with them so we can have a kind of “playground”?
I could probably dig some up with a bit of research, but maybe someone has a really good example along the lines of a manageable version of @pdeffebach’s comment:
Maybe not 1000+ variables, but something with a few tens of thousands of rows, a hundred-ish columns, and liberal use of missing?
julia> using DataFrames
julia> using ShiftedArrays: lag
julia> df = DataFrame(rand(100, 2), :auto);
julia> df.x1_lag1 = copy(lag(df.x1));
julia> df.x2_lag2 = copy(lag(df.x2, 2));
Now compute a cor of x1_lag1 and x2_lag2 without pulling your hair out, and for a subset of lags restricted to positive values!
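(For reference, the “by hand” version today looks roughly like the sketch below: build a mask of rows where both lagged columns are present, add the positive-value restriction, then index both columns with it; disallowmissing comes from Missings.jl.)

using Statistics, Missings
mask = .!ismissing.(df.x1_lag1) .& .!ismissing.(df.x2_lag2) .&
       coalesce.(df.x1_lag1 .> 0, false)
cor(disallowmissing(df.x1_lag1[mask]), disallowmissing(df.x2_lag2[mask]))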
Shouldn’t you provide a different default value for lag, then?
no?
sorry if that’s curt, but I just don’t really see how that helps. a default value would not be appropriate
Yeah, it’s just not clear to me what the desired outcome is from manually introducing different numbers of missings into different columns and performing an operation over them.
to get a correlation of one feature lagged to another
this is super-duper-giga-extremely common in building signals on financial data
I’m not a data scientist, but is there more to it than this?
julia> nomissingdf = dropmissing(df);
julia> using Statistics
julia> cor(nomissingdf.x1_lag1, nomissingdf.x2_lag2)
0.13004606550437073
For other filtering steps (e.g., positive values only), apply them to nomissingdf as you would otherwise?
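Something like this, I’d guess (just a sketch reusing the columns from above):

pos = nomissingdf[(nomissingdf.x1_lag1 .> 0) .& (nomissingdf.x2_lag2 .> 0), :]
cor(pos.x1_lag1, pos.x2_lag2)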
I’m aware of what correlations are, but I definitely don’t work with financial data. I just don’t understand why you’re introducing missings into it. Is there no standard practice for shifted data, like zero-padding or excluding non-overlapping ends?
now imagine you have a bunch more features and a bunch more lags, you do a diagonal join, etc. etc.
if you dropmissing(df) you will have 0 rows. cor should only drop the rows used for that calculation, not every single row in the entire table that contains any missing.
@Benny this basically is “excluding non-overlapping ends,” I just don’t want to do that by hand over and over and over
Just in case, the matter is not whether you can perform operations like this. You can. In fact, you could simply do:
df2 = dropmissing(df, [:x1_lag1, :x2_lag2], view=true)
cor(df2.x1_lag1, df2.x2_lag2)
Note that writing dropmissing(df) drops more rows than you need. And if you also want to modify the original df, you need to add view=true.
But every two lines of code you write, you have to think about missings for some types of work where missings are uninformative. It gets extremely tiring after a few hours.
Julia has always been about making code as simple as possible, showing how you can implement in 10 lines what other languages need 100 lines for. So this is a case where the opposite happens: code that should be 500 lines ends up being 1000 lines full of skipmissing, passmissing, dropmissing, your own functions like ishigher instead of .>, etc.
But, again, I’m not sure there’s an easy solution.
I think this is what is hard to convey. I can understand why some people might feel unsympathetic when there are so many ~clear and easy~ solutions to this problem. but after the 1000th time the language says “neener neener, didja think about missing???” and having to reply “yes, I promise, it’s ok” with 11 unnecessary characters, it gets exhausting
I mean, you have a specific use case where you are going to repeatedly do a similar thing over and over, handling missing in a way that’s consistent for you… So write a function that does that? smcor() or something.
This doesn’t seem different to me than having a particular kind of plot you want to do over and over and writing a particular plotting function to do it.
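For example, a minimal sketch of such a helper (smcor is just the hypothetical name above; it keeps pairwise-complete observations and then calls cor, with disallowmissing coming from Missings.jl):

using Statistics, Missings
function smcor(x::AbstractVector, y::AbstractVector)
    keep = .!(ismissing.(x) .| ismissing.(y))  # pairwise-complete observations
    return cor(disallowmissing(x[keep]), disallowmissing(y[keep]))
end

smcor(df.x1_lag1, df.x2_lag2)  # same idea as the dropmissing version above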
right, and now we’re back to “if you like polars’s missing semantics so much why don’t you just go rewrite your own statistical libraries”
I’m not saying you’re entirely wrong, but this is definitely not a particularly user-friendly response to a touring data scientist who just wants to get some work done
Anyone who wants scor in Missings.jl should chime in here:
I really doubt it, too. Sure it’s annoying to assemble building blocks so often, and it’s also annoying to make custom functions to do the repeated patterns quicker. But it doesn’t make sense to distribute many versions of the same thing with different defaults and processing.
About the cor example, it makes sense to drop the non-overlapping missing ends to compute a correlation. It could also make sense in some contexts to make the defaults zeros in the first place. If the data had a bunch of missings in the middle, it could make sense to not drop those and propagate the missing to indicate that the data is too poorly recorded to be useful. Do they all deserve their own version of cor or their own version of Statistics? We can only introduce so many new things before it becomes bloat, and even then it won’t cover much.
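To make that concrete with made-up toy values (not from the thread), the three policies really do give three different answers:

using Statistics, Missings
x = [missing, 1.0, 2.0, 3.0]  # missing at the start, like a lag
y = [2.0, 4.0, 6.0, missing]  # missing at the end
keep = .!(ismissing.(x) .| ismissing.(y))
cor(disallowmissing(x[keep]), disallowmissing(y[keep]))  # drop the ends: 1.0 (only two points overlap)
cor(coalesce.(x, 0.0), coalesce.(y, 0.0))                # fill with zeros: -0.2
cor(x, y)                                                # propagate: missing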
Ironically, I think this is one place where all the code sharing in Julia is detrimental. numpy.mean and polars.mean have different behaviors, and each makes total sense within its own context.