I work often with DataFrames and missing values. It has never annoyed me that I had to use `skipmissing` or `dropmissing` in a few places. In fact, I like the current default, as it forces me to be aware of what I really want to do with missing values. If `skipmissing` ever felt too verbose, I would probably go with something like `const sm = skipmissing`, as others have mentioned above.
I’m not going so far as to say that MY use case is the most common one. In fact, my post was split off from a thread about what people didn’t like about Julia. For my use cases, missing values are uninformative 99% of the time.
Let me show you where I come from. Imagine a data frame where you imported some CSV file. This provides `col1`, a numeric column interpreted as a string (for example, because the source reports “NA” for non-available data). This is what you need to do to take the mean of that column restricted to non-negative values:
```julia
df.col2 = passmissing(tryparse).(Int, df.col1)
isnonnegative(a, b) = !ismissing(a) && (a >= b)
df.col3 = df.col2[isnonnegative.(df.col2, 0)]
df.col4 = mean(skipmissing(df.col3))
```
Rather than having that as the default, I’d prefer to write:

```julia
df.col2 = tryparse.(Int, df.col1)
df.col3 = df.col2[df.col2 .≥ 0]
df.col4 = mean(df.col3)
```
It’s also not about being sloppy in terms of data analysis. I think it’s great to have the option of safe code for cases where missing data is not what you want. That’s why I like that vectors containing `NaN` report `NaN` for the mean: it points out that I’m doing `0/0` without noticing it. But, in my work, missing values show up in a different context.
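For reference, both of those propagation behaviors on plain vectors (nothing package-specific here):

```julia
using Statistics

mean([1.0, 2.0, NaN])    # NaN: the result flags that something went wrong
mean([1, 2, missing])    # missing propagates in the same way by default
```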
The example also shows that the only potential solution would be for Julia 2.0 to add a new type of missing specialized for data analysis. But this would require DataFrames adopting it, which requires a lot of work and agreement about what the community thinks the default behavior should be.
Using macros or shortening the name doesn’t solve the problem, because it’s not about `skipmissing`, as the example shows. It’s about how missings are treated structurally.
Also, shortening function names is not desirable. I prefer that new users read `skipmissing` and know what it means, rather than wondering what `sm` means (e.g., R with its `na.rm`).
```julia
mean(filter(x -> !ismissing(x) && x >= 0, df.col1))
```
I see that you mention parsing… and that `tryparse` returns `nothing` when it can’t parse, not `missing`. So I guess you’d do either this for a one-shot:

```julia
mean(filter(x -> !ismissing(x) && !isnothing(x) && x >= 0, map(x -> tryparse(Int, x), df.col1)))
```
or
```julia
df.col2 = map(df.col1) do x
    a = tryparse(Int, x)
    isnothing(a) ? missing : a
end

mean(filter(x -> !ismissing(x) && x >= 0, df.col2))
```
The point is not whether we can write it more briefly. The point is that after two hours of work and 500 lines of code, it’s like someone keeps reminding you “but what about if you have missing values?” when you were already aware of that. It makes the code harder to both write and read.
In fact, your rewrite avoids the first step and simply combines two steps into one. The relevant comparison would then be:

```julia
df.col2 = passmissing(tryparse).(Int, df.col1)
df.col3 = mean(filter(x -> !ismissing(x) && x >= 0, df.col2))
```

vs

```julia
df.col2 = tryparse.(Int, df.col1)
df.col3 = mean(filter(≥(0), df.col2))
```
In any case, I think it ultimately boils down to what people commonly use. There’s no right or wrong here.
I’m guessing he was imagining a case where CSV.jl returns a column of type `Union{String, Missing}`. Although if you have `missingstring="NA"`, then I think CSV.jl will probably be able to return a column of type `Union{Int, Missing}` instead.
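A minimal sketch of that workflow (the file name and column here are hypothetical):

```julia
using CSV, DataFrames

# assuming the source file writes non-available values as "NA"
df = CSV.read("data.csv", DataFrame; missingstring = "NA")

eltype(df.col1)   # e.g. Union{Missing, Int64} if the remaining values parse as integers
```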
I thought the point was indeed verbosity.
But usually, I follow the pattern of:
- Read the data raw from CSV.
- “Clean” the data, including parsing stuff from strings, replacing known bad values with known good values, etc.
- If I want to ignore missings, subset the entire data frame and continue processing; this may include subsetting multiple times for different queries.
- If I want to impute missings, write a Bayesian model in which the missing data is a parameter.
Basically, by the time I’m done with the first 100 lines of data processing, the “missing” issue is no longer there, because I either have a subset relevant to my calculation that drops the missings, or I’m imputing if needed.
For example, if “wage” is missing whenever people are not employed, then:

```julia
employed = @subset(mydata, :employed)
meanwage = mean(employed.wage)
```
The awkwardness around filtering with something like `df.col3 = df.col2[df.col2 .≥ 0]` is alleviated by `DataFramesMeta.@subset`, which follows SQL and filters out rows where the predicate returns `missing`.
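A minimal sketch of that, assuming a toy `df` with a `col2` column that contains missings:

```julia
using DataFrames, DataFramesMeta, Statistics

df = DataFrame(col2 = [1, missing, -3, 7])

# @subset drops rows where the predicate is false *or* missing,
# so no explicit ismissing check is needed
nonneg = @subset(df, :col2 .≥ 0)

mean(nonneg.col2)   # 4.0
```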
This cannot be emphasized enough: almost ALL of my data munging is done in DataFramesMeta, as God intended.
Adding to the bunch of already suggested solutions to make skipping missings easier: maybe a package like Missings/DataFrames/DataFramesMeta/… (I don’t use any of them, so I’m not familiar with their interactions) could define a macro like `@skipmissings` (or `@sm`) that:
- Makes all aggregation functions it knows skip missings
- Transforms comparisons like `a < b` into `!ismissing(a) && !ismissing(b) && a < b`
- Changes whatever other semantics are wanted
If it existed, the advice to those who want such behavior would just be “slap `@sm` at the beginning and that’s it”.
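A rough sketch of the comparison-rewriting part of that idea (purely hypothetical; no package provides this today, and it leaves out the aggregation-function part):

```julia
const _CMP_OPS = (:(<), :(>), :(<=), :(>=), :(≤), :(≥), :(==), :(!=))

_sm_rewrite(x) = x
function _sm_rewrite(ex::Expr)
    if ex.head == :call && length(ex.args) == 3 && ex.args[1] in _CMP_OPS
        op, a, b = _sm_rewrite.(ex.args)
        # NB: `a` and `b` are evaluated twice below; a real implementation
        # would hoist them into temporaries first.
        return :(!ismissing($a) && !ismissing($b) && $op($a, $b))
    end
    return Expr(ex.head, _sm_rewrite.(ex.args)...)
end

macro sm(ex)
    esc(_sm_rewrite(ex))
end

@sm(missing < 3)   # false instead of missing
@sm(2 < 3)         # true
```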
Btw, finding some nice Unicode characters for `skipmissing`/`passmissing` to shorten existing code would also help!
Handling missing values is always tricky and I have been bitten in all directions. Generally, for
- interactive use, I prefer low-boilerplate functions with quick-and-dirty defaults, e.g., `mean` just ignores missings;
- programming, I prefer explicit handling and functions with sane defaults, i.e., throwing meaningful errors if I don’t handle missing values properly.
Macro solutions have already been mentioned. Another option might be a custom REPL which rebinds several stats functions to dwim (do what I mean) versions – even though it’s not always clear what that would be, e.g., for boolean indexing I encountered use cases where missing values should be skipped as well as ones where they needed to be kept.
I want to emphasize that the surveys I work with have 1000+ variables and we create maybe a hundred or more new variables derived from those. Creating a new subset every time we want to handle missing values is not feasible. It’s better to have a single data frame with 1000+ columns as a source of truth, like in Stata, than to have separate subsets. Hence the need for `skipmissing` all the time.
This isn’t stuff I do every day, but I have done things like this (analysis of the ACS survey or consumer expenditure survey, etc.), and in cases like these I typically don’t work with simple summary statistics at all, and immediately move into Bayesian models of the thing I’m interested in. If I care enough about missings I do imputation, and if I don’t care about missings I just drop that data entirely, possibly taking a probabilistic subsample anyway.
Like, how would you handle the case where wage is either missing because the person is unemployed, or missing because they are employed but failed to answer the question? I just find that skipmissing is almost never what I want.
Instinctively, my approach is “the mean represents the mean of the people who have non-missing data”, along with “check for correlates with missing values and run some regressions to show it’s not correlated with demographics etc.”. If there are issues, re-weight or something to reconstruct a particular sampling frame. Maybe the analysis at the end of the day is re-weighted, maybe it isn’t.
Or construct Manski bounds with imputations for best- and worst-case scenarios.
This may heavily reflect a difference between a Bayesian viewpoint and a Frequentist viewpoint.
To me, the Bayesian parameter represents the mean of whatever that model refers to, and the data are just individual data points that inform the parameter.
If the tourist uses Pandas, it propagates missing values in binary addition, yet ignores them in sums, so that adding is not like summing. There is also an experimental `pd.NA` type that does some things differently, but is not finalized. If the tourist uses sklearn, then some classifiers fail on missing and some do not. If they use NumPy, there are years of NumPy enhancement proposals, summarized here, which have been in deferred status since 2012 and are thus unresolved. Meanwhile there is `np.nansum`, but someone using `np.sum` would need to switch over, just as they would in Julia with `NaNStatistics.jl`.
I’m not sure it would inspire much confidence in Julia’s commitment to make data analysis feel ergonomic.
I don’t think this is fair. Even Python, despite its maturity, has not fully resolved how to handle missings, and each package has its own behaviors. This is not to diminish @adienes’s desire for easier solutions, just to point out that it’s not easy to find one that everyone agrees with, since there are lots of different applications.
I agree with this. In fact, the only solution I see would be creating some data type with `unsafe_missing`, like @lmiq was suggesting. This would let both `missing` and `unsafe_missing` coexist.
However, this would require a redesign of multiple packages. Hence, Julia’s community should ultimately decide whether it’s worth the effort, according to how pervasive the issue is (I have a hunch it’s a widespread sentiment, but that’s just a conjecture).
I’d also anticipate that the implementation wouldn’t be so simple. For instance, when you create a column based on a subset of a data frame, the values not belonging to the subset are filled with `missing`. In this sense, it requires deciding which type of missing should be taken as the default behavior in data frames.
All other solutions I can think of hurt more than they help. For instance, `InMemoryDatasets` attempts to solve this through type piracy, which I think is the worst possible solution. Shortening names does not really solve the problem (besides, you can always implement shortcuts on your own if needed, which I do when I’m prototyping).
I second this. I do data science with Julia and love this feature. I work with a lot of different data sets, and a very common issue I come across when working with financial data is that, for one reason or another, when an amount is zero, it shows up as missing in the data set. When that happens, I know I need to replace the `missing` values with `0.00` and my column should be `Float64`. If those zeros were excluded, it would totally wreck any and all statistics computed for a lot of those data elements…
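For what it’s worth, one way to do that replacement (the `amount` column here is made up):

```julia
using DataFrames

df = DataFrame(amount = [12.5, missing, 3.0, missing])

# coalesce.() substitutes 0.0 for each missing; broadcasting narrows the
# result to a plain Vector{Float64}
df.amount = coalesce.(df.amount, 0.0)

eltype(df.amount)   # Float64
```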
I often just define my own parsing function at the top of my script/notebook, customized for the current analysis. This is fully transparent, concise, and flexible.
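For example, such a helper might look roughly like this (the name `parse_int_col` and the target type are just placeholders):

```julia
# parse a string column to Int, turning unparseable entries and
# existing missings into missing
function parse_int_col(v)
    map(v) do x
        ismissing(x) && return missing
        y = tryparse(Int, x)
        isnothing(y) ? missing : y
    end
end

df.col2 = parse_int_col(df.col1)
```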
Absolutely. A macro `@sm` and maybe a shortened function `sm` are the only real solutions to these complaints. We just need to make them easy.
@adienes if you look back at your statement:
It was heavily outvoted in discussion, if hearts count for anything. So from this small survey it seems people don’t actually share that opinion. And it’s a breaking change, so it’s not going to happen. Can we focus on something actionable going forward?
Spinoff feature: with `skipdissenting`, the data says everyone agrees.