Why are missing values not ignored by default?

  1. not all those hearts are from data scientists
  2. data scientists overwhelmingly don’t use Julia in the first place (perhaps in part because they don’t want to write skipmissing 100 times a day?)

so I think the heart count is a complete non sequitur to my assertion

And it’s a breaking change, so it’s not going to happen

this part is true, very unfortunately.

Compared to the size of the Python user base, nobody uses Julia.

On the other hand, I specifically use Julia for data analysis, for a number of reasons. The first is that DataFrames and DataFramesMeta are just vastly superior to the tidyverse and its intense reliance on fexprs, which ultimately makes it incredibly unusable for anyone who isn’t a naive non-programmer (i.e., for actual programmers). The second is that Julia itself is fast, and I can write simulations directly in Julia that run 100x faster than in R or Python. The third is that Julia developers come from a consistent point of view that I agree with, and one of those points of view is that you never silently change the meaning of stuff under the hood.

In Julia, if you want to change the meaning of stuff, we have an explicit mechanism: macros, which rewrite code. If something is supposed to have a special meaning, it gets called out with the @ symbol. This is tremendously helpful for understanding what code does.

The benefit of macros over fexprs was hashed out 40 years ago in “Special Forms in Lisp” by Kent Pitman (August 1980). I think Kent was right, and R/tidyverse is demonstrably wrong.

Another thing I think is demonstrably wrong is a system that silently ignores missing values rather than ignoring them only when explicitly told to. This is why na.rm=F is the default in R. Julia and R are both correct about this.

On the other hand, I acknowledge that if you want to write a block of code where you always ignore missing, then it would be annoying to have to type skipmissing each time.

Enter Julia’s explicit mechanism for this… The Macro! I don’t have a macro written but with MacroTools.jl it should be a few tens of lines to write an @skipm macro to insert skipmissing into all arguments of the functions listed in a given list of functions.

@skipm funcs expr

With perhaps a default set of funcs so you can leave that out most of the time.

Candidates for funcs might be things like
[:mean, :sum, :median,:quantile,:>,:<,:==]

I’m sure you can think of more…

This is the explicit mechanism that Julia has created to enable you to change the meaning of code. That’s what macros do, they alter the semantics of code by transforming it to other code. This is explicitly one of the reasons I use Julia because it gets this right!
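
A rough sketch of what such a macro could look like, written with Base only rather than MacroTools.jl. Everything here is an assumption made for illustration: the helper name `skipm_rewrite`, the default list (restricted to aggregations), and the decision to leave keyword arguments alone. It does not handle function-first forms such as `mean(f, x)` or broadcast calls.

```julia
using Statistics

# Assumed default list for this sketch: aggregation functions only.
const SKIPM_DEFAULT_FUNCS = [:mean, :sum, :median, :var, :std]

# Walk the expression tree; whenever a call to one of `funcs` is found,
# wrap each positional argument in `skipmissing(...)`.
function skipm_rewrite(ex, funcs)
    ex isa Expr || return ex                       # symbols and literals stay as-is
    args = Any[skipm_rewrite(a, funcs) for a in ex.args]
    if ex.head === :call && args[1] in funcs
        for i in 2:length(args)
            a = args[i]
            # keyword arguments (`:kw`, `:parameters`) are left untouched
            a isa Expr && a.head in (:kw, :parameters) && continue
            args[i] = Expr(:call, :skipmissing, a)
        end
    end
    return Expr(ex.head, args...)
end

# `@skipm expr` uses the default list.
macro skipm(ex)
    esc(skipm_rewrite(ex, SKIPM_DEFAULT_FUNCS))
end

# `@skipm funcs expr` expects a literal vector of symbols, e.g. [:sum, :mean].
macro skipm(funcs, ex)
    names = Symbol[q.value for q in funcs.args]
    esc(skipm_rewrite(ex, names))
end

x = [1, missing, 3]
@skipm(mean(x) + sum(x))      # expands to mean(skipmissing(x)) + sum(skipmissing(x))
@skipm([:sum], mean(x))       # mean is not listed here, so this still returns missing
```

The `@` at the call site is the point: the reader can see that the enclosed code has been rewritten.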

My suspicion is that @adienes doesn’t know macro programming. This isn’t intended as a dig or denigration or anything. It’s just that if he knew macro programming he’d probably have just written the macro! So this is a great opportunity to dig into macro programming, because it’s the exact mechanism that Julia creates for solving this problem!

It would be helpful if you could provide specific examples in R or Python where missing handling is better than in Julia.

in polars:

df.select(pl.corr(col1, col2))

vs in Julia you have to write your own

function missing_cor(a,b)
    mask = (!).(ismissing.(a) .|| ismissing.(b))
    cor(a[mask], b[mask])
end

combine(df, [:col1, :col2] => missing_cor)

Stealing directly from Bogumil’s blog, there is already a somewhat nice interface for this in StatsBase:

julia> using DataFrames, Statistics, StatsBase

julia> df = DataFrame(x=[1, missing, 3, 4], y = [4, 3, missing, 1])
4×2 DataFrame
 Row │ x        y
     │ Int64?   Int64?
─────┼──────────────────
   1 │       1        4
   2 │ missing        3
   3 │       3  missing
   4 │       4        1

julia> pairwise(cor, eachcol(df), skipmissing=:pairwise)
2×2 Matrix{Float64}:
  1.0  -1.0
 -1.0   1.0

Another option is to use skipmissings from Missings.jl:

julia> using Missings

julia> x=[1, missing, 3, 4]; y = [4, 3, missing, 1];

julia> smx, smy = skipmissings(x, y);

julia> cor(collect(smx), collect(smy))
-1.0

Note that Polars currently doesn’t even give you the option of propagating missing values in pl.corr, so if you wanted propagation by default for safety, then you would be out of luck. (I often don’t realize that I have missing values in some columns until I get a missing result from a calculation.)

Base Julia provides the foundations for handling missing values. And I think Base Julia made the correct decision by following the three-valued logic that R and SQL use. It’s perfectly feasible for Base Julia and packages to build on that foundation to make missing value handling as ergonomic as possible. We already have some of that with the following:

  • Base.coalesce
  • Base.skipmissing
  • Missings.passmissing
  • Missings.skipmissings
  • StatsBase.pairwise
  • DataFrames.subset(...; skipmissing=true)
  • DataFramesMeta.@subset(...)
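
For readers less familiar with the Base building blocks, here is a small sketch of three-valued logic together with `coalesce` and `skipmissing`. It uses only Base and the Statistics standard library; the input vector is made up for illustration.

```julia
using Statistics

x = [-1, missing, 1, 2]

# Comparisons propagate missing instead of guessing an answer.
cmp = x .>= 0                          # [false, missing, true, true]

# Logical operators follow three-valued logic: the result is missing
# only when the unknown value could change the outcome.
true | missing                         # true
false | missing                        # missing
false & missing                        # false

# Aggregations propagate missing unless told otherwise.
sum(x)                                 # missing
mean(skipmissing(x))                   # mean of -1, 1, 2

# coalesce replaces missing with an explicit fallback, e.g. to build a Bool mask.
mask = coalesce.(cmp, false)           # [false, false, true, true]
```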

Let’s focus on the filtered mean. Sticking to Base Julia, the above can be rewritten in one line by using coalesce. First, let me define the input vector:

x = [-1, missing, 1, 2]

Now we just need one line of code:

mean(x[coalesce.(x .≥ 0, false)])

If you want to use DataFrames, you can use subset with the skipmissing keyword argument, like this:

# Input:
using DataFrames, Statistics
df = DataFrame(x=[-1, missing, 1, 2])

# Analysis code:
dfsub = subset(df, :x => ByRow(≥(0)); skipmissing=true)
mean(dfsub.x)

I do not doubt that Julia has any capability needed if the user is willing to type a little more. it is just clearly a debate about whether or not the “safety” is worth the “clunkiness”

(in fact, I think status quo is neither “safer” nor necessarily “clunky”, so I am just using the words to be generous to both viewpoints)

to me, the answer to that question is that it is absolutely not worth it, but to others apparently it is

I am not sure how much farther we can go in this debate since it is almost purely personal preference and Julia has already planted the flag down on its existing semantics. I just get prickly when my preferred behavior is referred to as “wrong,” “unsafe,” “negligent,” etc. because I strongly feel that it is none of those things; it is just a different design choice is all

There is the following option:

cor(eachcol(dropmissing(df2[!,[:x1,:x2]]; view=true))...)

which is used in:

julia> using DataFrames, Random, Statistics; Random.seed!(10);

julia> df2 = DataFrame([rand() < 0.3 ? missing : rand() for i in CartesianIndices((10,3))], :auto)
10×3 DataFrame
 Row │ x1               x2                x3             
     │ Float64?         Float64?          Float64?       
─────┼───────────────────────────────────────────────────
   1 │       0.0876651        0.85494           0.89628
   2 │ missing          missing           missing        
   3 │ missing                0.210642          0.80741
   4 │       0.536775         0.93207           0.759976
   5 │       0.607473         0.00306044  missing        
   6 │       0.574958         0.184803          0.760385
   7 │       0.0317978  missing                 0.727141
   8 │ missing                0.00605547        0.580357
   9 │ missing          missing           missing        
  10 │       0.0523389        0.597275    missing        

julia> cor(eachcol(dropmissing(df2[!,[:x1,:x2]]; view=true))...)
-0.5343481606122596

I am not taking any side strongly, but this is very similar to other tradeoffs, such as Int overflow, view vs. copy, and bounds checking. Resolving these issues requires treading carefully, or maybe even a benevolent decider.

I don’t want to drag this out, but can we at least agree that missing should propagate through comparison operators? That is what R, SQL, and Polars all do. Here is a Polars example:

In [10]: df = pl.DataFrame({"x": [42, None]})

In [11]: df.with_columns(y = pl.col("x") >= 0)
Out[11]:
shape: (2, 2)
┌──────┬──────┐
│ x    ┆ y    │
│ ---  ┆ ---  │
│ i64  ┆ bool │
╞══════╪══════╡
│ 42   ┆ true │
│ null ┆ null │
└──────┴──────┘

In [12]: df.with_columns(y = pl.col("x") >= 0).filter("y")
Out[12]:
shape: (1, 2)
┌─────┬──────┐
│ x   ┆ y    │
│ --- ┆ ---  │
│ i64 ┆ bool │
╞═════╪══════╡
│ 42  ┆ true │
└─────┴──────┘

Given that missing values should propagate through comparison operators, it makes sense that the skipping of missing values (returned by predicates) should happen during the filtering step. That is how it works in SQL, Polars, dplyr, and DataFrames.subset(...; skipmissing=true).

(I mean, if you are doing a filtering… Obviously sometimes you aren’t doing a filter.)

yes, we agree on that.

I think it got a little lost in the long thread, but I still currently stand by this one:

I would be in favor of adding skip-missing versions of Base aggregation functions to Missings.jl, e.g. smean, ssum, etc. They’re commonly enough needed that it would be nice to have them at your fingertips with a simple using Missings.

Perhaps it is time for a poll. Someone who has read this thread from the beginning could write a question with possible solutions, or a couple of questions (at most), and we can see where people stand.

There is a difference between changing the API at its root for everyone and changing the APIs for preference.

I’m still very confused why the majority of the conversation here is about changing some established API whereas Julia gives the user many facilities to customize the API both through packages and at the user level.

I’m particularly concerned about these comments. The Julia 2 reference is really a fanciful distraction. In reality, it is probably quite far away and ultimately will likely not contain the changes that you want. My expectation for Julia 2 is that it will only introduce some minor but really important breaking changes but keep functioning as is for the large majority of Julia 1 code.

More importantly, I’m not particularly convinced that this requires a breaking change to Julia or a fundamental change to how any of the packages mentioned work.

There are two approaches to customizing the API.

  1. Shadow the functions
  2. Introduce new types

Introducing new functions to shadow existing APIs

I’ve shown examples of how methods can be shadowed above. Essentially, this is reminiscent of how things are in Python. We emulate APIs by placing similarly named APIs in distinct namespaces. This is often discouraged in Julia because by itself it does not compose as well. However, it is significantly simpler and creates less risk of introducing compilation-related issues such as invalidation. We can actually have it both ways in Julia by having core packages which implement their own APIs in separate namespaces and separate packages which overload Base methods and forward to the namespaced versions.

Instead of mangling the name, why not just scope them into the module? They could be sm.mean and sm.sum. We could actually do both approaches. There could be a submodule that is meant for scoped names and another one with prefixed names, both pointing to the same underlying implementation.
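
A minimal sketch of that idea, assuming a hypothetical module name `SkipMissingStats` and covering only `mean` and `var`. The scoped names and the prefixed names are bound to the same functions, and nothing in Base or Statistics is overloaded.

```julia
module SkipMissingStats   # hypothetical module name, for illustration only

import Statistics         # `import`, so Statistics.mean is not brought into scope

# Scoped versions: new functions that happen to share a name with Statistics'.
mean(x) = Statistics.mean(skipmissing(x))
var(x)  = Statistics.var(skipmissing(x))

# Prefixed aliases pointing at the same implementations.
const smean = mean
const svar  = var

end # module

const sm = SkipMissingStats   # short alias, so calls read as sm.mean(x)

x = [1, missing, 3]
sm.mean(x)                    # mean of 1 and 3
SkipMissingStats.svar(x)      # same function as sm.var
```

Because `sm.mean` is a different function from `Statistics.mean`, code that never opts in keeps the propagating behavior.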

Using Types to Modify the API

There are a few ways to introduce new types in order to customize an API. The one that seems to have been discussed above is to introduce a new kind of Missing. For UnsafeMissing I just want to point out that it is completely possible to implement that in a package.

Another approach for introducing new types is to create wrappers. In this example, we could wrap a DataFrame in order to change the behavior. This in turn could return wrapped columns, which return wrapped or replaced missing values when indexed. This would allow us to effectively overlay our API preferences over the existing API.

Here’s a lightweight example of this.

Setup Code
julia> using CSV, DataFrames, Statistics


julia> struct SkipMissingDataFrame
           parent::DataFrame
       end

julia> Base.parent(smdf::SkipMissingDataFrame) = getfield(smdf, :parent)

julia> Base.getproperty(smdf::SkipMissingDataFrame, sym::Symbol) = skipmissing(Base.getproperty(parent(smdf), sym))

julia> write("blah.csv","""
       "col1", "col2"
       "5", "6"
       "1", "2"
       "30", "31"
       "22", "23"
       "NA"
       "50"
       """)
65
julia> df = CSV.read("blah.csv", DataFrame; silencewarnings=true);
julia> smdf = SkipMissingDataFrame(df)
SkipMissingDataFrame(6×2 DataFrame
 Row │ col1     col2    
     │ String3  Int64?  
─────┼──────────────────
   1 │ 5              6
   2 │ 1              2
   3 │ 30            31
   4 │ 22            23
   5 │ NA       missing 
   6 │ 50       missing )

julia> smdf.col2 |> mean
15.5

julia> smdf.col2 |> x->Iterators.filter(>(10),x) |> mean
27.0

We can of course bikeshed the name. To me, the name sm.mean looks like a property access on an object, so it should be SM.mean. But that looks ugly. :slight_smile:

It seems like the main objection to mean(skipmissing(x)) is that it is too verbose and too many keystrokes, so let’s make the abbreviation as short as possible:

  • smean
  • ssum
  • svar
  • scor

I’ve opened an issue here:

My thought is: what if they just want mean to be Missings.smean? I suppose they could do const mean = smean or using Missings: smean as mean.

The really nice affordance that @mkitti is suggesting is that you can write using SkipMissings: mean at the very beginning and then it will shadow mean for the code you write.

smean seems fine as long as it is as performant and stable as Base.mean (which is currently not the case). cor and cov are the more painful cases, so scor and scov are more badly needed

I opened an issue in DataFrames.jl a few weeks ago: “feature request: allow `skipmissing` column types” (Issue #3398, JuliaData/DataFrames.jl)

since an alternate approach would be to allow columns to be of type SkipMissing, and then I could call skipmissing!(df) , analogous to the existing allowmissing!(df) and all aggregations would have the behavior I want

I think import SM: mean and then seeing, 100 lines down

mean(x)

is definitely a code smell. It’s not good to have different behavior so far from the call site. I think a better scenario is skip(mean)(x), where skip wraps a function and does pre-processing on x to ensure missings are skipped. This has the benefit of not requiring rewriting lots of function definitions.
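
One way such a wrapper could be sketched with Base only. The name `skipwrap` is hypothetical and stands in for `skip`, since Base already exports a `skip` function for IO streams. This version assumes all positional arguments are vectors of equal length and drops every index at which any of them is missing, which is one possible choice for multi-argument functions like cor.

```julia
using Statistics

# Hypothetical higher-order helper: returns a version of `f` that first removes
# the positions where any argument is missing, then calls `f` on the rest.
function skipwrap(f)
    function skipped(args::AbstractVector...; kwargs...)
        # keep[i] is true when no argument is missing at index i
        keep = [!any(ismissing, row) for row in zip(args...)]
        # after masking no missings remain; collect narrows the element type
        cleaned = map(a -> collect(skipmissing(a[keep])), args)
        return f(cleaned...; kwargs...)
    end
    return skipped
end

x = [1, missing, 3, 4]
y = [4, 3, missing, 1]

skipwrap(mean)(x)        # mean of 1, 3, 4
skipwrap(cor)(x, y)      # correlation over rows 1 and 4 only
```

The wrapping stays visible at the call site, which is the property argued for above, at the cost of a few more characters than a dedicated smean.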

I’m not convinced that’s a good idea. It goes against the spirit of generic programming in Julia. The mean and smean functions are fundamentally different functions.

That’s not much different from mean(skipmissing(x)). Again, I think we want to go for the absolute minimum amount of typing, hence smean. Minimizing typing and verbosity is the main goal of these shortcut functions.

Also, as @adienes mentioned, the simplification provided by scor and svar is a big win. I’m not sure if skip(cor)(x) could provide that… maybe skip(cor; skipmissing=:pairwise)(x, y), but that seems awkward.

skip(cor) would indeed provide that. See here