Why are missing values not ignored by default?

Can you tell me a little bit more about how this new type works? What would you like it to do? Is it basically missing with different default behavior?

Here’s some code to start with along the lines of what I provided above.

Setup code for `MyMissing` and `MyMissingDataFrame`
julia> using CSV, Statistics, MappedArrays, DataFrames

julia> struct MyMissing end

julia> const mymissing = MyMissing()
MyMissing()

julia> struct MyMissingDataFrame # <: AbstractDataFrame (work in progress)
           parent::DataFrame
       end

julia> Base.parent(mmdf::MyMissingDataFrame) = getfield(mmdf, :parent)

julia> Base.getproperty(mmdf::MyMissingDataFrame, sym::Symbol) =
           mappedarray(x->ismissing(x) ? MyMissing() : x, Base.getproperty(parent(mmdf), sym))

julia> Base.ismissing(::MyMissing) = true

julia> Base.nonmissingtype(::Type{T}) where {T >: MyMissing} = Base.typesplit(T, MyMissing)

julia> mmdf = MyMissingDataFrame(df)
MyMissingDataFrame(6×2 DataFrame
 Row │ col1     col2    
     │ String3  Int64?  
─────┼──────────────────
   1 │ 5              6
   2 │ 1              2
   3 │ 30            31
   4 │ 22            23
   5 │ NA       missing 
   6 │ 50       missing )

julia> mmdf.col2
6-element mappedarray(var"#5#6"(), ::SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}) with eltype Union{MyMissing, Int64}:
  6
  2
 31
 23
   MyMissing()
   MyMissing()

One of the issues with adding a new missing type is that the current missing is an integral part of three-valued logic (3VL) in Julia. Here’s the 3VL truth table for |:

julia> using DataFrames

julia> x = [true, true, true, false, false, false, missing, missing, missing];

julia> y = [true, false, missing, true, false, missing, true, false, missing];

julia> DataFrame(x = x, y = y, x_or_y = x .| y)
9×3 DataFrame
 Row │ x        y        x_or_y
     │ Bool?    Bool?    Bool?
─────┼───────────────────────────
   1 │    true     true     true
   2 │    true    false     true
   3 │    true  missing     true
   4 │   false     true     true
   5 │   false    false    false
   6 │   false  missing  missing
   7 │ missing     true     true
   8 │ missing    false  missing
   9 │ missing  missing  missing

If we add newmissing, that truth table will explode in size—we would now be dealing with four-valued logic (4VL). That sounds like a nightmare to me.

To clarify, the three values referred to in 3VL are three logical values: true, false, and unknown. Julia follows SQL and R and cheats by representing unknown as missing. (This has some subtle issues, since unknown == unknown should return true, but missing == missing returns missing.) A four-valued logic would have a fourth logical value, like newunknown. But continuing with the missing tradition, a 4VL logical vector would be of type Union{Bool, Missing, NewMissing}.
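
As a small illustration of the equality point above, here is how Base's `missing` behaves under the three notions of equality. This only uses functions from Base and is meant as a quick check, not as an argument for either design.

```julia
# `==` follows three-valued logic: comparing two unknowns is itself unknown
eq = missing == missing          # missing

# `isequal` and `===` treat missing as an ordinary value
same = isequal(missing, missing) # true
ident = missing === missing      # true
```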

For instance, these are some functions I’d apply for the types of missing I have in mind.

[image: function definitions, not preserved in the text]

Ideally, my code would be more readable if I could directly use expressions like x .> 0 instead. In addition, functions like sum, mean, cor, etc., should also ignore the type of missing.

In any case, the implementation may be so complicated that it’d be too hard to do. I imagine there must be lots of special cases that should be thought of, unintended consequences I’m not considering, breaking composability with other packages, etc.

Imho, there is not much difference between using a type providing additional methods vs shadowing existing methods. In both cases, the underlying problem is that there are two reasonable default choices of how missing values are to be treated and we need some way to configure code to select one of them:

# Pick any of these methods to configure your analysis
# 1. Import shadowing methods
using Missings.SkipMissing: mean  # hypothetical module, for illustration only
# 2. Convert data to new type
df = MyMissingDataFrame(df)
...
# some analysis much later in the file
mean(df.whatever)

In any case, functions with the same name are configured to behave differently. More local approaches are also conceivable, i.e., using macros or blocked definitions as syntactic markers:

let mean = mean ∘ skipmissing
     ...
     mean(df.whatever)
end
# or macro based
@with_missing_skipped begin
    ...
    mean(df.whatever)
end
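
A runnable version of the `let` variant, as a minimal sketch: the vector `x` is made up for illustration, and only `Statistics` from the standard library is assumed. The macro variant is not shown because `@with_missing_skipped` is hypothetical.

```julia
using Statistics

x = [1.0, missing, 3.0]   # illustrative data

# Default behavior: the missing value propagates
default_mean = mean(x)    # missing

# Inside the block, `mean` is rebound locally to skip missing values
local_mean = let mean = mean ∘ skipmissing
    mean(x)               # 2.0
end
```

Outside the `let` block, `mean` keeps its usual behavior, which is what makes this a local rather than a global change of defaults.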

In the end, it’s mainly about defaults and how easily – also in terms of readability – these can be changed.

Yes, when I said we should wait for Julia 2.0, I was trying to convey the idea that a new type of missing is maybe too complex and affects the fundamentals of Julia. So maybe there’s no solution, which would explain why Python and other languages had to make a decision about it, and why none of them covers all cases.

I don’t see what you mean. 3VL is not a concept in Julia. Missing is just some type, we can introduce another type that behaves similarly. Why would we need 4VL instead of just Union{Bool, NewMissing}?

This must be some sort of record: almost 150 posts about something that is not there!

This is not too hard actually. If I understand your code, the binary operator always returns false if a missing is present, right?

Also, I think you could have written these more succinctly.

ishigheq(a, b) = !ismissing(a) && a >= b
ishigher(a, b) = !ismissing(a) && a > b
islower(a, b)  = !ismissing(a) && a < b
isloweq(a, b)  = !ismissing(a) && a <= b

edit: fixed the inversion, thanks to @Benny

Anyways, that said, here are the two additional binary operator definitions that we need.

julia> Base.:<(a::MyMissing, b) = false

julia> Base.:<(a, b::MyMissing) = false

Here’s a demo.

julia> 5 < mymissing
false

julia> mymissing < 5
false

julia> 5 > mymissing
false

julia> 3.0 >= mymissing
false

julia> 5 <= mymissing
false

julia> mmdf.col2 .< 5
6-element BitVector:
 0
 1
 0
 0
 0
 0

julia> mmdf.col2 .> 5
6-element BitVector:
 1
 0
 1
 1
 0
 0

julia> mmdf.col2 .<= 6
6-element BitVector:
 1
 1
 0
 0
 0
 0
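
One caveat, shown as a sketch: with only the two methods above, a call where both arguments are `MyMissing` matches both definitions equally well, so Julia reports an ambiguity. The third method below is an assumed addition (it is not part of the original post) that resolves it.

```julia
struct MyMissing end
const mymissing = MyMissing()

# The two methods from the post above
Base.:<(a::MyMissing, b) = false
Base.:<(a, b::MyMissing) = false

# Assumed extra method: covers the case where both arguments are MyMissing
Base.:<(a::MyMissing, b::MyMissing) = false

both = mymissing < mymissing   # false, no ambiguity error
left = mymissing <= 5          # false
right = 5 > mymissing          # false
```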

Actually it is. See this section of the manual. Quoting from that section:

Logical (or boolean) operators |, & and xor are another special case since they only propagate missing values when it is logically required. For these operators, whether or not the result is uncertain, depends on the particular operation. This follows the well-established rules of three-valued logic which are implemented by e.g. NULL in SQL and NA in R.

In other words, the correct propagation of missing has already been defined for the logical operators |, &, xor, and !. Presumably we would also need to expand the truth tables to include newmissing, if it was introduced.
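
For reference, these are the cases where Base's logical operators do and do not propagate `missing`. The snippet only uses Base and is a quick check of the rule quoted above.

```julia
# The result is already determined by the known operand, so nothing propagates
and_false = false & missing     # false
or_true   = true | missing      # true

# The result depends on the unknown operand, so missing propagates
and_true  = true & missing      # missing
or_false  = false | missing     # missing
xor_val   = xor(true, missing)  # missing
not_val   = !missing            # missing
```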

I’m trying really, really hard to disabuse you of this notion. Julia was built to do exactly this. Fundamental to the design of Julia 1 is the ability to extend operators to new types. We do not need a Julia 2 to do this.

This is not required. You can simply require that the user choose one or the other, and never define any methods for mixing the two. The idea being that the type of missing is injected into the data early on (preferably when reading the file) and never changed.

I don’t think we need to define Base.:|(::Missing, ::NewMissing) etc. The answer isn’t obvious and people should specify what they mean when they’re mixing these together.
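
A minimal sketch of what that could look like, assuming a hypothetical `NewMissing` type: `|` is defined for `NewMissing` with `Bool` and with itself, mirroring the existing 3VL rules, and deliberately nothing is defined for `Missing` together with `NewMissing`.

```julia
# Hypothetical type, for illustration only
struct NewMissing end
const newmissing = NewMissing()

# Same rule as Base's missing: propagate only when the result is uncertain
Base.:|(::NewMissing, ::NewMissing) = newmissing
Base.:|(::NewMissing, b::Bool) = b ? true : newmissing
Base.:|(b::Bool, ::NewMissing) = b ? true : newmissing

certain   = true | newmissing    # true
uncertain = false | newmissing   # newmissing
```

Since no method here accepts both `Missing` and `NewMissing`, mixing the two is expected to fail with a `MethodError` until the user defines what it should mean.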

“The user should decide” is the current behavior, however. I don’t see how this solution improves things.

I’m saying the user should decide what should happen when combining Missing with NewMissing. But normally my data won’t have NewMissing and @adienes’s data won’t have Missing. It’s usually one or the other, so mixing them will be less common.

I think it is important not to pile too many features onto any of the proposals to keep them actionable. The main complaint has been that default behavior should be different, so in my view one first-order solution is a package which provides a type that is a simple replacement of Base.missing and has all the desired semantics. Basically, do in a package what you would have done in Base if you had been in charge before Julia v1. There is nothing special about things defined in Base.

It sounds like one of the main goals of newmissing would be to make mean(x) and sum(x) “just work”. But in order for these functions to “just work”, they would need special handling of newmissing. For example, a naive implementation of mean might look like this:

function mean(itr)
    sum = 0
    n = 0
    for x in itr
        if x !== newmissing
            sum += x
            n += 1
        end
    end
    sum/n
end

Thus, only some aggregation functions would “just work”. Other aggregation functions that do not have special handling for newmissing would not work. This is basically the same situation as defining smean, ssum, etc, but at least with smean we are not defining a new missing data type.
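
To make the point concrete, here is a runnable sketch. `NewMissing`, `newmissing`, and `naive_mean` are hypothetical names; the only claim is that a function with an explicit branch works while one without it does not.

```julia
# Hypothetical type, for illustration only
struct NewMissing end
const newmissing = NewMissing()

# Aggregation with special handling for newmissing
function naive_mean(itr)
    total = 0.0
    n = 0
    for x in itr
        if x !== newmissing
            total += x
            n += 1
        end
    end
    return total / n
end

data = [1.0, newmissing, 3.0]
handled = naive_mean(data)   # 2.0

# Base's sum has no such branch, so it fails on Float64 + NewMissing
sum_failed = try
    sum(data)
    false
catch err
    err isa MethodError
end
```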

I’m not so sure a new type will help much, because after all, CSV.jl or lag will fill empty values with missing in a way I cannot control, so I’d have to coalesce.(input, newmissing) everywhere anyway.
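
For reference, this is roughly what that conversion step looks like, again assuming a hypothetical `NewMissing` type and a made-up `input` vector:

```julia
# Hypothetical type, for illustration only
struct NewMissing end
const newmissing = NewMissing()

input = [1, missing, 3]                   # e.g. what a reader or lag might return

# coalesce returns its first non-missing argument, so missing becomes newmissing
converted = coalesce.(input, newmissing)
```

The result no longer contains `missing`, but the call has to be repeated at every boundary where another package produces `missing`.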

All those libraries are gonna have to grow a missingvalue argument.

That’s exactly one of the features I was thinking of. It leads me to think that the implementation of a new type is too complex. It’ll probably also break composability with other packages.

And the other possible solutions look more like patches, not actually reflecting a structural solution.

Is it me or does this not do anything like skipmissing? For one, it only replaces the comparison with a default false if a is missing; it’ll still propagate a missing b. For another, it’s a scalar operation, whereas skipmissing removes missing elements from collections.

Creating data indistinguishable from real data for missing a and treating missing b normally is too specific, it just doesn’t seem like something that should be handled by a type, macro, or derivative functions with such a blanket setting. This could instead be accomplished by processing a collection of a values (!ismissing).(A) before doing anything with b, perhaps even lazily by some value substitution wrapper.
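
A short sketch of both observations, reusing the `ishigheq` helper from the earlier post; the vector `A` is made up for illustration.

```julia
# Helper from the earlier post
ishigheq(a, b) = !ismissing(a) && a >= b

from_a = ishigheq(missing, 3)   # false: a missing `a` becomes a plain Bool
from_b = ishigheq(5, missing)   # missing: a missing `b` still propagates

# Processing the collection of `a` values up front instead
A = [5, missing, 1]
present = (!ismissing).(A)      # marks which entries of A are present
```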

Should be !ismissing(a) because abool == 0 inverts abool.
