Why are missing values not ignored by default?

SQL has the same issue, since it also has only one type of NULL. The “value unknown” and “no value” that you are referring to are sometimes called “Missing But Applicable” and “Missing But Inapplicable”. Unfortunately, if you have two separate missing types for “Missing But Applicable” and “Missing But Inapplicable”, you end up with four-valued logic. Quoting from Wikipedia:

Codd indicated in his 1990 book The Relational Model for Database Management, Version 2 that the single Null mandated by the SQL standard was inadequate, and should be replaced by two separate Null-type markers to indicate the reason why data is missing. In Codd’s book, these two Null-type markers are referred to as ‘A-Values’ and ‘I-Values’, representing ‘Missing But Applicable’ and ‘Missing But Inapplicable’, respectively. Codd’s recommendation would have required SQL’s logic system be expanded to accommodate a four-valued logic system.

3 Likes
import Statistics: mean

# Add a skip-missing method for arrays that allow missing
# (note: this is type piracy on Statistics.mean)
mean(x::AbstractArray{Union{Missing,T}}) where {T} = mean(skipmissing(x))

aa = [1, missing, 3, 4, 5]
mean(aa)
3.25

How difficult is this?!

This makes it hard to read and maintain a codebase. The reader might not realize that mean in this context means something different than mean in some other context. That’s why we are searching for a behavior that is understood by everyone.

I think this is exactly the reason I haven’t had as much difficulty as @adienes: in many of the cases where I encounter missing, it’s really missing but applicable.

When I have inapplicable data I tend to restructure it. For example, subset just the NEW and FILL order types and coalesce the two price columns into a single price column. Voilà, everything is applicable!
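
Roughly like this, as a minimal DataFrames.jl sketch (the column names :order_type, :new_px, :fill_px and the toy data are just made up for illustration):

    using DataFrames

    # toy data: two event types, each with its own price column
    df = DataFrame(order_type = ["NEW", "FILL", "CANCEL", "NEW"],
                   new_px     = [10.0, missing, missing, 11.0],
                   fill_px    = [missing, 10.5, missing, missing])

    # keep only rows where a price is applicable, then merge the price columns
    prices = subset(df, :order_type => ByRow(in(("NEW", "FILL"))))
    prices.px = coalesce.(prices.new_px, prices.fill_px)
    select!(prices, :order_type, :px)   # every remaining entry is applicable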

I would encourage you to humor me and imagine that my actual, real data is a bit more complicated than this

e.g. orders may fill many times, a variable number of times. I do not want to have fill_px_1, fill_px_2, … fill_px_i all as separate columns

or there may be other features in the data which are applicable to some order types and not others. or maybe one symbol is being traded passively and the other aggressively, and I want to analyze both in the same table, etc. etc.

1 Like

I want to reiterate what @CameronBieganek is saying. A new missing type would be a bad idea.

  1. Four-valued logic and handling interactions between different missings lead to a combinatorial increase in methods (see the sketch after this list).
  2. Package authors now have to choose which of the two missing types to support, which means convincing everyone what the exact right use of each missing is in package code.
  3. Even if we had an AbstractMissing type, and package authors wrote code against AbstractMissing, we would still be back to square one, replacing this discussion about missing with a discussion about AbstractMissing.
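
To make point 1 concrete, here is a rough sketch using hypothetical MissingApplicable / MissingInapplicable singletons (these are not real Julia types, just an illustration of how the pairwise cases multiply for even a single operator):

    # hypothetical singleton types: purely illustrative, not part of Julia
    struct MissingApplicable end
    struct MissingInapplicable end
    const ma = MissingApplicable()
    const mi = MissingInapplicable()

    # even one operator needs a rule for every pairing, and every package
    # would have to agree on all of them
    Base.:+(::MissingApplicable, ::Number)                = ma
    Base.:+(::Number, ::MissingApplicable)                = ma
    Base.:+(::MissingInapplicable, ::Number)              = mi
    Base.:+(::Number, ::MissingInapplicable)              = mi
    Base.:+(::MissingApplicable, ::MissingInapplicable)   = mi   # or ma? someone has to decide
    Base.:+(::MissingInapplicable, ::MissingApplicable)   = mi
    Base.:+(::MissingApplicable, ::MissingApplicable)     = ma
    Base.:+(::MissingInapplicable, ::MissingInapplicable) = mi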

As we’ve seen from this thread, people are intransigent in how they perceive and want to handle missing data, and have tons of different opinions on whether missings should propagate or be skipped. Having a single missing type has saved us from large coordination failures where everyone introduces their own missing type with their own semantics.

Coordination on behavior that we don’t all agree on is preferable to a lack of coordination, where each package author has missings behaving the way they personally want.

10 Likes

I’m not claiming anything about your situation, I’m just discussing how I tend to address those kinds of situations.

I did actually start my career in financial data analysis 20 years ago. The kind of situation you’re describing is familiar to me.

Even if your orders were filled multiple times, I would be doing the same thing…

OrderID, time, orderType, Price

And order by orderID and time… And then essentially have a whole bunch of time series, where every entry is applicable.
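
In DataFrames.jl terms, something like this minimal sketch (column names and values are placeholders, not your actual schema):

    using DataFrames

    # one row per order event, in long format: no fill_px_1, fill_px_2, ... columns
    events = DataFrame(order_id   = [1, 1, 1, 2, 2],
                       time       = [1, 2, 3, 1, 2],
                       order_type = ["NEW", "FILL", "FILL", "NEW", "FILL"],
                       price      = [10.0, 10.1, 10.2, 20.0, 19.9])

    sort!(events, [:order_id, :time])        # one time series per order
    per_order = groupby(events, :order_id)   # every entry within a group is applicable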

In other words, whether or not you agree with what I do, my own preference is to analyze data without any non-applicable entries and without any explosion of extra columns.

3 Likes

How and which non-data is handled seems to vary too much by use case to distribute packages for all of them; there would just be too many niche ones. Again and again we see that different missings may be handled differently, even within the same program, and handling that by expanding non-data into more types is problematic because there isn’t a fixed set of non-data behaviors to standardize, beyond operations throwing errors (nothing) or propagating (missing) by default. It just seems like, in a topic where there is evidently no consensus on most things, each organization should build and document its own standard for processing data in its own isolated context. That can still be distributed in packages; it’s just that rather than dozens of variants of OnlyRightWaytoDataScience.jl, it’d be specific contexts like CanadianHotDogReviews.jl.

1 Like

Apologies for adding nothing to the discussion, but “Canadian Hot Dog Reviews” sounds like a blog I’d really enjoy reading.

Meanwhile, on GitHub…

https://github.com/JuliaLang/julia/pull/44407 is planned to merge in a few days.

7 Likes

I think we can see a pattern here, which is actually healthy. It is similar to the pattern seen in the Vega project from the University of Washington:

  1. A well thought-out and comprehensive framework is developed for a certain workload. (original Vega, or Julia Stats ecosystem in this case).
  2. As it matures, and gathers users, there is a need to get a lot of useful defaults baked in for ease-of-use. (VegaLite project, or the upcoming CanadianHotDogReviews.jl or QuickStats.jl package in this case).
  3. The ease-of-use project gathers a big user base (often eclipsing the original project).
  4. An attempt to make life even easier by automating the whole workload (Voyager project in Vega, or a future AutoStatistics.jl).

A similar pattern can also be seen in Makie - AlgebraOfGraphics - TidierPlots. Overall, this is a healthy development, and if the whole front (from the base framework to the automatic workload package) is maintained, everybody wins.

3 Likes

It is not false, according to classical logic at least. Being false would imply that the missing value is greater than or equal to 100, which is clearly not the case.

according to classical logic at least

not sure where you heard this, but I think “classical logic” makes no such demands on the specifications of < for an object like missing

basically < is only a partial order over Union{Number, Missing} which means that !(a ≤ b) does not have to imply that b ≤ a. note that this distinction is also exactly the distinction between partial orders and total orders
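
for what it’s worth, the difference is easy to see at the REPL: < propagates missing (so the comparison is neither true nor false), while isless is the total order Julia actually sorts by, which deliberately puts missing last

    julia> missing < 100
    missing

    julia> missing >= 100
    missing

    julia> isless(missing, 100)
    false

    julia> isless(100, missing)
    true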

but anyway from a pragmatic point of view, I agree it makes the most sense to have operators propagate missing. I would just really, really, really appreciate it if this were recognized as a design CHOICE and not something that must be done or is somehow more objectively correct

2 Likes

Just thought of something: What about creating separate MCAR, MAR, and MNAR objects, and only ignoring MCAR by default when computing statistics?

Mmm, not to be dismissive at all, but:

If A is false, then not A is true.

You can find that in any logic textbook.

If x < 100 is false, then !(x < 100) must be true. And !(x < 100) is x >= 100.

I don’t think this is controversial.

We can introduce a non-standard logic for missing, but it would be non-standard (and require a lot of care in handling).
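
Incidentally, Julia already implements such a non-standard, three-valued logic for missing in the non-short-circuiting boolean operators & and |:

    julia> true & missing    # unknown: the result depends on the missing value
    missing

    julia> false & missing   # known: false regardless of the missing value
    false

    julia> true | missing    # known: true regardless of the missing value
    true

    julia> false | missing
    missing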

2 Likes

And !(x < 100) is x >= 100.

this part is controversial, yes. that is only true if < and >= form a total order. which crucially, missing does not

That is correct. I thought we wanted to at least preserve the strictness of relations such as < (even if not its totality). Thinking about having both < and <= being non-strict breaks my brain.

2 Likes

Could you expand your abbreviations, please? I do not know what you are talking about.

MCAR = Missing Completely at Random
MAR = Missing at Random
MNAR = Missing Not at Random

Terms from statistics. Definitely important when trying to get meaningful unbiased results with missing values.

3 Likes

I don’t get the exact requirements of your use case (maybe you want to share a small example analysis later on), but Julia is very flexible and gives you a lot of options besides porting the workflow you are used to (which seems to rely on different defaults, though):

  1. Functional combinators can be handy in some cases:

    using DataFrames, Statistics

    # skip missings per group when aggregating
    combine(groupby(df, :order_id), :fill_px => mean∘skipmissing)
    # pairwise deletion for two-column statistics such as cor
    nomiss(x, y) = let mask = @. !ismissing(x) && !ismissing(y)
                       (x[mask], y[mask])
                   end
    combine(some_df, [:x, :y] => splat(cor)∘nomiss)   # splat is exported since Julia 1.9
    

    Depending on your preferences, this might still be too verbose, though.

  2. Domain modelling via custom data types, e.g., in the order example it might make sense to define suitable data structures and convert the data:

    struct FillInfo
        ...
    end
    struct Order
        fill::Vector{FillInfo}
        ...
    end
    # Convert the data frame and handle missing on order creation
    orders::Vector{Order} = [Order(collect(skipmissing(g.fill_px)), ...) for g in groupby(df, :order_id)]
    

    In contrast to R or Python, working with custom data types is a real alternative in Julia, as it is both convenient and efficient.

4 Likes