Right, what should this do?
using Statistics            # for cor; lag here is assumed to shift by one, e.g. ShiftedArrays.lag

a = [1, 2, missing, 3, 4]
b = lag(a)                  # the shifted-in slot becomes missing
cor(a, b)                   # the missings propagate rather than being skipped
the behavior of missing_cor here: Why are missing values not ignored by default? - #106 by adienes
They differ because of the data types, not because of how they handle certain values. The latter is much more diverse.
dlakelan was implying that there isn’t just one way to handle missings in the middle of the data. Someone else would write a different function from you.
I’ve submitted the following pull request to make it easier to create Missing-like types.
@adienes or others who want this behavior, could you please tell what convenience issues would remain if some package (Missings, MissingStatistics, whatever) defined all common aggregation functions (Statistics + StatsBase) that skip missings automatically? So that the user just had to type scor/SM.cor/bikeshed instead of cor.
Seems like the best of both worlds: missing handling that is explicit, local, and without noticeable typing overhead.
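To make that concrete, here is a minimal sketch of what such a package could provide, using nothing beyond Statistics; the module and function names (MissingStatistics, scor, smean, …) are just placeholders from the post above, not an existing package:

module MissingStatistics   # placeholder name; not a real package

using Statistics

# Pairwise-complete correlation: drop the indices where either vector is missing.
function scor(x::AbstractVector, y::AbstractVector)
    keep = .!(ismissing.(x) .| ismissing.(y))
    cor(x[keep], y[keep])
end

# Aggregations that simply skip missings.
smean(x) = mean(skipmissing(x))
svar(x)  = var(skipmissing(x))
sstd(x)  = std(skipmissing(x))

export scor, smean, svar, sstd

end # module

With that, the call in the opening example would just be MissingStatistics.scor(a, b): the skipping stays explicit and local but is cheap to type.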
Suppose you’re measuring the intensity of light in particle physics experiments or something: normally it’s a small number determined by thermal noise, but every so often there’s a big flash… Sometimes that big flash can induce saturation and a failure to measure, and then a missing is emitted. Your two detectors give the following measurements:
a = [1, 2, 1, 3, missing, 3, 1, 4]
b = [2, 1, 3, 1, 856459, 1, 2, 1]
Missing just means something COMPLETELY different here from the financial or employment survey or whatever context.
R’s cor takes the approach of having a couple of keywords with a lot of settings. Not only do they handle missing values, a couple also specify whether you want to throw errors; it anticipates that some missing-value patterns are so horrendous in some use cases that a user would repeatedly type that keyword to catch bad data. Even then, these settings don’t process all patterns of missing values in every possible way, so you’ll have to do it separately anyway, which I would usually prefer over keyword settings.
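For reference, Julia already has something in this spirit: StatsBase’s pairwise takes a skipmissing keyword (:none, :pairwise, or :listwise), though, as with R’s use=, it doesn’t cover every pattern. A small illustration with made-up vectors:

using StatsBase, Statistics

x = [1, 2, missing, 3, 4]
y = [2, 1, 3, missing, 1]

# :pairwise drops, for each pair of vectors, only the rows where either entry is missing
pairwise(cor, [x, y]; skipmissing=:pairwise)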
I can understand and appreciate the value of this type of missing, but maybe the real complaint is that missing arises too easily? it seems all the arguments for the status quo treat it as an “unobserved observation.” but then CSV.jl, when reading a truncated line, will fill the rest with missing, which makes an assumption that those values should have been valid observations but were otherwise unobserved.
perhaps those should fill to nothing? and the same for lag and diagonal joins.
You could replace missing with nothing, i.e. in lag’s default, but I don’t see how it’ll help: we still have to handle the not-actual-data, and nothing throws errors instead of propagating, which seems harder to work with.
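To illustrate the propagation-versus-error difference, a quick REPL check:

julia> 1 + missing      # missing propagates through most operations
missing

julia> 1 + nothing      # nothing does not participate in arithmetic
ERROR: MethodError: no method matching +(::Int64, ::Nothing)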
The original blog post that introduced Julia’s missing explicitly recommends using missing for the “data scientist’s null” and nothing for the “software engineer’s null”: First-Class Statistical Missing Values Support in Julia 0.7. I think this is more or less what @adienes is suggesting.
This is why I say that the Bayesian viewpoint may give a really different view on how missing should be handled.
In my example I would want to estimate the actual flash intensity; I would likely do that by building a model, and in that model the non-missing data would enter into the estimate while the missing data would be imputed.
For the lagged example I’d probably impute the missing values by repetition or by generating a random number of a certain type.
Yeah, that’s what I mean by errors vs propagation. The labels aren’t very appropriate in this thread because we have data scientists who are very vocal about not wanting to propagate missings, well, some of them anyway. Whichever we use, the error-vs-propagation aspect of operations isn’t the end-all-be-all; we can evidently handle either in many, many ways before and after the operations. I’m not sure replacing missing-handling with nothing-handling, with more thrown errors, is the easy direction.
It’s why I’d really like to see an example real-world dataset, because I can’t think of a case where skipmissing is that big of a deal: either the missings are benign and a small fraction of the data, or I want to do imputation or modeling. There just isn’t much in between.
it only takes one…
This is a bit of a tangent, but is it common to add lagged columns to your dataframe for computing lagged correlations, and is this a main source of missings in practical data science? Again, I don’t do a lot of data analysis, but my instinct here would be to reach for signal processing functions from StatsBase:
julia> using StatsBase
julia> lags = [1, 3, 10];
julia> crosscor(df.x1, df.x2, lags)
3-element Vector{Float64}:
0.019144023607046545
-0.09897572568100832
-0.0324592679530158
I appreciate the desire to reduce boilerplate when dealing with actually missing data from incomplete survey responses, faulty sensors, et cetera, but for the structural missings from lags and joins, it seems like it would be better to avoid them altogether? Or at least use nothing instead of missing to emphasize their structural nature, and have a concise idiom built into DataFrames.jl for skipping over nothing in the obvious way in arbitrary multivariate aggregations.
Right, but if it’s a small number it’s often ignorable: you can just delete those observations from the dataset and move on; and if it’s a large number it needs modeling… It’s the intermediate case where there’s more than you’d be happy just dropping, and yet maybe you want some quick and dirty results without doing the modeling of proper imputation… I still think there’s probably a simple imputation method that would be worthwhile in most cases. Like, for each missing, select a random non-missing element from the vector, or linearly interpolate, or some such.
It’s part of why I’m interested in seeing some real-world data, because I’m wondering if there isn’t a solution that involves having a small set of simple imputation tools which could be applied to the data.
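For what it’s worth, here is a sketch of what two such quick-and-dirty tools could look like; impute_resample and impute_linear are made-up names for illustration, not from an existing package:

using Random

# Replace each missing with a value drawn at random from the observed values.
function impute_resample(x::AbstractVector; rng = Random.default_rng())
    observed = collect(skipmissing(x))
    [ismissing(v) ? rand(rng, observed) : v for v in x]
end

# Replace each missing by linear interpolation between the nearest observed
# neighbors; leading/trailing missings take the nearest observed value.
function impute_linear(x::AbstractVector)
    obs = findall(!ismissing, x)              # indices of observed entries
    out = Vector{Float64}(undef, length(x))
    for i in eachindex(x)
        if !ismissing(x[i])
            out[i] = x[i]
        else
            l = findlast(j -> j < i, obs)     # position in obs of previous observed index
            r = findfirst(j -> j > i, obs)    # position in obs of next observed index
            if l === nothing
                out[i] = x[obs[r]]
            elseif r === nothing
                out[i] = x[obs[l]]
            else
                lo, hi = obs[l], obs[r]
                w = (i - lo) / (hi - lo)
                out[i] = (1 - w) * x[lo] + w * x[hi]
            end
        end
    end
    out
end

impute_linear([1, 2, missing, 3, 4])   # [1.0, 2.0, 2.5, 3.0, 4.0]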
When it comes to the structural stuff like the lag example, I’m with @danielwe: use the right tool for the job instead of building it by hand, i.e. crosscor, autocor, and related functions, which are also probably way more space efficient and have other advantages compared to adding lagged columns to a data frame by hand.
I guess one question is how much of the trouble you have with missing comes from assumptions about the way to do things, carried over from ecosystems where different patterns are common.
I’m not trying to minimize your struggles, more like wrap my head around why I haven’t experienced similar struggles?
if it’s a large number it needs modeling
nope! not always
here is another very real example for me. I have said the phrase “diagonal join” a few times but maybe it’s not clear what I mean by that. in this situation, a table is used to flatten out updates corresponding to the same market order in a financial trading application.
in this dummy example, it may appear that I should just do a groupby(:order_id) and combine to get only one row for each, with order_px and fill_px. but in the general case, this is not really appropriate, since …
However, I still might want to understand things like …
which gets a lot more painful if I have to write various missing-wrangling things every 5 characters
(side wish: would absolutely love to have Over in DataFrames for situations like these: expression functors (in particular: `over`) · Issue #3377 · JuliaData/DataFrames.jl · GitHub)
julia> df
12×4 DataFrame
 Row │ order_id  status  order_px  fill_px
     │ Int64     Symbol  Int64?    Float64?
─────┼───────────────────────────────────────
   1 │        1  NEW          101  missing
   2 │        2  NEW          102  missing
   3 │        1  ACK      missing  missing
   4 │        3  NEW           98  missing
   5 │        4  NEW           97  missing
   6 │        2  ACK      missing  missing
   7 │        1  FILL     missing    101.1
   8 │        2  FILL     missing    101.9
   9 │        3  ACK      missing  missing
  10 │        4  ACK      missing  missing
  11 │        4  FILL     missing     98.2
  12 │        3  FILL     missing     96.8
The missings here could easily be just 0.0 (every NEW order would have zero fill, I suppose).
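For the dummy table above, one rough way to collapse to one row per order, using 0.0 for unfilled orders as just suggested, might be the following sketch. It assumes each order_id has exactly one non-missing order_px and, as noted earlier in the thread, it doesn’t generalize:

using DataFrames

flat = combine(groupby(df, :order_id),
    :order_px => (x -> first(skipmissing(x)))            => :order_px,
    :fill_px  => (x -> coalesce(skipmissing(x)..., 0.0)) => :fill_px)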
As a sidenote, it may be messy to handle this data, but handling a “log file” of a complex system such as a continuous-time double-auction service isn’t easy. Perhaps it will stay that way until enough infrastructure is built by people sweating the details and making good packages.
tell me about it
but it could be easier if I had a few tools and behaviors like I get in the Python dataframe libs
Seems like the root issue is there not being a distinction between “value unknown” and “no value”. Is there something concrete that DataFrames/Julia could do to provide that for you?