Why are missing values not ignored by default?

Right, what should this do?

a = [1, 2, missing, 3, 4]
b = lag(a)
cor(a, b)
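For concreteness, here is a sketch of what the above does today and one explicit workaround, assuming lag comes from ShiftedArrays.jl (which pads the front with missing):

using Statistics
using ShiftedArrays: lag

a = [1, 2, missing, 3, 4]
b = lag(a)                  # [missing, 1, 2, missing, 3]

cor(a, b)                   # missing — the missings propagate through cor

# one explicit option: keep only the pairs where both values are observed
keep = .!(ismissing.(a) .| ismissing.(b))
cor(a[keep], b[keep])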
1 Like

“the behavior of missing_cor here” (Why are missing values not ignored by default? - #106 by adienes)

They differ because of the data types, not because of how they handle certain values. The latter is much more diverse.

dlakelan was implying that there isn’t just one way to handle middle missings; someone else would write a different function than you would.

1 Like

I’ve submitted the following pull request to make it easier to create Missing-like types.

2 Likes

@adienes or others who want this behavior, could you please tell us what convenience issues would remain if some package (Missings, MissingStatistics, whatever) defined all common aggregation functions (Statistics + StatsBase) so that they skip missings automatically? So that the user just had to type scor/SM.cor/bikeshed instead of cor.

Seems like the best of both worlds: missing handling that is explicit and local, yet without noticeable typing overhead.
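A rough sketch of what such a package could export (the names here are made up for illustration):

module MissingStatistics

using Statistics

smean(x) = mean(skipmissing(x))
sstd(x)  = std(skipmissing(x))

# for pairwise functions, drop the pairs where either value is missing
function scor(x, y)
    keep = .!(ismissing.(x) .| ismissing.(y))
    return cor(x[keep], y[keep])
end

end # module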

8 Likes

Suppose you’re measuring the intensity of light in a particle physics experiment or something. Normally it’s a small number determined by thermal noise, but every so often there’s a big flash… Sometimes that big flash can induce saturation and a failure to measure, and then a missing is emitted. Your two detectors give the following measurements:

a = [1, 2, 1, 3, missing, 3, 1, 4]
b = [2, 1, 3, 1, 856459, 1, 2, 1]

Missing just means something COMPLETELY different here from the financial or employment survey or whatever context.
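To make that concrete with the vectors above: automatically skipping the incomplete pair would also silently throw away the one event you actually care about (a sketch, assuming pairwise deletion):

using Statistics

keep = .!ismissing.(a)       # drop the observation where the detector saturated
cor(a[keep], b[keep])        # a tidy number, computed with the big flash in b discarded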

3 Likes

R’s cor takes the approach of having a couple of keyword arguments with many settings. Not only do they handle missing values, a couple of the settings specify whether you want to throw errors; the anticipation is that some missing-value patterns are so horrendous in some use cases that a user would repeatedly type that keyword to catch bad data. Even then, these settings don’t cover all patterns of missing values in every possible way, so you’ll have to handle some cases separately anyway, which I would usually prefer over keyword settings in the first place.
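For readers who don’t use R, here is a rough Julia-flavored sketch of what that keyword style amounts to (the wrapper and its keyword are invented for illustration, not an existing Statistics.jl API):

using Statistics

function cor_use(x, y; use::Symbol = :everything)
    if use === :everything
        return cor(x, y)                      # missings propagate into the result
    elseif use === :all_obs
        (any(ismissing, x) || any(ismissing, y)) && error("missing values present")
        return cor(x, y)
    elseif use === :complete_obs
        keep = .!(ismissing.(x) .| ismissing.(y))
        return cor(x[keep], y[keep])
    else
        throw(ArgumentError("unsupported use = $use"))
    end
end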

I can understand and appreciate the value of this type of missing

but maybe the real complaint is that missing arises too easily? it seems all the arguments for the status quo treat it as an “unobserved observation.” but then CSV.jl, when reading a truncated line, will fill the rest with missing, which assumes that those values should have been valid observations but were otherwise unobserved

perhaps those should fill with nothing? and the same for lag and diagonal joins

1 Like

You could replace missing with nothing, e.g. in lag’s default, but I don’t see how it would help: we still have to handle the not-actual-data, and nothing throws errors instead of propagating, which seems harder to work with.
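Concretely, as a quick illustration:

julia> missing + 1    # missing propagates through most operations
missing

julia> nothing + 1    # nothing does not, so you get an error instead
ERROR: MethodError: no method matching +(::Nothing, ::Int64)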

The original blog post that introduced Julia’s missing explicitly recommends using missing for “data scientist’s null” and nothing for “software engineer’s null”: First-Class Statistical Missing Values Support in Julia 0.7. I think this is more or less what @adienes is suggesting.

This is why I say that the Bayesian viewpoint may give a really different view on how missing should be handled.

In my example I would want to estimate the actual flash intensity; I would likely do that by building a model, and in that model the non-missing data would enter into the estimate while the missing data would be imputed.

For the lagged example I’d probably impute the missing values by repetition or by generating a random number of a certain type.
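For instance, “impute by repetition” could be a last-observation-carried-forward pass; a minimal sketch (not any particular package’s API):

function impute_locf(x)
    y = collect(x)
    # assumes at least one observed value; leading missings get the first observed value
    last_seen = y[findfirst(!ismissing, y)]
    for i in eachindex(y)
        if ismissing(y[i])
            y[i] = last_seen          # repeat the previous observation
        else
            last_seen = y[i]
        end
    end
    return identity.(y)               # narrow the eltype now that no missings remain
end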

Yeah, that’s what I mean by errors vs. propagation. The labels aren’t very appropriate in this thread because we have data scientists who are very vocal about not wanting to propagate missings, well, some of them anyway. Whichever we use, the error-vs-propagation behavior upon operations isn’t the end-all-be-all; we can evidently handle either in many, many ways before and after the operations. I’m not sure replacing missing-handling with nothing-handling, with more thrown errors, is the easier direction.

It’s why I’d really like to see an example real-world dataset, because I can’t think of an example where skipmissing is that big of a deal: either the missings are benign and a small fraction of the data, or I want to do imputation or modeling. There just isn’t much in between.

it only takes one…

1 Like

This is a bit of a tangent, but is it common to add lagged columns to your dataframe for computing lagged correlations, and is this a main source of missings in practical data science? Again, I don’t do a lot of data analysis, but my instinct here would be to reach for signal processing functions from StatsBase:

julia> using StatsBase

julia> lags = [1, 3, 10];

julia> crosscor(df.x1, df.x2, lags)
3-element Vector{Float64}:
  0.019144023607046545
 -0.09897572568100832
 -0.0324592679530158

I appreciate the desire to reduce boilerplate when dealing with actually missing data from incomplete survey responses, faulty sensors, et cetera, but for the structural missings from lags and joins, it seems like it would be better to avoid them altogether? Or at least use nothing instead of missing to emphasize their structural nature, and have a concise idiom built into DataFrames.jl for skipping over nothing in the obvious way in arbitrary multivariate aggregations.

6 Likes

Right, but if it’s a small number, often it’s ignorable: you can just delete those observations from the dataset and move on; and if it’s a large number, it needs modeling… It’s the intermediate case where there’s more than you’d be happy just dropping, and yet maybe you want some quick and dirty results without doing the modeling of proper imputation… I still think there’s probably a simple imputation method that would be worthwhile in most cases. Like, for each missing, select a random non-missing element from the vector, or linearly interpolate, or some such.

It’s part of why I’m interested in seeing some real-world data, because I’m wondering if there isn’t a solution that involves having a small set of simple imputation tools which could be applied to the data.
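For example, the “pick a random observed value” idea might look roughly like this (just a sketch, not a packaged tool):

using Random

function impute_random!(x; rng = Random.default_rng())
    observed = collect(skipmissing(x))
    isempty(observed) && return x          # nothing to draw from
    for i in eachindex(x)
        if ismissing(x[i])
            x[i] = rand(rng, observed)     # hot-deck style: copy a random observed value
        end
    end
    return x
end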

When it comes to the structural stuff like the lag example, I’m with @danielwe: I think it’s better to use the right tool for the job instead of building it by hand. crosscor, autocor, and related functions are also probably way more space efficient and have other advantages compared to manually adding lagged columns to a data frame.

I guess one question is how many of the issues you have with missing come from assumptions about the way to do things, carried over from ecosystems where different patterns are common.

I’m not trying to minimize your struggles; I’m more trying to wrap my head around why I haven’t experienced similar struggles.

1 Like

if it’s a large number it needs modeling

nope! not always

here is another very real example for me. I have used the term “diagonal join” a few times, but maybe it’s not clear what I mean by that. in this situation, a table is used to flatten out updates corresponding to the same market order in a financial trading application.

in this dummy example, it may appear that I should just do a groupby(:order_id) and combine to get only one row per order, with order_px and fill_px. but in the general case, this is not really appropriate, since

  • a single order may have a large number of independent fills
  • the relative arrival order of these updates matters quite a lot and almost all the analysis will care about rolling statistics over each update sequentially

However, I still might want to understand things like

  • what is my mean fill rate per order
  • what is my mean slippage per order
  • etc…

which gets a lot more painful if I have to write various missing-wrangling things every 5 characters
(side wish: I would absolutely love to have Over in DataFrames for situations like these; see expression functors (in particular: `over`) · Issue #3377 · JuliaData/DataFrames.jl · GitHub)

julia> df
12×4 DataFrame
 Row │ order_id  status  order_px  fill_px   
     │ Int64     Symbol  Int64?    Float64?  
─────┼───────────────────────────────────────
   1 │        1  NEW          101  missing   
   2 │        2  NEW          102  missing   
   3 │        1  ACK      missing  missing   
   4 │        3  NEW           98  missing   
   5 │        4  NEW           97  missing   
   6 │        2  ACK      missing  missing   
   7 │        1  FILL     missing      101.1
   8 │        2  FILL     missing      101.9
   9 │        3  ACK      missing  missing   
  10 │        4  ACK      missing  missing   
  11 │        4  FILL     missing       98.2
  12 │        3  FILL     missing       96.8
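For a concrete idea of the wrangling involved, one way to get per-order summaries out of this table might look like the following (an illustrative sketch only; “slippage” is simplified here to mean fill price minus order price):

using DataFrames, Statistics

per_order = combine(groupby(df, :order_id),
    # every order in the example has at least one fill and exactly one non-missing order_px
    :fill_px  => (x -> mean(skipmissing(x)))  => :mean_fill_px,
    :order_px => (x -> first(skipmissing(x))) => :order_px)

per_order.slippage = per_order.mean_fill_px .- per_order.order_px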

The missings here could easily be just 0.0 (every NEW order would have zero fill, I suppose).
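Concretely, that replacement is a one-liner:

df.fill_px = coalesce.(df.fill_px, 0.0)   # treat every missing fill price as zero fill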
As a sidenote, it may be messy to handle this data, but handling a “log file” of a complex system such as a continuous-time double-auction service isn’t easy, and perhaps it will stay that way until enough infrastructure is built by people sweating the details and making good packages.

tell me about it 🙂

but it could be easier if I had a few tools and behaviors like the ones I get in the Python dataframe libs

Seems like the root issue is that there’s no distinction between “value unknown” and “no value”. Is there something concrete that DataFrames/Julia could do to provide that for you?