Right, what should this do?
using Statistics            # for cor; lag here is assumed to shift by one, e.g. ShiftedArrays.lag

a = [1, 2, missing, 3, 4]
b = lag(a)                  # the shifted-in slot becomes missing
cor(a, b)                   # the missings propagate rather than being skipped
the behavior of missing_cor here: Why are missing values not ignored by default? - #106 by adienes
They differ because of the data types, not because of how they handle certain values. The latter is much more diverse.
dlakelan was implying that there isn’t just one way to handle missings in the middle of the data. Someone else would write a different function from you.
I’ve submitted the following pull request to make it easier to create Missing-like types.
@adienes or others who want this behavior, could you please tell what convenience issues would remain if some package (Missings, MissingStatistics, whatever) defined all common aggregation functions (Statistics + StatsBase) that skip missings automatically? So that the user just had to type scor/SM.cor/bikeshed instead of cor.
Seems like the best of both worlds: missing handling that is explicit, local, and without noticeable typing overhead.
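To make that concrete, here is a minimal sketch of what such a package could provide, using nothing beyond Statistics; the module and function names (MissingStatistics, scor, smean, …) are just placeholders from the post above, not an existing package:

module MissingStatistics   # placeholder name; not a real package

using Statistics

# Pairwise-complete correlation: drop the indices where either vector is missing.
function scor(x::AbstractVector, y::AbstractVector)
    keep = .!(ismissing.(x) .| ismissing.(y))
    cor(x[keep], y[keep])
end

# Aggregations that simply skip missings.
smean(x) = mean(skipmissing(x))
svar(x)  = var(skipmissing(x))
sstd(x)  = std(skipmissing(x))

export scor, smean, svar, sstd

end # module

With that, the call in the opening example would just be MissingStatistics.scor(a, b): the skipping stays explicit and local but is cheap to type.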
Suppose you’re measuring the intensity of light in particle physics experiments or something: normally it’s a small number determined by thermal noise, but every so often there’s a big flash… Sometimes that big flash can induce saturation and a failure to measure, and then a missing is emitted. Your two detectors give the following measurements:
a = [1, 2, 1, 3, missing, 3, 1, 4]
b = [2, 1, 3, 1, 856459, 1, 2, 1]
Missing just means something COMPLETELY different here from the financial or employment survey or whatever context.
R’s cor takes the approach of having a couple of keywords with a lot of settings. Not only do they handle missing values, a couple also specify whether you want to throw errors; it anticipates that some missing-value patterns are so horrendous in some use cases that a user would repeatedly type that keyword to catch bad data. Even then, these settings don’t process all patterns of missing values in every possible way, so you’ll have to do it separately anyway, which I would usually prefer over keyword settings.
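For reference, Julia already has something in this spirit: StatsBase’s pairwise takes a skipmissing keyword (:none, :pairwise, or :listwise), though, as with R’s use=, it doesn’t cover every pattern. A small illustration with made-up vectors:

using StatsBase, Statistics

x = [1, 2, missing, 3, 4]
y = [2, 1, 3, missing, 1]

# :pairwise drops, for each pair of vectors, only the rows where either entry is missing
pairwise(cor, [x, y]; skipmissing=:pairwise)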
I can understand and appreciate the value of this type of missing, but maybe the real complaint is that missing arises too easily? it seems all the arguments for the status quo treat it as an “unobserved observation.” but then CSV.jl, when reading a truncated line, will fill the rest with missing, which makes an assumption that those values should have been valid observations but were otherwise unobserved.
perhaps those should fill to nothing? and the same for lag and diagonal joins.
You could replace missing with nothing, i.e. in lag’s default, but I don’t see how it’ll help: we still have to handle the not-actual-data, and nothing throws errors instead of propagating, which seems harder to work with.
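To illustrate the propagation-versus-error difference, a quick REPL check:

julia> 1 + missing      # missing propagates through most operations
missing

julia> 1 + nothing      # nothing does not participate in arithmetic
ERROR: MethodError: no method matching +(::Int64, ::Nothing)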
The original blog post that introduced Julia’s missing explicitly recommends using missing for the “data scientist’s null” and nothing for the “software engineer’s null”: First-Class Statistical Missing Values Support in Julia 0.7. I think this is more or less what @adienes is suggesting.
This is why I say that the Bayesian viewpoint may give a really different view on how missing should be handled.
In my example I would want to estimate the actual flash intensity; I would likely do that by building a model, and in that model the non-missing data would enter into the estimate while the missing data would be imputed.
For the lagged example I’d probably impute the missing values by repetition or by generating a random number of a certain type.
Yeah, that’s what I mean by errors vs propagation. The labels aren’t very appropriate in this thread because we have data scientists who are very vocal about not wanting to propagate missings, well, some of them anyway. Whichever we use, the error-vs-propagation aspect of operations isn’t the end-all-be-all; we can evidently handle either in many, many ways before and after the operations. I’m not sure replacing missing-handling with nothing-handling, with more thrown errors, is the easy direction.
It’s why I’d really like to see an example real-world dataset, because I can’t think of a case where skipmissing is that big of a deal: either the missings are benign and a small fraction of the data, or I want to do imputation or modeling. There just isn’t much in between.
it only takes one…
This is a bit of a tangent, but is it common to add lagged columns to your dataframe for computing lagged correlations, and is this a main source of missings in practical data science? Again, I don’t do a lot of data analysis, but my instinct here would be to reach for signal processing functions from StatsBase:
julia> using StatsBase
julia> lags = [1, 3, 10];
julia> crosscor(df.x1, df.x2, lags)
3-element Vector{Float64}:
0.019144023607046545
-0.09897572568100832
-0.0324592679530158
I appreciate the desire to reduce boilerplate when dealing with actually missing data from incomplete survey responses, faulty sensors, et cetera, but for the structural missings from lags and joins, it seems like it would be better to avoid them altogether? Or at least use nothing instead of missing to emphasize their structural nature, and have a concise idiom built into DataFrames.jl for skipping over nothing in the obvious way in arbitrary multivariate aggregations.
Right, but if it’s a small number it’s often ignorable: you can just delete those observations from the dataset and move on; and if it’s a large number it needs modeling… It’s the intermediate case where there’s more than you’d be happy just dropping, and yet maybe you want some quick and dirty results without doing the modeling of proper imputation… I still think there’s probably a simple imputation method that would be worthwhile in most cases. Like, for each missing, select a random non-missing element from the vector, or linearly interpolate, or some such.
It’s part of why I’m interested in seeing some real-world data, because I’m wondering if there isn’t a solution that involves having a small set of simple imputation tools which could be applied to the data.
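For what it’s worth, here is a sketch of what two such quick-and-dirty tools could look like; impute_resample and impute_linear are made-up names for illustration, not from an existing package:

using Random

# Replace each missing with a value drawn at random from the observed values.
function impute_resample(x::AbstractVector; rng = Random.default_rng())
    observed = collect(skipmissing(x))
    [ismissing(v) ? rand(rng, observed) : v for v in x]
end

# Replace each missing by linear interpolation between the nearest observed
# neighbors; leading/trailing missings take the nearest observed value.
function impute_linear(x::AbstractVector)
    obs = findall(!ismissing, x)              # indices of observed entries
    out = Vector{Float64}(undef, length(x))
    for i in eachindex(x)
        if !ismissing(x[i])
            out[i] = x[i]
        else
            l = findlast(j -> j < i, obs)     # position in obs of previous observed index
            r = findfirst(j -> j > i, obs)    # position in obs of next observed index
            if l === nothing
                out[i] = x[obs[r]]
            elseif r === nothing
                out[i] = x[obs[l]]
            else
                lo, hi = obs[l], obs[r]
                w = (i - lo) / (hi - lo)
                out[i] = (1 - w) * x[lo] + w * x[hi]
            end
        end
    end
    out
end

impute_linear([1, 2, missing, 3, 4])   # [1.0, 2.0, 2.5, 3.0, 4.0]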
When it comes to the structural stuff like the lag example, I’m with @danielwe: use the right tool for the job instead of building it by hand, i.e. crosscor, autocor, and related functions, which are also probably way more space efficient and have other advantages compared to adding lagged columns to a data frame by hand.
I guess one question is how much of the trouble you have with missing comes from assumptions about the way to do things, carried over from ecosystems where different patterns are common.
I’m not trying to minimize your struggles, more like wrap my head around why I haven’t experienced similar struggles?
if it’s a large number it needs modeling
nope! not always
here is another very real example for me. I have said the phrase “diagonal join” a few times but maybe it’s not clear what I mean by that. in this situation, a table is used to flatten out updates corresponding to the same market order in a financial trading application.
in this dummy example, it may appear that I should just do a groupby(:order_id) and combine to get only one row for each, with order_px and fill_px. but in the general case, this is not really appropriate, since …
However, I still might want to understand things like …
which gets a lot more painful if I have to write various missing-wrangling things every 5 characters
(side wish: would absolutely love to have Over in DataFrames for situations like these: expression functors (in particular: `over`) · Issue #3377 · JuliaData/DataFrames.jl · GitHub)
julia> df
12×4 DataFrame
 Row │ order_id  status  order_px  fill_px
     │ Int64     Symbol  Int64?    Float64?
─────┼───────────────────────────────────────
   1 │        1  NEW          101  missing
   2 │        2  NEW          102  missing
   3 │        1  ACK      missing  missing
   4 │        3  NEW           98  missing
   5 │        4  NEW           97  missing
   6 │        2  ACK      missing  missing
   7 │        1  FILL     missing    101.1
   8 │        2  FILL     missing    101.9
   9 │        3  ACK      missing  missing
  10 │        4  ACK      missing  missing
  11 │        4  FILL     missing     98.2
  12 │        3  FILL     missing     96.8
The missings here could easily be just 0.0 (every NEW order would have zero fill, I suppose).
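For the dummy table above, one rough way to collapse to one row per order, using 0.0 for unfilled orders as just suggested, might be the following sketch. It assumes each order_id has exactly one non-missing order_px and, as noted earlier in the thread, it doesn’t generalize:

using DataFrames

flat = combine(groupby(df, :order_id),
    :order_px => (x -> first(skipmissing(x)))            => :order_px,
    :fill_px  => (x -> coalesce(skipmissing(x)..., 0.0)) => :fill_px)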
As a sidenote, it may be messy to handle this data, but handling a “log file” of a complex system such as a continuous-time double-auction service isn’t easy. Perhaps it will stay that way until enough infrastructure is built by people sweating the details and making good packages.
tell me about it
but it could be easier if I had a few tools and behaviors like I get in the Python dataframe libs
Seems like the root issue is there not being a distinction between “value unknown” and “no value”. Is there something concrete that DataFrames/Julia could do to provide that for you?