Why are missing values not ignored by default?

I very much agree with this. If a missing value carries information, then it should be imputed. If it doesn’t carry information, it should usually be removed from that specific dataset before analysis.

The case for frequent use of skipmissing is pretty weak, imho. The main cases shown here have involved poorly organized data, where the tabular form needs to have something in a slot because the table isn’t in any kind of normalized form (in the E. F. Codd sense: Database normalization - Wikipedia).

[image: example table from the Wikipedia article on database normalization]

In the example at that wiki page, professors fit into a table that includes the name of the course they’re teaching. If they’re not teaching this semester, either you need NULL or you have to delete them from the table and thereby lose track of the fact that they exist. This is often how the kinds of “skipmissing” situations we’ve been discussing arise: the table has missing for the course taught because it’s “not applicable,” but the real problem is that the course taught has no business being a column in this table.

Instead, there should be two tables: one listing all the faculty, with a faculty ID, name, and hire date… and one for current teachers, with faculty ID, course code, and semester… Two tables, and no “missing” values appear. A sketch of that layout is below.
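Here is a minimal sketch of that two-table layout in DataFrames.jl (all names and values are invented for illustration):

using DataFrames

# One table per entity: a professor's existence is independent of teaching duty.
faculty = DataFrame(faculty_id = [1, 2, 3],
                    name       = ["Ada", "Grace", "Edsger"],
                    hire_date  = ["2001-09-01", "2010-01-15", "1998-08-20"])

teaching = DataFrame(faculty_id = [1, 3],   # professor 2 teaches nothing this term
                     course     = ["CS101", "CS301"],
                     semester   = ["2024S", "2024S"])

# Neither table contains a single missing: a professor with no course
# simply has no row in `teaching`. Missing reappears only if you
# explicitly ask for the denormalized view:
leftjoin(faculty, teaching, on = :faculty_id)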

Consider an example from above. (Yes, I know that real-world data would be even more complicated, that quite possibly you’d receive this data built by someone else and have no control over its form, and that adienes is a professional who probably does know a lot about the topics I am about to discuss; nevertheless, others maybe don’t.)

In this dataset, order acknowledgements simply don’t have prices associated with them, so the last two columns are set to missing (row 3, for example). Also, NEW events only have order_px, but FILL events only have fill_px…

This is just a disaster that makes E. F. Codd roll in his grave, unfortunately. Fortunately, some 50 years ago he showed us how to handle this sort of thing…

The normalized way to handle this data is to have a table of events, and a table of event prices…

  • The first table would have event_id, order_id, and status. (event_id plays the role that Row is playing.)
  • The second table would have event_id and price.

If you receive this messy table from some data vendor who has never studied database theory, the first task is to normalize the data… so you split it up into the two tables, something like this:
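(A minimal DataFrames.jl sketch of that split; the table contents are invented to match the description above.)

using DataFrames

# hypothetical messy vendor table: acks carry no price at all,
# NEW rows only have order_px, FILL rows only have fill_px
raw = DataFrame(event_id = 1:4,
                order_id = [10, 10, 10, 11],
                status   = ["NEW", "FILL", "ACK", "NEW"],
                order_px = [99.5, missing, missing, 101.0],
                fill_px  = [missing, 99.7, missing, missing])

# normalize: one table of events, one table of event prices
events = select(raw, :event_id, :order_id, :status)
prices = dropmissing(
    DataFrame(event_id = raw.event_id,
              price    = coalesce.(raw.order_px, raw.fill_px)))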

Now you have ZERO missing values in your dataset. NONE at all…

Next you may want to do something like: for each order_id, show all the FILL values associated with it…

select a.order_id, a.event_id, a.status, b.price
from eventtable a
join pricetable b on a.event_id = b.event_id
where a.status = 'FILL'
order by a.order_id, a.event_id;

Now you have a history of all the fills for every order, and there are NO missing values at all.
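For reference, a rough DataFrames.jl equivalent of that query, reusing the hypothetical events and prices tables from the sketch above:

using DataFrames

# join events to their prices, keep only fills, order by order then event
fills = innerjoin(events, prices, on = :event_id)
fills = sort(subset(fills, :status => ByRow(==("FILL"))), [:order_id, :event_id])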

I think a major reason I almost never use skipmissing is that I often normalize poorly organized data as one of the first steps in my analyses.

Any remaining missing data is often missing but meaningful and can then be imputed, or ignored if it represents an ignorable fraction of the data.

My impression is that this “old school” database theory is not commonly taught these days, and that there has been a huge rush to create “NoSQL” databases and ignore all the really smart stuff that Codd and others figured out 50 years ago.

Now excuse me while I go yell at some kids to get off my lawn… :sweat_smile:

11 Likes

that’s a lot of gatekeeping… but this design works perfectly fine for me with the missing semantics and tools in polars. your solution may be technically “cleaner” but would require a whole lot more joins all over the place when I need to do analysis relying on columns from both tables (which is basically always)

and it still doesn’t address the case when missing arises from asof joins or lag
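a minimal sketch of what I mean (ShiftedArrays.jl for the lag case; even a fully missing-free column picks one up):

using ShiftedArrays: lag

x = [1.0, 3.0, 2.5]   # no missing values anywhere
lag(x)                # missing, 1.0, 3.0  -- the first row has no predecessor
x .- lag(x)           # missing, 2.0, -0.5 -- the diffs start with a missing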

this is exactly the “you’re holding it wrong” attitude that the Julia community gets flak for from the public.

2 Likes

Looked at from my perspective, there is about 50 years of PhD-level math and computer science theory around how to organize tabular data that specifically addresses these types of issues. Julia chose to align its organizational principles with that theory, and I’m really happy about it.

Rather than attempting to keep that information esoteric and available to the few, I chose to link to it and describe how it works; that’s the opposite of gatekeeping.

You’re free to ignore that pile of theory, but I think the Julia community is right to emphasize it in its design choices. The missing semantics we have are designed primarily for “missing unknown,” not for “missing not applicable,” because there is almost always a way to organize your data analysis task so that “missing not applicable” doesn’t occur, and Codd showed us how that works in the 1970s.

Julia is a community with a very high level of theoretical knowledge in many fields. That the ecosystem reflects this is a strength in my opinion.

5 Likes

Theory is great. I like and appreciate theory. In undergrad I took almost exclusively theory courses.

In this case, if I use the database “theory” you are suggesting, my code will probably get 50%+ more verbose due to the need to sprinkle .join(new_order_table, on="order_id") on every single line. And more verbosity is antithetical to this thread, which is about making exploratory and interactive analysis more ergonomic.

I am not looking to find the platonic design for data analysis; I’m looking to do less typing.

1 Like

Maybe the needed ergonomics are at that join level? I honestly don’t know, but I think that question is worth exploring.

1 Like

I can totally understand your frustration as well as the desire for defaults fitting your use case in interactive exploration. Yet somewhat different defaults have been chosen in Julia, as well as in R, as shown by @pleiby. Different reasons explaining those defaults have also been given, which might or might not convince you.

On the other hand, I also agree with @dlakelan that having to handle missings all the time is a design issue and could (should) be fixed on a more fundamental level. As a start, any time you find yourself repeating the same pattern, stop and try to abstract it, either by defining a function or by rethinking your data layout. Imho, DRY is one of the most important principles of software design, with many benefits beyond saving typing. Simple abstractions for repeated patterns quickly turn into a small library expressing the needs of the domain your program is designed to handle, as in the sketch below. Julia is great in this respect, with low-overhead functions, user-defined types, and extensible interfaces via multiple dispatch. In particular, when you have to type the same thing hundreds of times, you are already way past the borderline between interactive exploration and proper domain abstractions. (I have made it a personal habit to refactor and extract an abstraction the first time I use copy-and-paste, i.e., I would repeat the same pattern just twice!)
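For instance, a minimal sketch of that kind of abstraction (all helper names are invented):

using Statistics

# wrap any reducer so it ignores missing values, once and for all
overskipped(f) = x -> f(skipmissing(x))

smean = overskipped(mean)
ssum  = overskipped(sum)
smax  = overskipped(maximum)

smean([1, 2, missing, 4])   # 2.333...
smax([1, missing, 5])       # 5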

Let’s try to avoid a pile-on here. I feel that we’ve gone around in enough circles in this thread already. Perhaps a moderator can close or put a time limit on this thread. If we have more specific issues to discuss, we can of course open a new thread.

9 Likes

I certainly agree with you that piling on must be avoided. And I totally respect @adienes’s position, although it is different from mine.

At the same time, I find it important that a plurality of voices be expressed here. @adienes’s position, like everyone else’s, is a personal one, not a representative position for data scientists.

I prefer more verbose, explicit, readable code over defaults that hide important features of my data (such as the presence of missings), even when this forces me to write a few more characters.

3 Likes

@adienes Can you give a list of the functions you want skipmissing versions of? Or is it like “all vector -> scalar functions” so you can’t give a list?

especially given that you can just edit your startup.jl config to add const sm = skipmissing and never ever worry about it again (until you change your machine or something)
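for example, the whole addition could be (the alias name is of course arbitrary):

# in ~/.julia/config/startup.jl
const sm = skipmissing

# then, in any interactive session:
# mean(sm(x)), sum(sm(x)), extrema(sm(x)), ...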

I read above that there might not be enough data scientists’ opinions represented here. I am a data scientist who has dealt with my fair share of missing values, and here is what I think:

  • Missing values should not be treated as numbers by default. (I’m not sure there is disagreement about this at this point.)

  • I don’t want convenience at the cost of safety. Ideally, I want to get a warning every time I mix numbers and missings. I don’t like that you can do < or > comparisons with missing values in Stata, or that pandas turns anything that looks like NA or N/A into the equivalent of a NaN by default (thank you, DataFrames.jl, for not doing that). Having to deal with missings and NaNs explicitly, and to learn how numbers are represented in a computer’s memory, has made me a better scientist and a better programmer.

  • That said, convenience is nice. Yes, it would be nice not to have to scatter your code with isfinite(x) ? something : something_else or skipmissing. If you have made an informed decision that you want some overarching treatment of missing values (as in the OP’s case), it would be nice to have a simple way to do it. Maybe it’s a package, or maybe it’s syntax (though maybe not the ‘?’ character); see the sketch below.
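For what it’s worth, a minimal sketch of what already exists in that direction, combining Base with Missings.jl (not claiming this settles the ergonomics question):

using Missings, Statistics

x = [1.0, missing, 3.0]

mean(skipmissing(x))         # 2.0 -- drop missings for one call
coalesce.(x, 0.0)            # [1.0, 0.0, 3.0] -- replace missings elementwise
passmissing(sqrt)(missing)   # missing -- lift a function to propagate missing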

I also like that I can come to one thread and read about everything from different algorithms for calculating means to partial versus total ordering. I think two pull requests came out of this thread, if I’m not mistaken. And people invested time in writing example code to understand the problem and offer solutions. This is the community that was thoughtful enough to come up with a Missing type; I’m sure it can come up with a solution for making working with it more ergonomic, if that’s what we really want.

13 Likes