I so agree with this. If a missing value carries information, then it should be imputed. If it doesn't carry information, it should usually be removed from the dataset before analysis.
The case for frequent use of skipmissing is pretty weak imho. The main cases shown here have been around poorly organized data, where the tabular form needs something in a slot because the table isn't in any kind of normalized form (in the E. F. Codd sense; see the Wikipedia article on database normalization).
In the example at that wiki page, professors fit into a table that includes the name of the course they're teaching. If they're not teaching this semester, either you need NULL or you need to delete them from the table and hence lose track of the fact that they exist. This is often how the kinds of "skipmissing" we've been discussing arise. The table has missing for "course taught" because it's "not applicable", but the real problem is that "course taught" has no business being a column in this table.
Instead, there should be two tables, one for a list of all the faculty with a faculty ID and name and hire date… and one for current teachers, with faculty ID and course code and semester… Two tables, no “missing” values appear.
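As a minimal sketch of that two-table layout (the column names and rows here are hypothetical, just to make the point concrete):

```python
# Table 1: every faculty member, whether or not they teach this semester.
faculty = [
    {"faculty_id": 1, "name": "Smith", "hire_date": "2010-08-15"},
    {"faculty_id": 2, "name": "Jones", "hire_date": "2015-01-10"},  # not teaching now
]

# Table 2: only current teaching assignments -- no NULL slot needed for Jones,
# because Jones simply has no row here.
teaching = [
    {"faculty_id": 1, "course_code": "CS101", "semester": "2024F"},
]

# Neither table contains a missing value.
assert all(None not in row.values() for row in faculty + teaching)
```

Jones still exists in the faculty table; the fact that he isn't teaching is represented by the absence of a row, not by a missing value.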
Consider an example from above. (Yes, I know that real-world data would be even more complicated, that you might well receive this data built by someone else and have no control over its form, and that adienes is a professional who probably knows a lot about the topics I'm about to discuss; others maybe don't, though.)
In this dataset, order acknowledgements simply don't have prices associated with them, so the last two columns are set to missing (row 3, for example). Also, NEW events only have "order_px" while FILL events only have "fill_px"…
This is just a disaster that makes E. F. Codd roll in his grave, unfortunately. Fortunately, some 50 years ago he showed us how to handle exactly this sort of thing…
The normalized way to handle this data is to have a table of events, and a table of event prices…
- The first table would have event_id, order_id, and status (event_id plays the role that Row is playing).
- The second table would have event_id and price.
If you receive this messy table above from some data vendor who has never studied database theory, the first task is to normalize the data… So you split it up into the two tables…
Now you have ZERO missing values in your dataset. NONE at all…
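The split itself is mechanical. A hedged sketch, with assumed column names and made-up rows standing in for the vendor feed:

```python
# Denormalized vendor feed: ACK events have no price, so the slot is None.
messy = [
    {"event_id": 1, "order_id": "A", "status": "NEW",  "price": 10.5},
    {"event_id": 2, "order_id": "A", "status": "FILL", "price": 10.4},
    {"event_id": 3, "order_id": "A", "status": "ACK",  "price": None},
]

# Normalize: one table of events, one table of event prices.
events = [{"event_id": r["event_id"], "order_id": r["order_id"], "status": r["status"]}
          for r in messy]
prices = [{"event_id": r["event_id"], "price": r["price"]}
          for r in messy if r["price"] is not None]

# After the split, no missing values remain in either table.
assert all(None not in row.values() for row in events + prices)
```

The ACK event keeps its row in the event table; it just has no row in the price table, which is exactly what "not applicable" should look like.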
Next you may want to do something like for each order_id show all the FILL values associated…
```sql
select a.order_id, a.event_id, a.status, b.price
from eventtable a
join pricetable b on a.event_id = b.event_id
where a.status = 'FILL'
order by a.order_id, a.event_id;
```
Now you have a history of all the fills for every order, and there are NO missing values at all.
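You can try this end to end with sqlite3 from the Python standard library (the table contents below are hypothetical, matching the two-table layout above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table eventtable (event_id integer, order_id text, status text);
    create table pricetable (event_id integer, price real);
""")
con.executemany("insert into eventtable values (?,?,?)",
                [(1, "A", "NEW"), (2, "A", "FILL"), (3, "A", "ACK"), (4, "B", "FILL")])
# Event 3 (the ACK) simply has no row in the price table -- no NULL needed.
con.executemany("insert into pricetable values (?,?)",
                [(1, 10.5), (2, 10.4), (4, 20.0)])

rows = con.execute("""
    select a.order_id, a.event_id, a.status, b.price
    from eventtable a join pricetable b on a.event_id = b.event_id
    where a.status = 'FILL'
    order by a.order_id, a.event_id
""").fetchall()
print(rows)  # -> [('A', 2, 'FILL', 10.4), ('B', 4, 'FILL', 20.0)]
```

The inner join only matches events that actually have prices, so no NULLs can appear in the result.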
I think a major reason I almost never use skipmissing is that I often normalize poorly organized data as one of the first steps in my analyses.
Any remaining missing data is often missing-but-meaningful, and can then be imputed, or ignored if it represents a negligible fraction of the data.
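For the missing-but-meaningful leftovers, even the simplest imputation is easy to sketch. This is just mean imputation on made-up numbers, not a recommendation of which method to use:

```python
# Hypothetical column with a couple of genuinely missing observations.
values = [4.2, None, 3.8, None, 4.0]

observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)

# Replace each missing slot with the mean of the observed values.
imputed = [v if v is not None else mean for v in values]
print(imputed)  # -> [4.2, 4.0, 3.8, 4.0, 4.0]
```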
My impression is that this "old school" database theory isn't commonly taught these days, and that there's been a huge rush to create "NoSQL" databases while ignoring all the really smart stuff that Codd and others figured out 50 years ago.
Now excuse me while I go yell at some kids to get off my lawn…