Dropping missing values in dataframes

StatisticalMouse · October 25, 2020, 8:00am

The dataframe column’s describe shows:
5 age 29.6991 0.42 28.0 80.0 177 Union{Missing, Float64}

I can get rid of the missing values with this:
replace!(train_df.age, missing => median(skipmissing(train_df.age)));

However, the type is still not correct:
5 age 29.3616 0.42 28.0 80.0 0 Union{Missing, Float64}

How do I do this so that the type becomes Float64?

Tamas_Papp · October 25, 2020, 8:34am

Since you are modifying the column in place, the element type of the container necessarily remains the same. There are many solutions, e.g.

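using DataFrames, StatsBase, Statistics
# Build a column that allows missing values, then blank out some random rows
df = DataFrame(:age => Vector{Union{Float64,Missing}}(rand(100)))
df.age[sample(axes(df, 1), 10)] .= missing
# Non-mutating replace allocates a new vector, which then replaces the column
df.age = replace(df.age, missing => median(skipmissing(df.age)))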
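
To make the difference between the two forms concrete, here is a minimal sketch using only Base and Statistics. The variable names `v`, `w`, and `m` are illustrative and not taken from the original question.

```julia
using Statistics

# A small vector whose element type allows missing values
v = Union{Missing, Float64}[1.0, missing, 3.0]
m = median(skipmissing(v))

# Non-mutating replace allocates a new vector and can narrow its element type
w = replace(v, missing => m)
println(eltype(w))

# Mutating replace! writes into the existing vector, so its element type cannot change
replace!(v, missing => m)
println(eltype(v))
```

The new vector `w` should report `Float64`, while `v` keeps `Union{Missing, Float64}` even though it no longer contains any missing entries.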

StatisticalMouse · October 25, 2020, 9:06am

Yes, indeed that works. I didn’t know that those two versions of replace work so differently. Thanks!

nilshg · October 25, 2020, 9:48am

I think you’re looking for disallowmissing!

StatisticalMouse · October 25, 2020, 10:24am

No. I’m looking for a way to replace missing values with reasonable guesstimates.
https://cran.r-project.org/web/views/MissingData.html

danielw2904 · October 25, 2020, 11:56am

Maybe this can help https://github.com/invenia/Impute.jl

nilshg · October 25, 2020, 2:07pm

Apologies, in this case I misunderstood - the title of the thread is “Dropping missing values in dataframes”, which I read as implying that you’re looking to drop the missing values.

In your OP you also mentioned that after replacing missing values “the type is still not correct”, showing an example of a vector of type Union{Missing, Float64}, and asked explicitly how to make the type Float64, which is what disallowmissing! can help you do.

Imputation is somewhat orthogonal to this - after imputation you might still have a Union type, and therefore might still want to call disallowmissing! to simplify subsequent analysis steps.
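
As a rough sketch of that combination, assuming a small made-up `age` column rather than the original data: impute in place first, then call `disallowmissing!` to narrow the column type.

```julia
using DataFrames, Statistics

# Made-up data for illustration only
df = DataFrame(age = [22.0, missing, 30.0])

# In-place imputation removes the missing values but leaves the Union element type
replace!(df.age, missing => median(skipmissing(df.age)))
println(eltype(df.age))

# disallowmissing! converts the column to its non-missing element type;
# it throws an error if any missing values remain
disallowmissing!(df, :age)
println(eltype(df.age))
```

After the second step the column should report `Float64`.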

StatisticalMouse · October 25, 2020, 2:34pm

I’m sorry that my questions sound a bit confusing. I’ve been trying to figure out how to work with dataframes using a pipe of some kind, could be @linq like below, or could be one of the other ones.

using DataFrames
using DataFramesMeta
using CSV
using Statistics

data = "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n87,0,3,\"Ford, Mr. William Neal\",male,16.0,1,3,W./C. 6608,34.375,,S\n89,1,1,\"Fortune, Miss. Mabel Helen\",female,23.0,3,2,19950,263.0,C23 C25 C27,S\n371,1,1,\"Harder, Mr. George Achilles\",male,25.0,1,0,11765,55.4417,E50,C\n421,0,3,\"Gheorgheff, Mr. Stanio\",male,,0,0,349254,7.8958,,C\n498,0,3,\"Shellard, Mr. Frederick William\",male,,0,0,C.A. 6212,15.1,,S\n511,1,3,\"Daly, Mr. Eugene Patrick\",male,29.0,0,0,382651,7.75,,Q\n538,1,1,\"LeRoy, Miss. Bertha\",female,30.0,0,0,PC 17761,106.425,,C\n627,0,2,\"Kirkland, Rev. Charles Leonard\",male,57.0,0,0,219533,12.35,,Q\n781,1,3,\"Ayoub, Miss. Banoura\",female,13.0,0,0,2687,7.2292,,C\n855,0,2,\"Carter, Mrs. Ernest Courtenay (Lilian Hughes)\",female,44.0,1,0,244252,26.0,,S\n"
df = CSV.read(IOBuffer(data))
df = @linq df |>
    deletecols([:Name, :SibSp, :Parch, :Cabin, :Embarked, :Ticket, :Fare]) |>
    rename(:PassengerId => :id, :Survived => :survived, :Pclass => :class, :Sex => :sex, :Age => :age)
df.age = replace(df.age, missing => median(skipmissing(df.age)))
df = @transform(df, survived = convert.(Bool, :survived))
df.sex = categorical(df.sex)
df.class = categorical(df.class)
df

The code above works, but ideally those transformations would all be in one pipe, and I couldn’t make that work. There is no particular need for it, except that it would look nice. Even then, it still would not be ideal.

pdeffebach · October 25, 2020, 2:38pm

Sorry, why can’t you put the replace inside @transform? That can all be done in one pipe.

StatisticalMouse · October 25, 2020, 2:39pm

What do you mean?

pdeffebach · October 25, 2020, 2:41pm

This can be written as

@transform(df, age = replace(:age, missing => median(skipmissing(:age))))

StatisticalMouse · October 25, 2020, 3:25pm

Ok, I got it. This works:

using DataFrames
using DataFramesMeta
using CSV
using Statistics

data = "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n87,0,3,\"Ford, Mr. William Neal\",male,16.0,1,3,W./C. 6608,34.375,,S\n89,1,1,\"Fortune, Miss. Mabel Helen\",female,23.0,3,2,19950,263.0,C23 C25 C27,S\n371,1,1,\"Harder, Mr. George Achilles\",male,25.0,1,0,11765,55.4417,E50,C\n421,0,3,\"Gheorgheff, Mr. Stanio\",male,,0,0,349254,7.8958,,C\n498,0,3,\"Shellard, Mr. Frederick William\",male,,0,0,C.A. 6212,15.1,,S\n511,1,3,\"Daly, Mr. Eugene Patrick\",male,29.0,0,0,382651,7.75,,Q\n538,1,1,\"LeRoy, Miss. Bertha\",female,30.0,0,0,PC 17761,106.425,,C\n627,0,2,\"Kirkland, Rev. Charles Leonard\",male,57.0,0,0,219533,12.35,,Q\n781,1,3,\"Ayoub, Miss. Banoura\",female,13.0,0,0,2687,7.2292,,C\n855,0,2,\"Carter, Mrs. Ernest Courtenay (Lilian Hughes)\",female,44.0,1,0,244252,26.0,,S\n"
df = CSV.read(IOBuffer(data))
df = @linq df |>
    deletecols([:Name, :SibSp, :Parch, :Cabin, :Embarked, :Ticket, :Fare]) |>
    rename(:PassengerId => :id, :Survived => :survived, :Pclass => :class, :Sex => :sex, :Age => :age) |>
    transform(age = replace(:age, missing => median(skipmissing(:age)))) |>
    transform(survived = convert.(Bool, :survived)) |>
    transform(sex = categorical(:sex)) |>
    transform(class = categorical(:class))

I find the documentation for this quite confusing. It’s in small pieces, while I would benefit more from having slightly larger examples like this.
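
For comparison, the same kind of pipeline can also be sketched with plain DataFrames functions and Base's `|>` operator, without DataFramesMeta. This is only an illustration: the three-row table `raw` and the helper `impute_median` are hypothetical stand-ins, not part of the code above, and the categorical conversions are left out to keep the dependencies minimal.

```julia
using DataFrames, Statistics

# Hypothetical three-row stand-in for the Titanic sample used in this thread
raw = DataFrame(PassengerId = [87, 421, 511],
                Survived = [0, 0, 1],
                Age = [16.0, missing, 29.0],
                Fare = [34.375, 7.8958, 7.75])

# Hypothetical helper: fill missing entries with the median of the observed ones
impute_median(col) = replace(col, missing => median(skipmissing(col)))

# Each stage is an anonymous function that takes a data frame and returns a new one
clean = raw |>
    (d -> select(d, Not(:Fare))) |>
    (d -> rename(d, :PassengerId => :id, :Survived => :survived, :Age => :age)) |>
    (d -> transform(d,
        :age => impute_median => :age,
        :survived => (s -> convert.(Bool, s)) => :survived))

println(clean)
```

Because `impute_median` uses the non-mutating `replace`, the resulting `age` column should have element type `Float64` rather than `Union{Missing, Float64}`.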

StatisticalMouse · October 25, 2020, 3:32pm

What I would normally do is try to find usage examples through Google. I did that here, but what I found were more small pieces that didn’t really help.

Here’s a radical idea: what if the documentation already contained the examples that people otherwise have to ask for on StackOverflow?

pdeffebach · October 25, 2020, 3:38pm

Yes, the DataFramesMeta docs certainly need an overhaul, and they will get one soon. The package is currently undergoing rapid development in preparation for a major release, and documentation will be part of that.

Tamas_Papp · October 26, 2020, 9:13am

It would be way too long, few people would read it, and users would still complain that it is missing some example someone asked on SO.

The purpose of good documentation is to introduce building blocks that allow you to craft your own solution after some investment into understanding. Examples are needed to help with that.
