Dropping missing values in dataframes

Calling describe on the dataframe shows this for the age column:
5 age 29.6991 0.42 28.0 80.0 177 Union{Missing, Float64}

I can get rid of the missing values with this:
replace!(train_df.age, missing => median(skipmissing(train_df.age)));

However, the type is still not correct:
5 age 29.3616 0.42 28.0 80.0 0 Union{Missing, Float64}

How do I do this so that the type becomes Float64?

Since replace! modifies the vector in place, the element type of the container remains the same. There are many solutions, e.g. the non-mutating replace, which allocates a new vector with a narrower element type:

using DataFrames, StatsBase, Statistics
df = DataFrame(:age => Vector{Union{Float64,Missing}}(rand(100)))
df.age[sample(axes(df, 1), 10)] .= missing
df.age = replace(df.age, missing => median(skipmissing(df.age)))

Yes, indeed that works. I didn’t know that those two versions of replace work so differently. Thanks!
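
For readers who want to see the difference in isolation, here is a minimal sketch using only Base and the Statistics standard library. The vector and the names v, w, and u are made up for illustration; they are not from the original data.

```julia
using Statistics

# A small vector that allows missing values (illustrative data only).
v = Union{Missing, Float64}[1.0, missing, 3.0, 5.0]
m = median(skipmissing(v))

# In-place version: the values change, but the container keeps its element type.
w = copy(v)
replace!(w, missing => m)
println(eltype(w))

# Non-mutating version: a new vector is allocated, and because the pair
# replaces `missing`, Missing is dropped from the element type.
u = replace(v, missing => m)
println(eltype(u))
```

Assigning the result back with df.age = replace(df.age, ...) is what swaps the narrower vector into the dataframe.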

I think you’re looking for disallowmissing!

No. I’m looking for replacing missing values with reasonable guesstimates.
https://cran.r-project.org/web/views/MissingData.html

Maybe this can help: https://github.com/invenia/Impute.jl

Apologies, in this case I misunderstood - the title of the thread is “Dropping missing values in dataframes”, which I read as implying that you’re looking to drop the missing values.

In your OP you also mentioned that after replacing missing values “the type is still not correct”, showing an example of a vector of type Union{Missing, Float64} and asked explicitly about how you can make the type Float64, which is what disallowmissing! can help you do.

Imputation is somewhat orthogonal to this - after imputation you might still have a Union type, and therefore might still want to call disallowmissing! to simplify subsequent analysis steps.
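
A minimal sketch of that two-step approach, assuming a recent DataFrames version and a toy age column invented for illustration: impute in place first, then call disallowmissing! to narrow the column type.

```julia
using DataFrames, Statistics

# Toy data (hypothetical), with one missing age.
df = DataFrame(age = [22.0, missing, 38.0, 26.0])

# Step 1: impute in place. The column still allows missing values afterwards.
replace!(df.age, missing => median(skipmissing(df.age)))
println(eltype(df.age))

# Step 2: narrow the column type. This would throw an error if any
# missing values were still present.
disallowmissing!(df, :age)
println(eltype(df.age))
```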


I’m sorry that my questions sound a bit confusing. I’ve been trying to figure out how to work with dataframes using a pipe of some kind, could be @linq like below, or could be one of the other ones.

using DataFrames
using DataFramesMeta
using CSV
using Statistics

data = "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n87,0,3,\"Ford, Mr. William Neal\",male,16.0,1,3,W./C. 6608,34.375,,S\n89,1,1,\"Fortune, Miss. Mabel Helen\",female,23.0,3,2,19950,263.0,C23 C25 C27,S\n371,1,1,\"Harder, Mr. George Achilles\",male,25.0,1,0,11765,55.4417,E50,C\n421,0,3,\"Gheorgheff, Mr. Stanio\",male,,0,0,349254,7.8958,,C\n498,0,3,\"Shellard, Mr. Frederick William\",male,,0,0,C.A. 6212,15.1,,S\n511,1,3,\"Daly, Mr. Eugene Patrick\",male,29.0,0,0,382651,7.75,,Q\n538,1,1,\"LeRoy, Miss. Bertha\",female,30.0,0,0,PC 17761,106.425,,C\n627,0,2,\"Kirkland, Rev. Charles Leonard\",male,57.0,0,0,219533,12.35,,Q\n781,1,3,\"Ayoub, Miss. Banoura\",female,13.0,0,0,2687,7.2292,,C\n855,0,2,\"Carter, Mrs. Ernest Courtenay (Lilian Hughes)\",female,44.0,1,0,244252,26.0,,S\n"
df = CSV.read(IOBuffer(data))
df = @linq df |>
    deletecols([:Name, :SibSp, :Parch, :Cabin, :Embarked, :Ticket, :Fare]) |>
    rename(:PassengerId => :id, :Survived => :survived, :Pclass => :class, :Sex => :sex, :Age => :age)
df.age = replace(df.age, missing => median(skipmissing(df.age)))
df = @transform(df, survived = convert.(Bool, :survived))
df.sex = categorical(df.sex)
df.class = categorical(df.class)
df

The code above works, but ideally those transformations would all be in one pipe, and I couldn’t make that work. There is no particular need for it, except that it would look nicer. Even then, it still would not be ideal.

Sorry, why can’t you put the replace inside @transform? That can all be done in one pipe.

What do you mean?

This can be written as

@transform(df, age = replace(:age, missing => median(skipmissing(:age))))

Ok, I got it. This works:

using DataFrames
using DataFramesMeta
using CSV
using Statistics

data = "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n87,0,3,\"Ford, Mr. William Neal\",male,16.0,1,3,W./C. 6608,34.375,,S\n89,1,1,\"Fortune, Miss. Mabel Helen\",female,23.0,3,2,19950,263.0,C23 C25 C27,S\n371,1,1,\"Harder, Mr. George Achilles\",male,25.0,1,0,11765,55.4417,E50,C\n421,0,3,\"Gheorgheff, Mr. Stanio\",male,,0,0,349254,7.8958,,C\n498,0,3,\"Shellard, Mr. Frederick William\",male,,0,0,C.A. 6212,15.1,,S\n511,1,3,\"Daly, Mr. Eugene Patrick\",male,29.0,0,0,382651,7.75,,Q\n538,1,1,\"LeRoy, Miss. Bertha\",female,30.0,0,0,PC 17761,106.425,,C\n627,0,2,\"Kirkland, Rev. Charles Leonard\",male,57.0,0,0,219533,12.35,,Q\n781,1,3,\"Ayoub, Miss. Banoura\",female,13.0,0,0,2687,7.2292,,C\n855,0,2,\"Carter, Mrs. Ernest Courtenay (Lilian Hughes)\",female,44.0,1,0,244252,26.0,,S\n"
df = CSV.read(IOBuffer(data))
df = @linq df |>
    deletecols([:Name, :SibSp, :Parch, :Cabin, :Embarked, :Ticket, :Fare]) |>
    rename(:PassengerId => :id, :Survived => :survived, :Pclass => :class, :Sex => :sex, :Age => :age) |>
    transform(age = replace(:age, missing => median(skipmissing(:age)))) |>
    transform(survived = convert.(Bool, :survived)) |>
    transform(sex = categorical(:sex)) |>
    transform(class = categorical(:class))
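
As a rough sketch of the same idea without any macros, the renaming and column transformations can also be expressed in a single select call from plain DataFrames. This assumes a recent DataFrames version; the three-row table and the name clean are made up for illustration, and the categorical conversions are left out to keep the example dependency-free.

```julia
using DataFrames, Statistics

# Toy stand-in for the Titanic sample (hypothetical rows).
raw = DataFrame(PassengerId = [87, 421, 371],
                Survived = [0, 1, 1],
                Age = [16.0, missing, 25.0])

# Each argument is `source => new_name` or `source => function => new_name`.
# Columns that are not listed are dropped, so no separate delete step is needed.
clean = select(raw,
    :PassengerId => :id,
    :Survived => (s -> convert.(Bool, s)) => :survived,
    :Age => (a -> replace(a, missing => median(skipmissing(a)))) => :age)

println(clean)
```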

I find the documentation for this quite confusing. It’s in small pieces, while I would benefit more from having slightly larger examples like this.

What I would normally do is try to find usage examples through Google. I did that here, but what I found were more small pieces that didn’t really help.

Here’s a radical idea: what if the documentation already contained the examples that people otherwise need to ask for on StackOverflow?

Yes, the DataFramesMeta docs certainly need an overhaul, and they will get one soon. The package is currently undergoing rapid development in preparation for a major release, and documentation will be part of that.


It would be way too long, few people would read it, and users would still complain that it is missing some example someone asked on SO.

The purpose of good documentation is to introduce building blocks that allow you to craft your own solution after some investment into understanding. Examples are needed to help with that.
