How to clean/filter/remove wrong data from DataFrame

likzew · February 20, 2022, 12:04pm

Hi,
What’s the best way to fix the anomaly in column c?

test = DataFrame(a = 1:5, b = rand(5), c=[2,5,5,"%42,,",5])

juliohm · February 20, 2022, 12:30pm

Maybe your issue is caused by some loading pipeline? What about fixing the loading pipeline so that the dataframe has proper columns from the beginning?

If that is not possible, you need to define what “fix” means for you. We can then provide links to the sections of the DataFrames.jl documentation to help you.

likzew · February 20, 2022, 12:44pm

Thank you for the prompt answer.

Let’s assume that the data is imported from Excel.
Assume additionally that this is a large data set, so manual modifications are not possible.
By “fix”, I mean one of:

dropping rows,
replacement by missing,
replacement by NaN.

likzew

juliohm · February 20, 2022, 1:31pm

Are you using XLSX.jl to import the data? Did you check if they provide any functionality to parse the cells according to a specific idiom?

You can check DataFrames.jl docs for the replace and dropmissing functions.
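
As a minimal sketch of that suggestion, the example below applies Base `replace` to the single column from the original post and then calls `dropmissing`. It assumes the bad value is known in advance; the names `cleaned` and `result` are only illustrative.

```julia
using DataFrames

test = DataFrame(a = 1:5, b = rand(5), c = [2, 5, 5, "%42,,", 5])

# Work on a copy and replace the known bad value with `missing`.
cleaned = copy(test)
cleaned.c = replace(cleaned.c, "%42,," => missing)

# Drop the rows where column c is missing.
result = dropmissing(cleaned, :c)
```

This only helps when the junk values can be listed explicitly; for arbitrary non-numeric entries a type check such as `x isa Number` is needed instead.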

rafael.guerra · February 20, 2022, 2:11pm

By reading the DataFrames.jl docs, we may get something like this:

dropmissing(ifelse.(isa.(df, (Number,)), df, missing))

However, the following seems to perform better:

df[vec(all(Matrix{Bool}(isa.(df, (Number,))), dims=2)), :]

What is the recommended way to drop rows with non-numeric entries?

likzew · February 20, 2022, 2:24pm

Thank you kindly,

I figured out something like this:

To detect:

test.c[isa.(test.c,String)]

To replace

test.c[isa.(test.c,String)].=missing

To clean:

dropmissing!(test)

It seems to work.

nilshg · February 20, 2022, 4:59pm

Seems a bit redundant then to set the value to missing first? Why not just

test[.!isa.(test.c, String), :]

rafael.guerra · February 20, 2022, 5:12pm

What about the general case where non-numeric entries can be anywhere?

nilshg · February 20, 2022, 5:59pm

Maybe

any.(eachrow(.!isa.(test, Number)))
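
The expression above flags the rows that contain a non-numeric entry. A rough sketch of turning such a mask into a row filter, written with a plain comprehension rather than broadcasting, could look like this. The data frame `df` and the names `keep` and `clean` are made up for illustration.

```julia
using DataFrames

# Hypothetical data with junk in two different columns.
df = DataFrame(a = Any[1, "x", 3, 4],
               b = [0.1, 0.2, 0.3, 0.4],
               c = Any[2, 5, "%42,,", 5])

# `true` for every row in which all entries are numbers.
keep = [all(x -> x isa Number, row) for row in eachrow(df)]

# Keep only the fully numeric rows.
clean = df[keep, :]
```

Here `keep` is the negation of the mask above, so it can be used directly for indexing.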

rafael.guerra · February 20, 2022, 6:32pm

It is shorter and nicer, but taking the eachrow path doesn’t seem as performant as creating the Matrix{Bool}.

nilshg · February 20, 2022, 6:58pm

I would hope that no one ever has to do this in a hot loop!

pdeffebach · February 20, 2022, 7:33pm

In DataFramesMeta.jl you would do

@rsubset df !(:c isa String)

tk3369 · February 21, 2022, 12:07am

Regarding this mutation:

test.c[isa.(test.c,String)].=missing

I want to point out that the update is done in place, so the column is still a Vector{Any}, and that would not be performant in subsequent processing. Another option is to build a new column and replace the old one:

test.c = [x isa Number ? x : missing for x in test.c]

After that, you can still use dropmissing! to mutate the existing data frame.
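
A small sketch of the point about element types, using the data frame from the original post. The variable names `before`, `during`, `after` and `nan_col` are illustrative only, and the NaN variant at the end is an added assumption about how the "replacement by NaNs" option might be done.

```julia
using DataFrames

test = DataFrame(a = 1:5, b = rand(5), c = [2, 5, 5, "%42,,", 5])
before = eltype(test.c)   # Any, because the column mixes Int and String

# Rebuilding the column lets Julia narrow the element type.
test.c = [x isa Number ? x : missing for x in test.c]
during = eltype(test.c)   # Union{Missing, Int}

# dropmissing! removes the row in place; with its default
# disallowmissing=true the column no longer allows missing.
dropmissing!(test)
after = eltype(test.c)    # Int

# NaN variant: convert valid entries to floating point so the
# resulting vector has a concrete Float64 element type.
nan_col = [x isa Number ? float(x) : NaN for x in Any[2, 5, 5, "%42,,", 5]]
```

Note that NaN only exists for floating-point types, so this variant turns an integer column into a Float64 one.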

In any case, it would be best if you can avoid loading in junk data into the data frame in the first place.

DataFrames · February 21, 2022, 12:52am

f(x) = typeof(x) <: Number
subset(test, Cols(:) .=> ByRow(f))
