How to clean/filter/remove wrong data from DataFrame

Hi,
What’s the best way to fix the anomaly in column c?

test = DataFrame(a = 1:5, b = rand(5), c=[2,5,5,"%42,,",5])

Maybe your issue is caused by some loading pipeline? What about fixing the loading pipeline so that the dataframe has proper columns from the beginning?

If that is not possible, you need to define what “fix” means for you. We can then provide links to the sections of the DataFrames.jl documentation to help you.

1 Like

Thank you for the prompt answer.

Let’s assume that the data is imported from Excel.
Assume additionally that this is a large data set and manual modifications are not possible
By fix, I mean:

  • dropping rows,
  • replacement by missings
  • replacement by NaNs

likzew

Are you using XLSX.jl to import the data? Did you check if they provide any functionality to parse the cells according to a specific idiom?

You can check DataFrames.jl docs for the replace and dropmissing functions.

By reading the DataFrames.jl docs, we may get something like this:

dropmissing(ifelse.(isa.(df, (Number,)), df, missing))

However, the following seems to perform better:

df[vec(all(Matrix{Bool}(isa.(df, (Number,))), dims=2)), :]

What is the recommended way to drop rows with non-numeric entries?

2 Likes

Thank you kindly,

I fiqure out somthing like that:

To detect:

test.c[isa.(test.c,String)]

To replace

test.c[isa.(test.c,String)].=missing

To clean:

test! = dropmissing(test)

It seems that working.

Seems a bit redundant then to set the value to missing first? Why not just

test[.!isa.(test.c, String), :]
2 Likes

What about the general case where non-numeric entries can be anywhere?

Maybe

any.(eachrow(.!isa.(test, Number)))

It is shorter and nice but taking the eachrow path doesn’t seem as performant as creating the Matrix{Bool}.

I would hope that no one ever has to do this in a hot loop!

In DataFramesMeta.jl you would do

@rsubset df !(:c isa String)

Regarding this mutation:

test.c[isa.(test.c,String)].=missing

I want to point out that the update is done in place, and the element type of the column is still Vector{Any}. And, that would not be performant with subsequent processing. Another option is to just build and mutate the column:

test.c = [x isa Number ? x : missing for x in test.c]

After that, you can still use dropmissing! mutate the existing data frame.

In any case, it would be best if you can avoid loading in junk data into the data frame in the first place.

f(x) = typeof(x) <: Number
subset(test, Cols(:) .=> ByRow(f))