Hi,
What’s the best way to fix the anomaly in column c?
test = DataFrame(a = 1:5, b = rand(5), c=[2,5,5,"%42,,",5])
Hi,
What’s the best way to fix the anomaly in column c?
test = DataFrame(a = 1:5, b = rand(5), c=[2,5,5,"%42,,",5])
Maybe your issue is caused by some loading pipeline? What about fixing the loading pipeline so that the dataframe has proper columns from the beginning?
If that is not possible, you need to define what “fix” means for you. We can then provide links to the sections of the DataFrames.jl documentation to help you.
Thank you for the prompt answer.
Let’s assume that the data is imported from Excel.
Assume additionally that this is a large data set and manual modifications are not possible
By fix, I mean:
likzew
Are you using XLSX.jl to import the data? Did you check if they provide any functionality to parse the cells according to a specific idiom?
You can check DataFrames.jl docs for the replace
and dropmissing
functions.
By reading the DataFrames.jl
docs, we may get something like this:
dropmissing(ifelse.(isa.(df, (Number,)), df, missing))
However, the following seems to perform better:
df[vec(all(Matrix{Bool}(isa.(df, (Number,))), dims=2)), :]
What is the recommended way to drop rows with non-numeric entries?
Thank you kindly,
I fiqure out somthing like that:
To detect:
test.c[isa.(test.c,String)]
To replace
test.c[isa.(test.c,String)].=missing
To clean:
test! = dropmissing(test)
It seems that working.
Seems a bit redundant then to set the value to missing first? Why not just
test[.!isa.(test.c, String), :]
What about the general case where non-numeric entries can be anywhere?
Maybe
any.(eachrow(.!isa.(test, Number)))
It is shorter and nice but taking the eachrow
path doesn’t seem as performant as creating the Matrix{Bool}
.
I would hope that no one ever has to do this in a hot loop!
In DataFramesMeta.jl you would do
@rsubset df !(:c isa String)
Regarding this mutation:
test.c[isa.(test.c,String)].=missing
I want to point out that the update is done in place, and the element type of the column is still Vector{Any}
. And, that would not be performant with subsequent processing. Another option is to just build and mutate the column:
test.c = [x isa Number ? x : missing for x in test.c]
After that, you can still use dropmissing!
mutate the existing data frame.
In any case, it would be best if you can avoid loading in junk data into the data frame in the first place.
f(x) = typeof(x) <: Number
subset(test, Cols(:) .=> ByRow(f))