I have a DataFrame where missing values are indicated by somewhat random strings instead of real numbers, so the only way to know that a value is missing is to test whether it is not a Float64. I am trying to replace these random strings with the “proper” missing indicator.
I found a satisfactory (to me!) method to do this one column at a time, but I find the resulting code really not easy to read when I try to apply it to all variables in the DataFrame.
Here is an example:
df = DataFrame(a = [1.0, 2.0, "N/A", 3.0, "bad"], b = [4.0, "error", 5.0, 6.0, 7.0])
# One column at a time (satisfactorily readable to me)
df.a[typeof.(df.a) .!= Float64] .= missing
# For all columns (not very readable to me, because of the df[:,colname] syntax)
for colname in names(df)
    df[typeof.(df[:, colname]) .!= Float64, colname] .= missing
end
I wish I could use the df.colname syntax in the second version, when colname is in a string variable, but I can’t figure out how to do that. And writing df[:, colname] leads to the above code, which I, for one, do not find very readable. In Matlab, I could have written df.(colname), adding parentheses to indicate that colname is not the actual name of the column but a string variable containing that column name.
I am open to solutions with transform() or @transform, but to me (coming from Matlab) what I have found so far is even less readable.
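As a side note for readers with the same question: in DataFrames.jl, df[!, colname] accepts a string variable and returns the actual column vector without copying, so it can be mutated in place. This is arguably the closest analogue of Matlab's df.(colname). A minimal sketch (the Any element type matches the example data above):

```julia
using DataFrames

df = DataFrame(a = Any[1.0, 2.0, "N/A", 3.0, "bad"],
               b = Any[4.0, "error", 5.0, 6.0, 7.0])

for colname in names(df)
    # df[!, colname] is the column itself (no copy), unlike df[:, colname]
    col = df[!, colname]
    col[typeof.(col) .!= Float64] .= missing
end
```

Note this keeps the columns as Vector{Any}; it only removes the double-indexing syntax from the loop body.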
Where is your data coming from? You should almost certainly move this tidying step further up your pipeline (e.g., when parsing a CSV) rather than ending up with a DataFrame whose columns have eltype Any.
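For example, if the data comes from a CSV file, CSV.jl's missingstring keyword can do this at parse time, so the columns arrive as Float64? rather than Any. A sketch with inline data standing in for a file (the sentinel strings are just the ones from the question; substitute whatever your file actually uses):

```julia
using CSV, DataFrames

csv_data = """
a,b
1.0,4.0
2.0,error
N/A,5.0
3.0,6.0
bad,7.0
"""

# Every token listed in missingstring is parsed as missing,
# so each column gets a concrete Union{Missing, Float64} eltype.
df = CSV.read(IOBuffer(csv_data), DataFrame;
              missingstring = ["N/A", "bad", "error"])
```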
julia> replace_non_floats(x) = [xᵢ isa Number ? xᵢ : missing for xᵢ ∈ x]
replace_non_floats (generic function with 1 method)
julia> mapcols!(replace_non_floats, df)
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │ 1.0        4.0
   2 │ 2.0        missing
   3 │ missing    5.0
   4 │ 3.0        6.0
   5 │ missing    7.0
This has the additional benefit of creating a narrower column type where possible (your code keeps the columns as Any, which, as Joseph says, is usually a bad idea).