I have a DataFrame where missing values are indicated by somewhat random strings instead of real numbers, so the only way to know that a value is missing is to test whether it is not a Float64. I am trying to replace these random strings with the “proper” missing indicator.
I found a satisfactory (to me!) method to do this one column at a time, but I find the resulting code really not easy to read when I try to apply it to all variables in the DataFrame.
Here is an example:
df = DataFrame(a = [1.0, 2.0, "N/A", 3.0, "bad"], b = [4.0, "error", 5.0, 6.0, 7.0])
# One column at a time (satisfactorily readable to me)
df.a[typeof.(df.a) .!= Float64] .= missing
# For all columns (not very readable to me, because of the df[:,colname] syntax)
for colname in names(df)
    df[typeof.(df[:, colname]) .!= Float64, colname] .= missing
end
I wish I could use the df.colname syntax in the second version, when colname is in a string variable, but I can’t figure out how to do that. And writing df[:, colname] leads to the above code, which I, for one, do not find very readable. In Matlab, I could have written df.(colname), adding parentheses to indicate that colname is not the actual name of the column but a string variable containing that column name.
I am open to solutions with transform() or @transform, but to me (coming from Matlab) what I have found so far is even less readable.
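As a side note for readers with the same question: in DataFrames.jl, df[!, colname] accepts a string variable and returns the actual column vector without copying, so it can be mutated in place. This is arguably the closest analogue of Matlab's df.(colname). A minimal sketch (the Any element type matches the example data above):

```julia
using DataFrames

df = DataFrame(a = Any[1.0, 2.0, "N/A", 3.0, "bad"],
               b = Any[4.0, "error", 5.0, 6.0, 7.0])

for colname in names(df)
    # df[!, colname] is the column itself (no copy), unlike df[:, colname]
    col = df[!, colname]
    col[typeof.(col) .!= Float64] .= missing
end
```

Note this keeps the columns as Vector{Any}; it only removes the double-indexing syntax from the loop body.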
Where is your data coming from? You should almost certainly move this tidying step further up your pipeline (e.g., when parsing a CSV) rather than ending up with a DataFrame whose columns have eltype Any.
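For example, if the data comes from a CSV file, CSV.jl's missingstring keyword can do this at parse time, so the columns arrive as Float64? rather than Any. A sketch with inline data standing in for a file (the sentinel strings are just the ones from the question; substitute whatever your file actually uses):

```julia
using CSV, DataFrames

csv_data = """
a,b
1.0,4.0
2.0,error
N/A,5.0
3.0,6.0
bad,7.0
"""

# Every token listed in missingstring is parsed as missing,
# so each column gets a concrete Union{Missing, Float64} eltype.
df = CSV.read(IOBuffer(csv_data), DataFrame;
              missingstring = ["N/A", "bad", "error"])
```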
julia> replace_non_floats(x) = [xᵢ isa Number ? xᵢ : missing for xᵢ ∈ x]
replace_non_floats (generic function with 1 method)
julia> mapcols!(replace_non_floats, df)
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │ 1.0        4.0
   2 │ 2.0        missing
   3 │ missing    5.0
   4 │ 3.0        6.0
   5 │ missing    7.0
This has the additional benefit of creating a narrower column type where possible (your code keeps the columns as Any, which, as Joseph says, is usually a bad idea).