Csv and dataframes and missing values

Quite possibly one of the most beloved topics in this forum …
I have searched the forum and cannot find simple answers, or the answers are (literally) years old.

My csv file has strings and floats which might be missing but i want to substitute “” and 0.0. Is it possible to get CSV to do that automagically ?
I have not found a way reading through the CSV documentation.

Assuming that the automagic conversion can’t be done, then, if you are reading into a DataFrame, you can use dropmissing to get rid of rows where the offending column has a missing value. That is actually good enough most of the time, but now i have an instance where that’s not good enough.

So it looks like the best way to handle the situation is to use replace on the dataframe replacing values of missing with the desired value ? is that right ?

1 Like

I don’t think there is a way to make the "" become 0 instead of missing when you read it in.

But rather than replace you can use coalesce.(df.x, 0)

3 Likes

I meant that something like


a,b,,1.0,2.0
a,b,c,,3.0

should be read as

“a”,“b”,“”,1.0,2.0
“a”,“b”,“c”,0.0,3.0

so not “” → 0.0 but
missing string → “”
missing float64 → 0.0

thanks for the hint about coalesce, i didn’t know about that function, I’ll go read up on it.

oh i see. No, there is not a way to do this. You will have to use coalesce.

@purplishrock, shouldn’t a proper csv file have at least empty slots for the missing data?
Example:

a,b,,1.0,2.0
a,b,c,,3.0

These would then load as missings:

julia> CSV.File(file; header=false)
2-element CSV.File{false}:
 CSV.Row: (Column1 = "a", Column2 = "b", Column3 = missing, Column4 = 1.0, Column5 = 2.0)
 CSV.Row: (Column1 = "a", Column2 = "b", Column3 = "c", Column4 = missing, Column5 = 3.0)
2 Likes

yes, that’s right. Without using “preformatted text”, the extra commas i had in my example didn’t show up. You should now see them.

So, in that case as indicated by @pdeffebach, you can use coalesce:

df = CSV.File(file; header=false) |> DataFrame
df[:,3] .= coalesce.(df[:,3], "")
df[:,4] .= coalesce.(df[:,4], 0)
julia> df
2×5 DataFrame
│ Row │ Column1 │ Column2 │ Column3 │ Column4  │ Column5 │
│     │ String  │ String  │ String? │ Float64? │ Float64 │
├─────┼─────────┼─────────┼─────────┼──────────┼─────────┤
│ 1   │ a       │ b       │         │ 1.0      │ 2.0     │
│ 2   │ a       │ b       │ c       │ 0.0      │ 3.0     │
1 Like

And if you want to change the eltype of the column to exclude Missing, just do = instead of .=, at the cost of reallocating

2 Likes

Thank you @kevbonham ! I was wondering about that.

I am sort of wondering , given that you can provide an optional arg to set the expected type in CSV, why you cannot also provide an array which is “use this value in place of missing” as an optional arg.

Yeah probably. Feel free to file an issue on github.