Csv and dataframes and missing values

purplishrock · June 10, 2021, 11:00pm

Quite possibly one of the most beloved topics in this forum …
I have searched the forum and cannot find simple answers, or the answers are (literally) years old.

My csv file has strings and floats which might be missing but i want to substitute “” and 0.0. Is it possible to get CSV to do that automagically ?
I have not found a way reading through the CSV documentation.

Assuming that the automagic conversion can’t be done, then, if you are reading into a DataFrame, you can use dropmissing to get rid of rows where the offending column has a missing value. That is actually good enough most of the time, but now i have an instance where that’s not good enough.

So it looks like the best way to handle the situation is to use replace on the dataframe replacing values of missing with the desired value ? is that right ?

pdeffebach · June 10, 2021, 11:49pm

I don’t think there is a way to make the "" become 0 instead of missing when you read it in.

But rather than replace you can use coalesce.(df.x, 0)

purplishrock · June 10, 2021, 11:54pm

I meant that something like


a,b,,1.0,2.0
a,b,c,,3.0

should be read as

“a”,“b”,“”,1.0,2.0
“a”,“b”,“c”,0.0,3.0

so not “” → 0.0 but
missing string → “”
missing float64 → 0.0

thanks for the hint about coalesce, i didn’t know about that function, I’ll go read up on it.

pdeffebach · June 11, 2021, 12:06am

oh i see. No, there is not a way to do this. You will have to use coalesce.

rafael.guerra · June 11, 2021, 10:45am

@purplishrock, shouldn’t a proper csv file have at least empty slots for the missing data?
Example:

a,b,,1.0,2.0
a,b,c,,3.0

These would then load as missings:

julia> CSV.File(file; header=false)
2-element CSV.File{false}:
 CSV.Row: (Column1 = "a", Column2 = "b", Column3 = missing, Column4 = 1.0, Column5 = 2.0)
 CSV.Row: (Column1 = "a", Column2 = "b", Column3 = "c", Column4 = missing, Column5 = 3.0)

purplishrock · June 12, 2021, 7:23pm

yes, that’s right. Without using “preformatted text”, the extra commas i had in my example didn’t show up. You should now see them.

rafael.guerra · June 12, 2021, 7:36pm

So, in that case as indicated by @pdeffebach, you can use coalesce:

df = CSV.File(file; header=false) |> DataFrame
df[:,3] .= coalesce.(df[:,3], "")
df[:,4] .= coalesce.(df[:,4], 0)
julia> df
2×5 DataFrame
│ Row │ Column1 │ Column2 │ Column3 │ Column4  │ Column5 │
│     │ String  │ String  │ String? │ Float64? │ Float64 │
├─────┼─────────┼─────────┼─────────┼──────────┼─────────┤
│ 1   │ a       │ b       │         │ 1.0      │ 2.0     │
│ 2   │ a       │ b       │ c       │ 0.0      │ 3.0     │

kevbonham · June 12, 2021, 8:19pm

And if you want to change the eltype of the column to exclude Missing, just do = instead of .=, at the cost of reallocating

purplishrock · June 12, 2021, 10:46pm

Thank you @kevbonham ! I was wondering about that.

I am sort of wondering , given that you can provide an optional arg to set the expected type in CSV, why you cannot also provide an array which is “use this value in place of missing” as an optional arg.

pdeffebach · June 12, 2021, 11:05pm

Yeah probably. Feel free to file an issue on github.

Topic		Replies	Views
Read NULL as missing in dataframe General Usage dataframes , csv	1	972	June 30, 2021
Replace Character in a column New to Julia dataframes	4	1103	August 24, 2019
How to read a table with missing values using CSV? Data package , data	9	7381	February 20, 2018
Replacing missing and NaN values in dataframe New to Julia question , dataframes , missing-values	6	4103	March 29, 2022
Dealing with missing values in DataFrame (again) Data	5	1272	August 23, 2020

Csv and dataframes and missing values

Related topics