DataFrame and Missings.replace()

Why does this example (adapted from the DataFrames documentation) work:

julia> x = [1, 2, missing]
3-element Array{Union{Int64, Missings.Missing},1}:
 1       
 2       
  missing

julia> df = collect(Missings.replace(x, 1))
3-element Array{Int64,1}:
 1
 2
 1

and this not?:

input_file = "/home/js/Downloads/data-1512997404715.csv"
df = CSV.read(input_file, nullable=true)
x = collect(Missings.replace(df, "\\N"))
ERROR: MethodError: no method matching start(::DataFrames.DataFrame)
Closest candidates are:
  start(::SimpleVector) at essentials.jl:258
  start(::Base.MethodList) at reflection.jl:560
  start(::ExponentialBackOff) at error.jl:107

Stacktrace:
[1] copy!(::Array{Any,1}, ::Missings.EachReplaceMissing{DataFrames.DataFrame,String}) at ./abstractarray.jl:573
[2] _collect(::UnitRange{Int64}, ::Missings.EachReplaceMissing{DataFrames.DataFrame,String}, ::Base.HasEltype, ::Base.HasLength) at ./array.jl:437
[3] collect(::Missings.EachReplaceMissing{DataFrames.DataFrame,String}) at ./array.jl:431

I am building a script that create a sql dumpfile from a csv and want to replace all the missing values with "\N"

Regards
Johann

The top-most level of the dataframe object / struct isn’t the iterable part

df1 = DataFrame(a=collect(1:10), b= collect(10:-1:1));
d2 = collect(Missings.replace(df.columns, 0));
df2 = DataFrame(a = d2[1], b= d2[2])
> Row | x1 | x2
> ...

Thanks. You inspired me to look at the source and this seems to work now where df is the original dataframe:

df2 = DataFrame(collect(Missings.replace(df.columns, "\\N")), names(df))

It doesn’t seem to be working properly though

df1 = DataFrame(a=[missing; collect(1:10)], b=[collect(10:-1:1); missing]);
d2 = collect(Missings.replace(df.columns, 0)); ## missing columns
d2 = collect(Missings.replace(df.columns[1], 0)); ## one column
d4= [collect(Missings.replace(a, 0)) for a in df.columns];

Correct. My “solution” did not produce an error message but it did not do the job.

with

[collect(Missings.replace(a, 0)) for a in df.columns]

I now get

MethodError: Cannot `convert` an object of type String to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.

Stacktrace:
 [1] replace(::Array{Union{Int64, Missings.Missing},1}, ::String) at /home/js/.julia/v0.6/Missings/src/Missings.jl:276
 [2] collect_to!(::Array{CategoricalArrays.CategoricalArray{String,1,UInt32,String,CategoricalArrays.CategoricalString{UInt32},Union{}},1}, ::Base.Generator{Array{Any,1},##21#22}, ::Int64, ::Int64) at ./array.jl:508
 [3] collect(::Base.Generator{Array{Any,1},##21#22}) at ./array.jl:476
 [4] include_string(::String, ::String) at ./loading.jl:522

Maybe I should first convert all the values in the dataframe to strings before trying this.
In the end it would be written as strings to an pg_dump file anyhow.
Not that I know how to do it at this stage. But I will find out.

Thanks for thinking with me.

It’s close but still not quite

d = Dict([Union{Int64, Missing}=>0, Union{String, Missing}=>""])
d2 = [collect(Missings.replace(a, d[eltype(a)])) for a in df.columns];
> key union{Int64, Missing} not found 

Passing a full set of replacement missing values (one per column), as in for a,b in df.columns, dfmissingvec = [0, 0, "", "etc"] should be okay though

My problem is that I will not know beforehand what the types of the columns will be.

You can try something like:

for (v, name) in eachcol(df)
    df[name] = collect(Missings.replace(v, "\\N"))
end

collect(Missings.replace(v, "\\N")) can also be coalesce.(v, "\\N") or recode(v, missing => "\\N") (the latter is in CategoricalArrays). But if you don’t know the type of the columns in advance I don’t see how you could choose an appropriate replacement…

1 Like

Thanks. However that will only work on columns with a String type.

With names(df) I get an array of column headers. How can I use that to create a second dataframe with those names but of the type String for each column? If I can do that, I can possibly copy the non strings values to the second dataframe with string(v)?

I think it v and name should be replaced in the for statement:

for (name, v) in eachcol(df)
    df[name] = collect(Missings.replace(v, "\\N"))
end
1 Like

I like this solution that I picked up from far down the page in this stackoverflow:

coalesce.(df, 0)

# or rather 
x = coalesce.(df, "\\N")

I’d prefer to use replace and the pairs definition of the transform like the following, but the SO example only works for a single column, the coalesce brodcast makes the replacement for every column every row that has missing.

replace!(df.x, missing => 0)