DataFrame and Missings.replace()


#1

Why does this example (adapted from the DataFrames documentation) work:

julia> x = [1, 2, missing]
3-element Array{Union{Int64, Missings.Missing},1}:
 1       
 2       
  missing

julia> df = collect(Missings.replace(x, 1))
3-element Array{Int64,1}:
 1
 2
 1

and this not?:

input_file = "/home/js/Downloads/data-1512997404715.csv"
df = CSV.read(input_file, nullable=true)
x = collect(Missings.replace(df, "\\N"))
ERROR: MethodError: no method matching start(::DataFrames.DataFrame)
Closest candidates are:
  start(::SimpleVector) at essentials.jl:258
  start(::Base.MethodList) at reflection.jl:560
  start(::ExponentialBackOff) at error.jl:107

Stacktrace:
[1] copy!(::Array{Any,1}, ::Missings.EachReplaceMissing{DataFrames.DataFrame,String}) at ./abstractarray.jl:573
[2] _collect(::UnitRange{Int64}, ::Missings.EachReplaceMissing{DataFrames.DataFrame,String}, ::Base.HasEltype, ::Base.HasLength) at ./array.jl:437
[3] collect(::Missings.EachReplaceMissing{DataFrames.DataFrame,String}) at ./array.jl:431

I am building a script that create a sql dumpfile from a csv and want to replace all the missing values with "\N"

Regards
Johann

#2

The top-most level of the dataframe object / struct isn’t the iterable part

df1 = DataFrame(a=collect(1:10), b= collect(10:-1:1));
d2 = collect(Missings.replace(df.columns, 0));
df2 = DataFrame(a = d2[1], b= d2[2])
> Row | x1 | x2
> ...

#3

Thanks. You inspired me to look at the source and this seems to work now where df is the original dataframe:

df2 = DataFrame(collect(Missings.replace(df.columns, "\\N")), names(df))

#4

It doesn’t seem to be working properly though

df1 = DataFrame(a=[missing; collect(1:10)], b=[collect(10:-1:1); missing]);
d2 = collect(Missings.replace(df.columns, 0)); ## missing columns
d2 = collect(Missings.replace(df.columns[1], 0)); ## one column
d4= [collect(Missings.replace(a, 0)) for a in df.columns];

#5

Correct. My “solution” did not produce an error message but it did not do the job.

with

[collect(Missings.replace(a, 0)) for a in df.columns]

I now get

MethodError: Cannot `convert` an object of type String to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.

Stacktrace:
 [1] replace(::Array{Union{Int64, Missings.Missing},1}, ::String) at /home/js/.julia/v0.6/Missings/src/Missings.jl:276
 [2] collect_to!(::Array{CategoricalArrays.CategoricalArray{String,1,UInt32,String,CategoricalArrays.CategoricalString{UInt32},Union{}},1}, ::Base.Generator{Array{Any,1},##21#22}, ::Int64, ::Int64) at ./array.jl:508
 [3] collect(::Base.Generator{Array{Any,1},##21#22}) at ./array.jl:476
 [4] include_string(::String, ::String) at ./loading.jl:522

Maybe I should first convert all the values in the dataframe to strings before trying this.
In the end it would be written as strings to an pg_dump file anyhow.
Not that I know how to do it at this stage. But I will find out.

Thanks for thinking with me.


#6

It’s close but still not quite

d = Dict([Union{Int64, Missing}=>0, Union{String, Missing}=>""])
d2 = [collect(Missings.replace(a, d[eltype(a)])) for a in df.columns];
> key union{Int64, Missing} not found 

Passing a full set of replacement missing values (one per column), as in for a,b in df.columns, dfmissingvec = [0, 0, "", "etc"] should be okay though


#7

My problem is that I will not know beforehand what the types of the columns will be.


#8

You can try something like:

for (v, name) in eachcol(df)
    df[name] = collect(Missings.replace(v, "\\N"))
end

collect(Missings.replace(v, "\\N")) can also be coalesce.(v, "\\N") or recode(v, missing => "\\N") (the latter is in CategoricalArrays). But if you don’t know the type of the columns in advance I don’t see how you could choose an appropriate replacement…


#9

Thanks. However that will only work on columns with a String type.

With names(df) I get an array of column headers. How can I use that to create a second dataframe with those names but of the type String for each column? If I can do that, I can possibly copy the non strings values to the second dataframe with string(v)?


#10

I think it v and name should be replaced in the for statement:

for (name, v) in eachcol(df)
    df[name] = collect(Missings.replace(v, "\\N"))
end