Importing CSV with missing data

dataframes

#1

Hello everyone,

I’m having troubles with reading a CSV with missing data. If a single column has missing values then all the imported columns are of type Missing.

For example importing df.csv:

"X1","X2"
1, 3
2, 4
, 5

with

df=CSV.read("df.csv")

all the columns are of type

Union{Int64, Missings.Missing}

Instead, if I create the DataFrame myself with:

df1 = DataFrame(X1 = [1, 2, missing], X2 = [3, 4, 5])

only X1 is Union{Int64, Missings.Missing}.

My issue is that I want to use some functions that do not accept Missing types on columns that do not have missing values. I haven’t figured out a way to convert specific columns to a different type.


#2

Yes, columns always allow for missing values by default since the last release, as people kept complaining about errors due to missing values appearing after type detection. Pass allowmissing=:auto or allowmissing=:none to use a different behavior.


#3

Thank you, that worked for the example file but when I try to load the actual CSV file I get the following error

ERROR: LoadError: BoundsError: attempt to access 0-element Array{UInt8,1} at index [1]
Stacktrace:
 [1] parsefield(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Type{Int64}, ::CSV.Options{DateFormat{Symbol("yyyy-mm-dd\\THH:MM:SS.s"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'H'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'M'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'S'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'s'}}}}, ::Int64, ::Int64, ::Base.RefValue{CSV.ParsingState}, ::CSV.##4#5) at C:\Users\panag\.julia\v0.6\CSV\src\parsefields.jl:81
 [2] macro expansion at C:\Users\panag\.julia\v0.6\DataStreams\src\DataStreams.jl:542 [inlined]
 [3] stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},DateFormat{Symbol("yyyy-mm-dd\\THH:MM:SS.s"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'H'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'M'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'S'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'s'}}}}, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrameStream{Tuple{Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Float64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Float64,1}}}, ::DataStreams.Data.Schema{true,Tuple{Int64,Int64,Int64,Int64,Int64,Int64,Int64,Missings.Missing,Missings.Missing,Int64,Int64,Int64,Int64,Missings.Missing,Missings.Missing,Missings.Missing,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Missings.Missing,Missings.Missing,Missings.Missing,Missings.Missing,Missings.Missing,Missings.Missing,Missings.Missing,Float64,Int64,Int64,Int64,Float64}}, ::Int64, ::NTuple{35,Base.#identity}, ::DataStreams.Data.##15#16, ::Array{Any,1}, ::Type{Ref{(:YEAR, :SERIAL, :STRATA, :PSU, :NHISHID, :HHWEIGHT, :NHISPID, :HHX, :FMX, :PX, :PERNUM, :PERWEIGHT, :SAMPWEIGHT, :FWEIGHT, :ASTATFLG, :CSTATFLG, :AGE, :SEX, :EDUCREC2, :OCC, :OCC1995, :IND, :IND1995, :HOURSWRK, :OCCUPN104, :OCCUPN204, :INDSTRN104, :INDSTRN204, :POORYN, :INCFAM97ON2, :CPI2009, :HEALTH, :HEIGHT, :WEIGHT, :BMICALC)}}) at C:\Users\panag\.julia\v0.6\DataStreams\src\DataStreams.jl:614
 [4] #stream!#17(::Bool, ::Dict{Int64,Function}, ::Function, ::Array{Any,1}, ::Array{Any,1}, ::Function, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},DateFormat{Symbol("yyyy-mm-dd\\THH:MM:SS.s"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'H'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'M'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'S'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'s'}}}}, ::Type{DataFrames.DataFrame}) at C:\Users\panag\.julia\v0.6\DataStreams\src\DataStreams.jl:490
 [5] (::DataStreams.Data.#kw##stream!)(::Array{Any,1}, ::DataStreams.Data.#stream!, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},DateFormat{Symbol("yyyy-mm-dd\\THH:MM:SS.s"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'H'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'M'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'S'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'s'}}}}, ::Type{DataFrames.DataFrame}) at .\<missing>:0
 [6] #read#43(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at C:\Users\panag\.julia\v0.6\CSV\src\Source.jl:339
 [7] (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{T} where T) at .\<missing>:0 (repeats 2 times)
 [8] include_from_node1(::String) at .\loading.jl:576
 [9] include(::String) at .\sysimg.jl:14

The CSV is not corrupted because it loads properly with readtable(). Is there a similar option for readtable()?


#4

Hard to tell without a file to reproduce the problem locally unfortunately…


#5

Is there a way to safely send you the csv? It’s approximately 500MB.


#6

Can you identify a subset of the rows which is enough to reproduce the problem? Then just send it by e-mail.


#7

Wetransfer is good for this. Up to 2 GB transfer without registration. However, if the data is very sensitive then it might not be the best way since you don’t know what wetransfer will do with the data. :blush:


#8

I’ve been able to create a minimal reproducer using @pmarg’s file. See https://github.com/JuliaData/CSV.jl/issues/200. It’s due to the appearance of letters in a column which only contains numbers up to row 18000. Possible workarounds include passing types=Dict("px"=>String) or rows_for_type_detect=20_000.


#9

I wonder if reading all rows for type detection would be a reasonable default. It would of course be somewhat slower, but would also avoid error messages like this. Experienced users with large datasets would just turn it off.


#10

I think a more informative error that is going to be implemented is a step in the right direction because when I tried later to load the same files with JuliaDB the error stated the problem with more clarity. I could see the exact line and the value of the String so I could replace the value.

The problem is that I had to use Stata to do that because if the files don’t load then we cannot clean them. I think a warning is more appropriate so that users are aware that there is an issue and they can do something about it.


#11

Ideally we would start again automatically with a wider type when encountering an error. Shouldn’t be too hard to implement, but maybe until then it would indeed make sense to use rows_for_type_detect=typemax(Int) by default. Feel free to make a PR to prompt a discussion.


#12

It seems like it would make sense to use something like https://github.com/JuliaComputing/IndexedTables.jl/blob/master/src/collect.jl#L39 that collects an iterable of tuples as a tuple of arrays widening the type where needed. I’m trying to figure out what’s the best code reorganization for all data packages to use this without depending on IndexedTables (which is somewhat heavyweight). My initial attempt has been transferring all the “vector of tuples into tuple of vectors” kind of thing to the dependency free StructArrays (still wip and julia 0.7 only).

Also, the same could be done for missing data (start strict, widen where needed): this should work easily with promote_typejoin on Julia 0.7.


#13

I’ve thought about that, but unfortunately I don’t think that would really work here. Consider a column which contains 00 for all first 1000 rows, but then contains A. The first values will have been parsed as Int[0, ..., 0] but after encountering A we would have to convert it to a Vector{String}, which isn’t possible without losing some information (e.g. we would get "0" instead of "00"). So AFAICT we have no choice but reparsing the column from scratch.

OTOH this strategy would work for missing data.


#14

I think TextParse.jl handles all of this gracefully by default, automatically widens a column type if necessary etc.? If so (I haven’t double checked!), then CSVFiles.jl already exposes that via an easy API to interact with table types (in fact all iterable tables). Probably worth trying that on the dataset.