Importing CSV with missing data

pmarg · April 28, 2018, 10:42am

Hello everyone,

I’m having troubles with reading a CSV with missing data. If a single column has missing values then all the imported columns are of type Missing.

For example importing df.csv:

"X1","X2"
1, 3
2, 4
, 5

with

df=CSV.read("df.csv")

all the columns are of type

Union{Int64, Missings.Missing}

Instead, if I create the DataFrame myself with:

df1 = DataFrame(X1 = [1, 2, missing], X2 = [3, 4, 5])

only X1 is Union{Int64, Missings.Missing}.

My issue is that I want to use some functions that do not accept Missing types on columns that do not have missing values. I haven’t figured out a way to convert specific columns to a different type.

nalimilan · April 28, 2018, 12:20pm

Yes, columns always allow for missing values by default since the last release, as people kept complaining about errors due to missing values appearing after type detection. Pass allowmissing=:auto or allowmissing=:none to use a different behavior.

pmarg · April 28, 2018, 6:46pm

Thank you, that worked for the example file but when I try to load the actual CSV file I get the following error

ERROR: LoadError: BoundsError: attempt to access 0-element Array{UInt8,1} at index [1]
Stacktrace:
 [1] parsefield(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Type{Int64}, ::CSV.Options{DateFormat{Symbol("yyyy-mm-dd\\THH:MM:SS.s"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'H'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'M'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'S'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'s'}}}}, ::Int64, ::Int64, ::Base.RefValue{CSV.ParsingState}, ::CSV.##4#5) at C:\Users\panag\.julia\v0.6\CSV\src\parsefields.jl:81
 [2] macro expansion at C:\Users\panag\.julia\v0.6\DataStreams\src\DataStreams.jl:542 [inlined]
 [3] stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},DateFormat{Symbol("yyyy-mm-dd\\THH:MM:SS.s"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'H'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'M'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'S'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'s'}}}}, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrameStream{Tuple{Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Missings.Missing,1},Array{Float64,1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Float64,1}}}, ::DataStreams.Data.Schema{true,Tuple{Int64,Int64,Int64,Int64,Int64,Int64,Int64,Missings.Missing,Missings.Missing,Int64,Int64,Int64,Int64,Missings.Missing,Missings.Missing,Missings.Missing,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Missings.Missing,Missings.Missing,Missings.Missing,Missings.Missing,Missings.Missing,Missings.Missing,Missings.Missing,Float64,Int64,Int64,Int64,Float64}}, ::Int64, ::NTuple{35,Base.#identity}, ::DataStreams.Data.##15#16, ::Array{Any,1}, ::Type{Ref{(:YEAR, :SERIAL, :STRATA, :PSU, :NHISHID, :HHWEIGHT, :NHISPID, :HHX, :FMX, :PX, :PERNUM, :PERWEIGHT, :SAMPWEIGHT, :FWEIGHT, :ASTATFLG, :CSTATFLG, :AGE, :SEX, :EDUCREC2, :OCC, :OCC1995, :IND, :IND1995, :HOURSWRK, :OCCUPN104, :OCCUPN204, :INDSTRN104, :INDSTRN204, :POORYN, :INCFAM97ON2, :CPI2009, :HEALTH, :HEIGHT, :WEIGHT, :BMICALC)}}) at C:\Users\panag\.julia\v0.6\DataStreams\src\DataStreams.jl:614
 [4] #stream!#17(::Bool, ::Dict{Int64,Function}, ::Function, ::Array{Any,1}, ::Array{Any,1}, ::Function, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},DateFormat{Symbol("yyyy-mm-dd\\THH:MM:SS.s"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'H'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'M'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'S'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'s'}}}}, ::Type{DataFrames.DataFrame}) at C:\Users\panag\.julia\v0.6\DataStreams\src\DataStreams.jl:490
 [5] (::DataStreams.Data.#kw##stream!)(::Array{Any,1}, ::DataStreams.Data.#stream!, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},DateFormat{Symbol("yyyy-mm-dd\\THH:MM:SS.s"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'H'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'M'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'S'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'s'}}}}, ::Type{DataFrames.DataFrame}) at .\<missing>:0
 [6] #read#43(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at C:\Users\panag\.julia\v0.6\CSV\src\Source.jl:339
 [7] (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{T} where T) at .\<missing>:0 (repeats 2 times)
 [8] include_from_node1(::String) at .\loading.jl:576
 [9] include(::String) at .\sysimg.jl:14

The CSV is not corrupted because it loads properly with readtable(). Is there a similar option for readtable()?

nalimilan · April 28, 2018, 8:07pm

Hard to tell without a file to reproduce the problem locally unfortunately…

pmarg · April 29, 2018, 7:36am

Is there a way to safely send you the csv? It’s approximately 500MB.

nalimilan · April 29, 2018, 10:05am

Can you identify a subset of the rows which is enough to reproduce the problem? Then just send it by e-mail.

DoktorMike · April 29, 2018, 11:49am

Wetransfer is good for this. Up to 2 GB transfer without registration. However, if the data is very sensitive then it might not be the best way since you don’t know what wetransfer will do with the data.

nalimilan · April 29, 2018, 3:22pm

I’ve been able to create a minimal reproducer using @pmarg’s file. See https://github.com/JuliaData/CSV.jl/issues/200. It’s due to the appearance of letters in a column which only contains numbers up to row 18000. Possible workarounds include passing types=Dict("px"=>String) or rows_for_type_detect=20_000.

Tamas_Papp · April 30, 2018, 5:39am

I wonder if reading all rows for type detection would be a reasonable default. It would of course be somewhat slower, but would also avoid error messages like this. Experienced users with large datasets would just turn it off.

pmarg · April 30, 2018, 6:49am

I think a more informative error that is going to be implemented is a step in the right direction because when I tried later to load the same files with JuliaDB the error stated the problem with more clarity. I could see the exact line and the value of the String so I could replace the value.

The problem is that I had to use Stata to do that because if the files don’t load then we cannot clean them. I think a warning is more appropriate so that users are aware that there is an issue and they can do something about it.

nalimilan · April 30, 2018, 8:07am

Ideally we would start again automatically with a wider type when encountering an error. Shouldn’t be too hard to implement, but maybe until then it would indeed make sense to use rows_for_type_detect=typemax(Int) by default. Feel free to make a PR to prompt a discussion.

piever · April 30, 2018, 9:42am

It seems like it would make sense to use something like https://github.com/JuliaComputing/IndexedTables.jl/blob/master/src/collect.jl#L39 that collects an iterable of tuples as a tuple of arrays widening the type where needed. I’m trying to figure out what’s the best code reorganization for all data packages to use this without depending on IndexedTables (which is somewhat heavyweight). My initial attempt has been transferring all the “vector of tuples into tuple of vectors” kind of thing to the dependency free StructArrays (still wip and julia 0.7 only).

Also, the same could be done for missing data (start strict, widen where needed): this should work easily with promote_typejoin on Julia 0.7.

nalimilan · April 30, 2018, 10:13am

I’ve thought about that, but unfortunately I don’t think that would really work here. Consider a column which contains 00 for all first 1000 rows, but then contains A. The first values will have been parsed as Int[0, ..., 0] but after encountering A we would have to convert it to a Vector{String}, which isn’t possible without losing some information (e.g. we would get "0" instead of "00"). So AFAICT we have no choice but reparsing the column from scratch.

OTOH this strategy would work for missing data.

davidanthoff · April 30, 2018, 5:10pm

I think TextParse.jl handles all of this gracefully by default, automatically widens a column type if necessary etc.? If so (I haven’t double checked!), then CSVFiles.jl already exposes that via an easy API to interact with table types (in fact all iterable tables). Probably worth trying that on the dataset.

Topic		Replies	Views
CSV.read stopped working Data	5	1141	March 5, 2018
How to read a table with missing values using CSV? Data package , data	9	7381	February 20, 2018
Loading .csv into a DataFrame with missing insertion and column-type correction (one-linerish) Data dataframes , csv	1	437	April 27, 2021
A question about data type General Usage question	1	338	June 11, 2020
CSV problem reading empty string column Data	2	2398	December 13, 2017

Importing CSV with missing data

Related topics