How to read a table with missing values using CSV?

Fred · December 6, 2017, 8:03am

Hi,

I read a table where some values are missing using :
f = CSV.read(“data/$file”)

When the missing values is ‘NA’ in the table, I obtain :
ERROR: CSV.ParsingException(“error parsing a Float64 value on column 27, row 65; encountered ‘N’”)

if I try
julia> f = CSV.read(“data/merge.csv”; delim=‘\t’, header=true, null=“NA”)
ERROR: MissingException: encountered a missing value for a non-null column type on row = 65, col = 27

if I remove all the “NA”, I obtain :
julia> f = CSV.read(“data/merge.csv”; delim=‘\t’, header=true, null=“”)
ERROR: MissingException: encountered a missing value for a non-null column type on row = 65, col = 27

So after deprecation of readtable() in DataFrames, I am no longer able to read a table with missing values

Thanks for your help !

nalimilan · December 6, 2017, 8:35am

Yeah, that’s a known problem due to the fact that no missing values appear in the first rows which are used to detect column types. The package will likely be improved soon, but for now you can work around this by specifying the element type of the offending columns manually with the types argument. Of course you can also continue using readtable, it will continue to work until the new major release.

xiaodai · December 6, 2017, 8:35am

I got the same issue so would appreciate some insights. I have been trying to change the rows to look at and also changing the quotechar

Fred · December 6, 2017, 8:39am

Thank you @nalimilan !

It is not so easy

julia> using DataFrames

julia> x= readtable("data/merge.csv", separator = '\t', header=true)
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
 Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] #readtable#199(::Bool, ::Char, ::Array{Char,1}, ::Char, ::Array{String,1}, ::Array{String,1}, ::Array{String,1}, ::Bool, ::Int64, ::Array{Symbol,1}, ::Array{Any,1}, ::Bool, ::Char, ::Bool, ::Int64, ::Array{Int64,1}, ::Bool, ::Symbol, ::Bool, ::Bool, ::DataFrames.#readtable, ::String) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:1045

nalimilan · December 6, 2017, 8:55am

A deprecation warning is not an error, you can ignore it and use the result anyway.

Fred · December 6, 2017, 12:18pm

@nalimilan I did more tests and it is strange because CSV is able to read “some” tables with missing values but not others and DataFrames, in my test, is not able to read the same simple table at all. I copy here my whole session if it can help

  julia> using DataFrames

julia> using CSV

julia> f = CSV.read("data/test.csv"; delim='\t', header=true, null="")
4×4 DataFrames.DataFrame
│ Row │   │ C1      │ C2      │ C3      │
├─────┼───┼─────────┼─────────┼─────────┤
│ 1   │ A │ 1       │ 2       │ 3       │
│ 2   │ B │ 1104    │ missing │ missing │
│ 3   │ C │ missing │ missing │ missing │
│ 4   │ D │ 132     │ missing │ missing │

julia> x = readtable("data/test.csv", separator = '\t', header=true)
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] #readtable#199(::Bool, ::Char, ::Array{Char,1}, ::Char, ::Array{String,1}, ::Array{String,1}, ::Array{String,1}, ::Bool, ::Int64, ::Array{Symbol,1}, ::Array{Any,1}, ::Bool, ::Char, ::Bool, ::Int64, ::Array{Int64,1}, ::Bool, ::Symbol, ::Bool, ::Bool, ::DataFrames.#readtable, ::String) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:1045
 [3] (::DataFrames.#kw##readtable)(::Array{Any,1}, ::DataFrames.#readtable, ::String) at ./<missing>:0
 [4] eval(::Module, ::Any) at ./boot.jl:235
 [5] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66
 [6] macro expansion at ./REPL.jl:97 [inlined]
 [7] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
while loading no file, in expression starting on line 0
ERROR: MethodError: Cannot `convert` an object of type BitArray{1} to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] setindex!(::Array{Union{Int64, Missings.Missing},1}, ::BitArray{1}, ::Int64) at ./array.jl:583
 [2] builddf(::Int64, ::Int64, ::Int64, ::Int64, ::DataFrames.ParsedCSV, ::DataFrames.ParseOptions{String,String}) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:755
 [3] readtable!(::DataFrames.ParsedCSV, ::IOStream, ::Int64, ::DataFrames.ParseOptions{String,String}) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:907
 [4] #readtable#198(::Bool, ::Char, ::Array{Char,1}, ::Char, ::Array{String,1}, ::Array{String,1}, ::Array{String,1}, ::Bool, ::Int64, ::Array{Symbol,1}, ::Array{Any,1}, ::Bool, ::Char, ::Bool, ::Int64, ::Array{Int64,1}, ::Bool, ::Symbol, ::Bool, ::Bool, ::DataFrames.#readtable, ::IOStream, ::Int64) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:966
 [5] (::DataFrames.#kw##readtable)(::Array{Any,1}, ::DataFrames.#readtable, ::IOStream, ::Int64) at ./<missing>:0
 [6] #readtable#199(::Bool, ::Char, ::Array{Char,1}, ::Char, ::Array{String,1}, ::Array{String,1}, ::Array{String,1}, ::Bool, ::Int64, ::Array{Symbol,1}, ::Array{Any,1}, ::Bool, ::Char, ::Bool, ::Int64, ::Array{Int64,1}, ::Bool, ::Symbol, ::Bool, ::Bool, ::DataFrames.#readtable, ::String) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:1065
 [7] (::DataFrames.#kw##readtable)(::Array{Any,1}, ::DataFrames.#readtable, ::String) at ./<missing>:0

julia> x
ERROR: UndefVarError: x not defined

nalimilan · December 6, 2017, 12:53pm

Woops, good catch indeed! Turns out when we ported from null to missing, a name conflict appeared with a BitArray object which was called missing… And since we no longer test readtable (because it’s not supposed to evolve anymore)…

Until the new release, you can use the new version by running Pkg.checkout("DataFrames", "nl/readtable") (use Pkg.free("DataFrames") to go back to standard releases).

Regarding CSV failures, they will happen everytime there is no missing value in the first rows of the file.

davidanthoff · December 17, 2017, 5:37am

You can also try https://github.com/davidanthoff/CSVFiles.jl, it uses a different parser under the hood (GitHub - queryverse/TextParse.jl: A bunch of fast text parsing tools). I don’t know whether that one handles this better, but worth a try.

Fred · February 20, 2018, 7:54am

@davidanthoff

I tried your package “CSVfiles” and I had good results with it

With “CSV” I had to set
rows_for_type_detect=1000
to read my file, whereas with “CSVfiles” I had no error.

With very large tables (> 140 Mo) “CSVfiles” was a little bit slower than “CSV” .

julia> @time myData = CSV.read("AffyST1.csv"; header=true, rows_for_type_detect=1000)
  0.326732 seconds (265.24 k allocations: 8.164 MiB)
257430×1 DataFrames.DataFrame. Omitted printing of 1 columns

julia> @time df = DataFrame(load("AffyST1.csv"))
  0.536061 seconds (516.32 k allocations: 294.887 MiB)
257430×1 DataFrames.DataFrame. Omitted printing of 1 columns

davidanthoff · February 20, 2018, 4:16pm

Cool, thanks! There is a known overhead in my CSVFiles implementation that I hope to get rid of soon, maybe that will close the gap with CSV.jl.

Topic		Replies	Views
Read NULL as missing in dataframe General Usage dataframes , csv	1	972	June 30, 2021
CSV.read stopped working Data	5	1141	March 5, 2018
Importing CSV with missing data Data dataframes	13	4523	April 30, 2018
How to test missing value in CSV package? Data package , data	11	3698	December 7, 2018
Csv and dataframes and missing values New to Julia dataframes , csv	9	2182	June 12, 2021

How to read a table with missing values using CSV?

Related topics