How to read a table with missing values using CSV?

Hi,

I read a table where some values are missing using :
f = CSV.read(“data/$file”)

When the missing values is ‘NA’ in the table, I obtain :
ERROR: CSV.ParsingException(“error parsing a Float64 value on column 27, row 65; encountered ‘N’”)

if I try
julia> f = CSV.read(“data/merge.csv”; delim=‘\t’, header=true, null=“NA”)
ERROR: MissingException: encountered a missing value for a non-null column type on row = 65, col = 27

if I remove all the “NA”, I obtain :
julia> f = CSV.read(“data/merge.csv”; delim=‘\t’, header=true, null=“”)
ERROR: MissingException: encountered a missing value for a non-null column type on row = 65, col = 27

So after deprecation of readtable() in DataFrames, I am no longer able to read a table with missing values :roll_eyes:

Thanks for your help !

1 Like

Yeah, that’s a known problem due to the fact that no missing values appear in the first rows which are used to detect column types. The package will likely be improved soon, but for now you can work around this by specifying the element type of the offending columns manually with the types argument. Of course you can also continue using readtable, it will continue to work until the new major release.

I got the same issue so would appreciate some insights. I have been trying to change the rows to look at and also changing the quotechar

Thank you @nalimilan !

It is not so easy :wink:

julia> using DataFrames

julia> x= readtable("data/merge.csv", separator = '\t', header=true)
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
 Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] #readtable#199(::Bool, ::Char, ::Array{Char,1}, ::Char, ::Array{String,1}, ::Array{String,1}, ::Array{String,1}, ::Bool, ::Int64, ::Array{Symbol,1}, ::Array{Any,1}, ::Bool, ::Char, ::Bool, ::Int64, ::Array{Int64,1}, ::Bool, ::Symbol, ::Bool, ::Bool, ::DataFrames.#readtable, ::String) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:1045

A deprecation warning is not an error, you can ignore it and use the result anyway.

@nalimilan I did more tests and it is strange because CSV is able to read “some” tables with missing values but not others and DataFrames, in my test, is not able to read the same simple table at all. I copy here my whole session if it can help

  julia> using DataFrames

julia> using CSV

julia> f = CSV.read("data/test.csv"; delim='\t', header=true, null="")
4×4 DataFrames.DataFrame
│ Row │   │ C1      │ C2      │ C3      │
├─────┼───┼─────────┼─────────┼─────────┤
│ 1   │ A │ 1       │ 2       │ 3       │
│ 2   │ B │ 1104    │ missing │ missing │
│ 3   │ C │ missing │ missing │ missing │
│ 4   │ D │ 132     │ missing │ missing │

julia> x = readtable("data/test.csv", separator = '\t', header=true)
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] #readtable#199(::Bool, ::Char, ::Array{Char,1}, ::Char, ::Array{String,1}, ::Array{String,1}, ::Array{String,1}, ::Bool, ::Int64, ::Array{Symbol,1}, ::Array{Any,1}, ::Bool, ::Char, ::Bool, ::Int64, ::Array{Int64,1}, ::Bool, ::Symbol, ::Bool, ::Bool, ::DataFrames.#readtable, ::String) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:1045
 [3] (::DataFrames.#kw##readtable)(::Array{Any,1}, ::DataFrames.#readtable, ::String) at ./<missing>:0
 [4] eval(::Module, ::Any) at ./boot.jl:235
 [5] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66
 [6] macro expansion at ./REPL.jl:97 [inlined]
 [7] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
while loading no file, in expression starting on line 0
ERROR: MethodError: Cannot `convert` an object of type BitArray{1} to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] setindex!(::Array{Union{Int64, Missings.Missing},1}, ::BitArray{1}, ::Int64) at ./array.jl:583
 [2] builddf(::Int64, ::Int64, ::Int64, ::Int64, ::DataFrames.ParsedCSV, ::DataFrames.ParseOptions{String,String}) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:755
 [3] readtable!(::DataFrames.ParsedCSV, ::IOStream, ::Int64, ::DataFrames.ParseOptions{String,String}) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:907
 [4] #readtable#198(::Bool, ::Char, ::Array{Char,1}, ::Char, ::Array{String,1}, ::Array{String,1}, ::Array{String,1}, ::Bool, ::Int64, ::Array{Symbol,1}, ::Array{Any,1}, ::Bool, ::Char, ::Bool, ::Int64, ::Array{Int64,1}, ::Bool, ::Symbol, ::Bool, ::Bool, ::DataFrames.#readtable, ::IOStream, ::Int64) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:966
 [5] (::DataFrames.#kw##readtable)(::Array{Any,1}, ::DataFrames.#readtable, ::IOStream, ::Int64) at ./<missing>:0
 [6] #readtable#199(::Bool, ::Char, ::Array{Char,1}, ::Char, ::Array{String,1}, ::Array{String,1}, ::Array{String,1}, ::Bool, ::Int64, ::Array{Symbol,1}, ::Array{Any,1}, ::Bool, ::Char, ::Bool, ::Int64, ::Array{Int64,1}, ::Bool, ::Symbol, ::Bool, ::Bool, ::DataFrames.#readtable, ::String) at /home/fred/.julia/v0.6/DataFrames/src/deprecated.jl:1065
 [7] (::DataFrames.#kw##readtable)(::Array{Any,1}, ::DataFrames.#readtable, ::String) at ./<missing>:0

julia> x
ERROR: UndefVarError: x not defined
1 Like

Woops, good catch indeed! Turns out when we ported from null to missing, a name conflict appeared with a BitArray object which was called missing… And since we no longer test readtable (because it’s not supposed to evolve anymore)…

Until the new release, you can use the new version by running Pkg.checkout("DataFrames", "nl/readtable") (use Pkg.free("DataFrames") to go back to standard releases).

Regarding CSV failures, they will happen everytime there is no missing value in the first rows of the file.

You can also try https://github.com/davidanthoff/CSVFiles.jl, it uses a different parser under the hood (GitHub - queryverse/TextParse.jl: A bunch of fast text parsing tools). I don’t know whether that one handles this better, but worth a try.

@davidanthoff

I tried your package “CSVfiles” and I had good results with it :wink:

With “CSV” I had to set
rows_for_type_detect=1000
to read my file, whereas with “CSVfiles” I had no error.

With very large tables (> 140 Mo) “CSVfiles” was a little bit slower than “CSV” .

julia> @time myData = CSV.read("AffyST1.csv"; header=true, rows_for_type_detect=1000)
  0.326732 seconds (265.24 k allocations: 8.164 MiB)
257430×1 DataFrames.DataFrame. Omitted printing of 1 columns

julia> @time df = DataFrame(load("AffyST1.csv"))
  0.536061 seconds (516.32 k allocations: 294.887 MiB)
257430×1 DataFrames.DataFrame. Omitted printing of 1 columns

Cool, thanks! There is a known overhead in my CSVFiles implementation that I hope to get rid of soon, maybe that will close the gap with CSV.jl.