CSV misread

How do I make it recognize the quotes?

shell> cat > test.csv

julia> CSV.read("test.csv")
2×6 DataFrames.DataFrame
│ Row │ a       │ b   │ c     │ d      │ e   │ f          │
│ 1   │ 1863001 │ 134 │ 10000 │ 1.0009 │ 1.0 │ -0.0020339 │
│ 2   │ 1863209 │ 137 │ 0     │ 1.0    │ 2.0 │ 773.9      │

CSVFiles.jl gets it right:

julia> using FileIO, CSVFiles, DataFrames

julia> load("test.csv") |> DataFrame
2×6 DataFrames.DataFrame
│ Row │ a       │ b   │ c     │ d      │ e          │ f          │
│ 1   │ 1863001 │ 134 │ 10000 │ 1.0009 │ 1.0000     │ -0.0020339 │
│ 2   │ 1863209 │ 137 │ 0     │ 1.0    │ 2,773.9000 │ missing    │

File an issue on CSV.jl?

Actually, CSVFiles also breaks when I have more rows… Try this:

shell> cat > test3.csv

julia> DataFrame(load("test3.csv"))
MethodError: Cannot `convert` an object of type Float64 to an object of type TextParse.StrRange
This may have arisen from a call to the constructor TextParse.StrRange(...),
since type constructors fall back to convert methods.
ERROR: CSV parsing error in test3.csv at line 24 char 21:
column 5 is expected to be: TextParse.Field{Float64,TextParse.Numeric{Float64}}(<Float64>, true, true, false)
 [1] copy!(::Array{TextParse.StrRange,1}, ::Int64, ::Array{Float64,1}, ::Int64, ::Int64) at ./abstractarray.jl:691
 [2] promote_column(::Array{Float64,1}, ::Int64, ::Type{T} where T, ::Bool) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:460
 [3] promote_field(::String, ::TextParse.Field{Float64,TextParse.Numeric{Float64}}, ::Array{Float64,1}, ::TextParse.CSVParseError, ::Array{String,1}) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:425
 [4] (::TextParse.##39#43{DataStructures.OrderedDict{Union{Int64, String},AbstractArray{T,1} where T}})(::String, ::Int64) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:337
 [5] collect(::Base.Generator{Base.Iterators.Zip2{Array{String,1},UnitRange{Int64}},Base.##3#4{TextParse.##39#43{DataStructures.OrderedDict{Union{Int64, String},AbstractArray{T,1} where T}}}}) at ./array.jl:475
 [6] #_csvread_internal#35(::Bool, ::Char, ::Char, ::Bool, ::Bool, ::Int64, ::Void, ::Int64, ::Void, ::Bool, ::Array{String,1}, ::Array{String,1}, ::DataStructures.OrderedDict{Union{Int64, String},AbstractArray{T,1} where T}, ::Int64, ::Void, ::Array{Any,1}, ::String, ::Int64, ::TextParse.#_csvread_internal, ::String, ::Char) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:333
 [7] (::TextParse.#kw##_csvread_internal)(::Array{Any,1}, ::TextParse.#_csvread_internal, ::String, ::Char) at ./<missing>:0
 [8] (::TextParse.##31#33{Array{Any,1},String,Char})(::IOStream) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:97
 [9] open(::TextParse.##31#33{Array{Any,1},String,Char}, ::String, ::String) at ./iostream.jl:152
 [10] #_csvread_f#29(::Array{Any,1}, ::Function, ::String, ::Char) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:95
 [11] #csvread#25(::Array{Any,1}, ::Function, ::String, ::Char) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:69
 [12] getiterator(::CSVFiles.CSVFile) at /opt/julia/share/julia/site/v0.6/CSVFiles/src/CSVFiles.jl:49
 [13] _DataFrame(::CSVFiles.CSVFile) at /opt/julia/share/julia/site/v0.6/IterableTables/src/integrations/dataframes-missing.jl:100
 [14] DataFrames.DataFrame(::CSVFiles.CSVFile) at /opt/julia/share/julia/site/v0.6/IterableTables/src/integrations/dataframes-missing.jl:129

Good idea 🙂

You can tell it to use more than the default 20 rows when figuring out the column types, with the `type_detect_rows=24` option. That is not ideal; it would be nicer if it simply promoted the columns. I opened an issue for that.
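To illustrate why the size of the detection window matters, here is a toy model in plain Julia (assuming Julia 1.x, where `tryparse` returns `nothing` on failure). It is only a sketch of the idea, not TextParse.jl's actual algorithm; `guess_type` and the sample data are made up for the example.

```julia
# Toy model of window-based type detection (NOT TextParse.jl's real code).
# The column is guessed to be Float64 if every sampled field parses as one,
# otherwise it falls back to String.
function guess_type(fields::Vector{String}, detect_rows::Int)
    sample = fields[1:min(detect_rows, length(fields))]
    all(f -> tryparse(Float64, f) !== nothing, sample) ? Float64 : String
end

# 22 plain floats, then one quoted value with a thousands separator,
# placed past a window of 20 rows.
fields = vcat(fill("1.0000", 22), ["\"2,773.9000\""])

guess_type(fields, 20)  # the odd field is outside the window
guess_type(fields, 24)  # the window now reaches the odd field
```

With a window of 20 the guess is `Float64`, so a later non-matching field can only surface as a parse error; with a window of 24 the guess is `String`.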

Thanks David. Do you think it would cause any slowdown? The file above is a dummy file for testing; in the real one, the problematic value first shows up at around line 150,000.

I don’t understand the promotion part. The column should have been parsed as Float64. While the subsequent lines are quoted & localized, they are still Float64. Either way, it should be just parsing a string into a Float64…

Ah, you want something different from what row detection gives you! You want the column to be parsed as a Float64, whereas adding more lines to the type detection will just parse the whole column as a String.

So I guess the real question here is whether a float that is surrounded by quotes should be read as a Float64 or as a String… Or rather, whether the parser that handles Float64 columns could be made tolerant enough to also accept quoted floats. I’m not sure it should, actually… Does anyone know how CSV parsers on other platforms handle that?
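For concreteness, a "tolerant" float parser could look roughly like the sketch below. It uses only Base Julia (assuming Julia 1.x); `parse_lenient_float` is a hypothetical helper, not part of CSV.jl, CSVFiles.jl or TextParse.jl, and it assumes that `,` is a thousands separator and `.` the decimal mark.

```julia
# Hypothetical helper: parse a field as Float64 while tolerating surrounding
# double quotes and comma thousands separators.
# Returns `nothing` when the field is not a number even after cleanup.
function parse_lenient_float(field::AbstractString)
    unquoted = strip(field, '"')                 # drop leading/trailing quotes
    tryparse(Float64, replace(unquoted, "," => ""))  # drop separators, then parse
end

parse_lenient_float("1.0000")          # plain float
parse_lenient_float("\"2,773.9000\"")  # quoted, with thousands separator
parse_lenient_float("\"n/a\"")         # not a number
```

The catch is the locale assumption: in data where `,` is the decimal mark, stripping it would silently produce wrong numbers, which is one argument for making this behavior opt-in.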

Here’s Pandas in Python3:

>>> pandas.read_csv("foo.csv")
         a    b      c       d           e         f
0  1863001  134  10000  1.0009      1.0000 -0.002034
1  1863209  137      0  1.0000  2,773.9000       NaN

Thanks! It is not entirely clear to me whether column e is float or string, though?

It’s parsed as a string:

>>> t["e"][1]

Actually, that whole column has dtype=object and the first row has also become a string:

>>> t["e"][0]

Ok, thanks.

@tk3369 you can get the same behavior as pandas either by specifying that more rows should be used for type detection (that will incur some overhead), or by specifying the type to be used for the e column up front (I forgot the syntax for that right now, but the TextParse.jl doc should describe it). I actually think TextParse.jl should just widen the column type to String automatically in this situation, but that is not how TextParse.jl works right now (I did open an issue there on that).
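As a rough picture of what "widening" would mean, here is a toy sketch in Base Julia (assuming Julia 1.x). It is not how TextParse.jl behaves today, and `read_column` is a made-up name; the fields are assumed to have had their quotes removed already.

```julia
# Toy sketch of "widen on failure" (NOT TextParse.jl's current behavior).
# Try to read every field as Float64; at the first field that does not parse,
# give up and return the whole column, earlier rows included, as strings.
function read_column(fields::Vector{String})
    parsed = Float64[]
    for f in fields
        x = tryparse(Float64, f)
        x === nothing && return copy(fields)  # widen: keep everything as String
        push!(parsed, x)
    end
    parsed
end

read_column(["1.0000", "2.5"])         # every field parses: Float64 column
read_column(["1.0000", "2,773.9000"])  # one field fails: String column
```

In the second call the first row comes back as the string "1.0000", which matches the pandas result above, where the first row of column e also became a string.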

I do think that the column should be String whenever at least one row has a quoted value. But one could probably think about the following: if someone explicitly specifies the column as `Float64`, the parser could become a bit more forgiving and also handle floats that are in quotes…

Thanks, all. I have since worked around the problem by getting the data in a different form, so I no longer have to parse these malformed CSV files.