CSV misread


#1

How do I make it recognize the quotes?

shell> cat > test.csv
a,b,c,d,e,f
1863001,134,10000,1.0009,1.0000,-0.002033899
1863209,137,0,1.0000,"2,773.9000",

julia> CSV.read("test.csv")
2×6 DataFrames.DataFrame
│ Row │ a       │ b   │ c     │ d      │ e   │ f          │
├─────┼─────────┼─────┼───────┼────────┼─────┼────────────┤
│ 1   │ 1863001 │ 134 │ 10000 │ 1.0009 │ 1.0 │ -0.0020339 │
│ 2   │ 1863209 │ 137 │ 0     │ 1.0    │ 2.0 │ 773.9      │


#2

CSVFiles.jl gets it right:

julia> using FileIO, CSVFiles, DataFrames

julia> load("test.csv") |> DataFrame
2×6 DataFrames.DataFrame
│ Row │ a       │ b   │ c     │ d      │ e          │ f          │
├─────┼─────────┼─────┼───────┼────────┼────────────┼────────────┤
│ 1   │ 1863001 │ 134 │ 10000 │ 1.0009 │ 1.0000     │ -0.0020339 │
│ 2   │ 1863209 │ 137 │ 0     │ 1.0    │ 2,773.9000 │ missing    │

#3

File an issue on CSV.jl?


#4

Actually, CSVFiles also breaks when I have more rows… Try this:

shell> cat > test3.csv
a,b,c,d,e,f
1863001,134,9981,1.0019,1.0000,0.0246182823
1863001,134,9982,1.0046,1.0000,0.0057100506
1863001,134,9983,1.0186,1.0000,-0.007331108
1863001,134,9984,1.0069,1.0000,0.0052499037
1863001,134,9985,0.9994,1.0000,0.0017077061
1863001,134,9986,1.0009,1.0000,0.0147836641
1863001,134,9987,0.9968,1.0000,-0.004429422
1863001,134,9988,1.0006,1.0000,0.0047920628
1863001,134,9989,1.0143,1.0000,-0.009376122
1863001,134,9990,1.0268,1.0000,-0.002704543
1863001,134,9991,1.0124,1.0000,-0.005628563
1863001,134,9992,1.0053,1.0000,0.0098183902
1863001,134,9993,1.0026,1.0000,0.0098087401
1863001,134,9994,0.9935,1.0000,0.0029873303
1863001,134,9995,1.0112,1.0000,0.0021465219
1863001,134,9996,0.9926,1.0000,-0.006265291
1863001,134,9997,0.9968,1.0000,0.0061212349
1863001,134,9998,0.9999,1.0000,-0.00676656
1863001,134,9999,1.0012,1.0000,0.0010823933
1863001,134,10000,1.0009,1.0000,-0.002033899
1863209,137,0,1.0000,"2,773.9000",

julia> DataFrame(load("test3.csv"))
MethodError: Cannot `convert` an object of type Float64 to an object of type TextParse.StrRange
This may have arisen from a call to the constructor TextParse.StrRange(...),
since type constructors fall back to convert methods.ERROR: CSV parsing error in test3.csv at line 24 char 21:
1863209,137,0,1.0000,"2,773.9000",
____________________^
column 5 is expected to be: TextParse.Field{Float64,TextParse.Numeric{Float64}}(<Float64>, true, true, false)
Stacktrace:
 [1] copy!(::Array{TextParse.StrRange,1}, ::Int64, ::Array{Float64,1}, ::Int64, ::Int64) at ./abstractarray.jl:691
 [2] promote_column(::Array{Float64,1}, ::Int64, ::Type{T} where T, ::Bool) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:460
 [3] promote_field(::String, ::TextParse.Field{Float64,TextParse.Numeric{Float64}}, ::Array{Float64,1}, ::TextParse.CSVParseError, ::Array{String,1}) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:425
 [4] (::TextParse.##39#43{DataStructures.OrderedDict{Union{Int64, String},AbstractArray{T,1} where T}})(::String, ::Int64) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:337
 [5] collect(::Base.Generator{Base.Iterators.Zip2{Array{String,1},UnitRange{Int64}},Base.##3#4{TextParse.##39#43{DataStructures.OrderedDict{Union{Int64, String},AbstractArray{T,1} where T}}}}) at ./array.jl:475
 [6] #_csvread_internal#35(::Bool, ::Char, ::Char, ::Bool, ::Bool, ::Int64, ::Void, ::Int64, ::Void, ::Bool, ::Array{String,1}, ::Array{String,1}, ::DataStructures.OrderedDict{Union{Int64, String},AbstractArray{T,1} where T}, ::Int64, ::Void, ::Array{Any,1}, ::String, ::Int64, ::TextParse.#_csvread_internal, ::String, ::Char) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:333
 [7] (::TextParse.#kw##_csvread_internal)(::Array{Any,1}, ::TextParse.#_csvread_internal, ::String, ::Char) at ./<missing>:0
 [8] (::TextParse.##31#33{Array{Any,1},String,Char})(::IOStream) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:97
 [9] open(::TextParse.##31#33{Array{Any,1},String,Char}, ::String, ::String) at ./iostream.jl:152
 [10] #_csvread_f#29(::Array{Any,1}, ::Function, ::String, ::Char) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:95
 [11] #csvread#25(::Array{Any,1}, ::Function, ::String, ::Char) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:69
 [12] getiterator(::CSVFiles.CSVFile) at /opt/julia/share/julia/site/v0.6/CSVFiles/src/CSVFiles.jl:49
 [13] _DataFrame(::CSVFiles.CSVFile) at /opt/julia/share/julia/site/v0.6/IterableTables/src/integrations/dataframes-missing.jl:100
 [14] DataFrames.DataFrame(::CSVFiles.CSVFile) at /opt/julia/share/julia/site/v0.6/IterableTables/src/integrations/dataframes-missing.jl:129


#5

good idea :slight_smile:


#6

You can specify that it should use more than the default 20 rows for figuring out the types of the columns with the type_detect_rows=24 option. Not ideal, it would be nicer if it actually just promoted the columns, I opened an issue for that.


#7

Thanks David. Do you think it would cause any slowdown? The file above is a dummy file for testing and the real one comes about at around 150,000th line.

I don’t understand the promotion part. The column should have been parsed as Float64. While the subsequent lines are quoted & localized, they are still Float64. Either way, it should be just parsing a string into a Float64…


#8

Ah, you want something different than what the row detect gives you! You want the column to be parsed as a Float64, whereas adding more lines to the type detection will just parse the whole column as a String.

So I guess the real question here is whether a float that is surrounded by quotes should be read as a Float64 or as a String… Or rather, whether the parser that handles Float64 columns could be made tolerant enough to also accept quoted floats. I’m not sure it should, actually… Does anyone know how CSV parsers on other platforms handle that?


#9

Here’s Pandas in Python3:

>>> pandas.read_csv("foo.csv")
         a    b      c       d           e         f
0  1863001  134  10000  1.0009      1.0000 -0.002034
1  1863209  137      0  1.0000  2,773.9000       NaN

#10

Thanks! It is not entirely clear to me whether column e is float or string, though?


#11

It’s parsed as a string:

>>> t["e"][1]
'2,773.9000'

#12

Actually, that whole column has dtype=object and the first row has also become a string:

>>> t["e"][0]
'1.0000'

#13

Ok, thanks.

@tk3369 you can get the same behavior as pandas by either specifying that more rows should be used for type detection (that will occur overhead), or you can specify the specific type to be used for the e column up front (I forgot the syntax for that right now, but the TextParse.jl doc should describe that). I actually think TextParse.jl should just widen the column type to String automatically in this situation, but that is not how TextParse.jl works right now (I did open an issue there on that).

I do think that the column should be String whenever there is at least one row that has stuff quoted. But one could probably think about the following: if someone explicitly specifies the column as `Float64, the parser could become a bit more forgiving and also handle floats that are in quotes…


#14

Thanks, all. I have since worked around the problem (not having to parse these malformed csv files and get the data in a different form.)