CSV misread

tk3369 · May 17, 2018, 8:32pm

How do I make it recognize the quotes?

shell> cat > test.csv
a,b,c,d,e,f
1863001,134,10000,1.0009,1.0000,-0.002033899
1863209,137,0,1.0000,"2,773.9000",

julia> CSV.read("test.csv")
2×6 DataFrames.DataFrame
│ Row │ a       │ b   │ c     │ d      │ e   │ f          │
├─────┼─────────┼─────┼───────┼────────┼─────┼────────────┤
│ 1   │ 1863001 │ 134 │ 10000 │ 1.0009 │ 1.0 │ -0.0020339 │
│ 2   │ 1863209 │ 137 │ 0     │ 1.0    │ 2.0 │ 773.9      │

davidanthoff · May 17, 2018, 9:03pm

CSVFiles.jl gets it right:

julia> using FileIO, CSVFiles, DataFrames

julia> load("test.csv") |> DataFrame
2×6 DataFrames.DataFrame
│ Row │ a       │ b   │ c     │ d      │ e          │ f          │
├─────┼─────────┼─────┼───────┼────────┼────────────┼────────────┤
│ 1   │ 1863001 │ 134 │ 10000 │ 1.0009 │ 1.0000     │ -0.0020339 │
│ 2   │ 1863209 │ 137 │ 0     │ 1.0    │ 2,773.9000 │ missing    │

ScottPJones · May 17, 2018, 9:25pm

File an issue on CSV.jl?

tk3369 · May 18, 2018, 5:17am

Actually, CSVFiles also breaks when I have more rows… Try this:

shell> cat > test3.csv
a,b,c,d,e,f
1863001,134,9981,1.0019,1.0000,0.0246182823
1863001,134,9982,1.0046,1.0000,0.0057100506
1863001,134,9983,1.0186,1.0000,-0.007331108
1863001,134,9984,1.0069,1.0000,0.0052499037
1863001,134,9985,0.9994,1.0000,0.0017077061
1863001,134,9986,1.0009,1.0000,0.0147836641
1863001,134,9987,0.9968,1.0000,-0.004429422
1863001,134,9988,1.0006,1.0000,0.0047920628
1863001,134,9989,1.0143,1.0000,-0.009376122
1863001,134,9990,1.0268,1.0000,-0.002704543
1863001,134,9991,1.0124,1.0000,-0.005628563
1863001,134,9992,1.0053,1.0000,0.0098183902
1863001,134,9993,1.0026,1.0000,0.0098087401
1863001,134,9994,0.9935,1.0000,0.0029873303
1863001,134,9995,1.0112,1.0000,0.0021465219
1863001,134,9996,0.9926,1.0000,-0.006265291
1863001,134,9997,0.9968,1.0000,0.0061212349
1863001,134,9998,0.9999,1.0000,-0.00676656
1863001,134,9999,1.0012,1.0000,0.0010823933
1863001,134,10000,1.0009,1.0000,-0.002033899
1863209,137,0,1.0000,"2,773.9000",

julia> DataFrame(load("test3.csv"))
MethodError: Cannot `convert` an object of type Float64 to an object of type TextParse.StrRange
This may have arisen from a call to the constructor TextParse.StrRange(...),
since type constructors fall back to convert methods.ERROR: CSV parsing error in test3.csv at line 24 char 21:
1863209,137,0,1.0000,"2,773.9000",
____________________^
column 5 is expected to be: TextParse.Field{Float64,TextParse.Numeric{Float64}}(<Float64>, true, true, false)
Stacktrace:
 [1] copy!(::Array{TextParse.StrRange,1}, ::Int64, ::Array{Float64,1}, ::Int64, ::Int64) at ./abstractarray.jl:691
 [2] promote_column(::Array{Float64,1}, ::Int64, ::Type{T} where T, ::Bool) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:460
 [3] promote_field(::String, ::TextParse.Field{Float64,TextParse.Numeric{Float64}}, ::Array{Float64,1}, ::TextParse.CSVParseError, ::Array{String,1}) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:425
 [4] (::TextParse.##39#43{DataStructures.OrderedDict{Union{Int64, String},AbstractArray{T,1} where T}})(::String, ::Int64) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:337
 [5] collect(::Base.Generator{Base.Iterators.Zip2{Array{String,1},UnitRange{Int64}},Base.##3#4{TextParse.##39#43{DataStructures.OrderedDict{Union{Int64, String},AbstractArray{T,1} where T}}}}) at ./array.jl:475
 [6] #_csvread_internal#35(::Bool, ::Char, ::Char, ::Bool, ::Bool, ::Int64, ::Void, ::Int64, ::Void, ::Bool, ::Array{String,1}, ::Array{String,1}, ::DataStructures.OrderedDict{Union{Int64, String},AbstractArray{T,1} where T}, ::Int64, ::Void, ::Array{Any,1}, ::String, ::Int64, ::TextParse.#_csvread_internal, ::String, ::Char) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:333
 [7] (::TextParse.#kw##_csvread_internal)(::Array{Any,1}, ::TextParse.#_csvread_internal, ::String, ::Char) at ./<missing>:0
 [8] (::TextParse.##31#33{Array{Any,1},String,Char})(::IOStream) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:97
 [9] open(::TextParse.##31#33{Array{Any,1},String,Char}, ::String, ::String) at ./iostream.jl:152
 [10] #_csvread_f#29(::Array{Any,1}, ::Function, ::String, ::Char) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:95
 [11] #csvread#25(::Array{Any,1}, ::Function, ::String, ::Char) at /opt/julia/share/julia/site/v0.6/TextParse/src/csv.jl:69
 [12] getiterator(::CSVFiles.CSVFile) at /opt/julia/share/julia/site/v0.6/CSVFiles/src/CSVFiles.jl:49
 [13] _DataFrame(::CSVFiles.CSVFile) at /opt/julia/share/julia/site/v0.6/IterableTables/src/integrations/dataframes-missing.jl:100
 [14] DataFrames.DataFrame(::CSVFiles.CSVFile) at /opt/julia/share/julia/site/v0.6/IterableTables/src/integrations/dataframes-missing.jl:129

tk3369 · May 18, 2018, 5:20am

good idea

https://github.com/JuliaData/CSV.jl/issues/207

davidanthoff · May 21, 2018, 9:40pm

You can specify that it should use more than the default 20 rows for figuring out the types of the columns with the type_detect_rows=24 option. Not ideal, it would be nicer if it actually just promoted the columns, I opened an issue for that.

tk3369 · May 22, 2018, 3:55am

Thanks David. Do you think it would cause any slowdown? The file above is a dummy file for testing and the real one comes about at around 150,000th line.

I don’t understand the promotion part. The column should have been parsed as Float64. While the subsequent lines are quoted & localized, they are still Float64. Either way, it should be just parsing a string into a Float64…

davidanthoff · May 22, 2018, 4:16am

Ah, you want something different than what the row detect gives you! You want the column to be parsed as a Float64, whereas adding more lines to the type detection will just parse the whole column as a String.

So I guess the real question here is whether a float that is surrounded by quotes should be read as a Float64 or as a String… Or rather, whether the parser that handles Float64 columns could be made tolerant enough to also accept quoted floats. I’m not sure it should, actually… Does anyone know how CSV parsers on other platforms handle that?

rdeits · May 22, 2018, 4:26am

Here’s Pandas in Python3:

>>> pandas.read_csv("foo.csv")
         a    b      c       d           e         f
0  1863001  134  10000  1.0009      1.0000 -0.002034
1  1863209  137      0  1.0000  2,773.9000       NaN

davidanthoff · May 22, 2018, 4:28am

Thanks! It is not entirely clear to me whether column e is float or string, though?

rdeits · May 22, 2018, 4:37am

It’s parsed as a string:

>>> t["e"][1]
'2,773.9000'

rdeits · May 22, 2018, 4:38am

Actually, that whole column has dtype=object and the first row has also become a string:

>>> t["e"][0]
'1.0000'

davidanthoff · May 22, 2018, 4:43am

Ok, thanks.

@tk3369 you can get the same behavior as pandas by either specifying that more rows should be used for type detection (that will occur overhead), or you can specify the specific type to be used for the e column up front (I forgot the syntax for that right now, but the TextParse.jl doc should describe that). I actually think TextParse.jl should just widen the column type to String automatically in this situation, but that is not how TextParse.jl works right now (I did open an issue there on that).

I do think that the column should be String whenever there is at least one row that has stuff quoted. But one could probably think about the following: if someone explicitly specifies the column as `Float64, the parser could become a bit more forgiving and also handle floats that are in quotes…

tk3369 · May 28, 2018, 7:58am

Thanks, all. I have since worked around the problem (not having to parse these malformed csv files and get the data in a different form.)

Topic		Replies	Views
String7 type with read CSV? New to Julia	8	361	June 23, 2023
Fatal error while reading in messy data using DataFrames, CSV Data dataframes , csv	6	629	May 25, 2021
CSV won't read tab separated file General Usage csv	23	651	March 4, 2024
CSV.jl writing quoted strings General Usage question , csv	14	96	December 19, 2024
CSV, white spaces and data type. Force Float on empty fields General Usage csv	23	1293	January 21, 2021

CSV misread

Related topics