Unicode related error when reading a .csv

I also got an error at the 104000th line.

Simplified, the problem looks like this:

#this is good!
julia> CSV.readsplitline(IOBuffer("104652,\"Thanks  \\ \",a"))
3-element Array{CSV.RawField,1}:
 CSV.RawField("104652", false)    
 CSV.RawField("Thanks  \\ ", true)
 CSV.RawField("a", false) 

# this is suspicious (I think it is wrong if we want to interpret Python's output)
julia> CSV.readsplitline(IOBuffer("104652,Thanks  \\,a"))
2-element Array{CSV.RawField,1}:
 CSV.RawField("104652", false)      
 CSV.RawField("Thanks  \\,a", false)

# this one ends with an error
julia> CSV.readsplitline(IOBuffer("104652,\"Thanks  \\\",a"))
ERROR: CSV.CSVError("EOF while trying to read the closing quote")
Stacktrace:
 [1] readsplitline!(::Array{CSV.RawField,1}, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::UInt8, ::UInt8, ::UInt8, ::Base.AbstractIOBuffer{Array{UInt8,1}}) at /home/palo/.julia/v0.6/CSV/src/io.jl:114
 [2] readsplitline(::Base.AbstractIOBuffer{Array{UInt8,1}}) at /home/palo/.julia/v0.6/CSV/src/io.jl:124

So it seems that the escape_double_quote escaping hack is not enough. We have to escape a backslash that comes right before a quote as well.

The following worked without error (I read and split all rows):

julia> escape_double_quote(s::String) = replace(s, "\"\"", "\\\"");

julia> escape_back_quote(s::String) = replace(s, "\\\"", "\\\\\"");

julia> esca(s::String) = escape_double_quote(escape_back_quote(s));

julia> f = open("output_bitcointalk_unicode.csv?dl=0");

julia> i=0;it="";spl=[];for i in 1:2_000_000 it=readline(f); spl=CSV.readsplitline(IOBuffer(esca(it))); eof(f) && break; end

julia> i
1063934

Warning! I am not sure that you will get good data using this escaping hack!
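
To make the warning concrete, here is a minimal sketch of the same two replacements in Julia 1.x syntax (an assumption on my side: the session above is Julia 0.6, where replace took positional arguments instead of a pair). It does not call CSV at all; it only shows what the hack does to the raw text of a line, including one case where it likely damages the data. The sample lines other than the failing one are made up for illustration.

```julia
# Same helpers as above, ported to Julia 1.x (pattern => replacement).
# Turn  \"  into  \\"  so the backslash no longer escapes the quote.
escape_back_quote(s::AbstractString) = replace(s, "\\\"" => "\\\\\"")
# Turn  ""  into  \"  (doubled quote becomes backslash-escaped quote).
escape_double_quote(s::AbstractString) = replace(s, "\"\"" => "\\\"")
# Order matters: backslashes first, then doubled quotes.
esca(s::AbstractString) = escape_double_quote(escape_back_quote(s))

failing = "104652,\"Thanks  \\\",a"      # 104652,"Thanks  \",a
doubled = "1,\"say \"\"hi\"\"\",b"       # 1,"say ""hi""",b   (hypothetical line)
empty   = "1,\"\",2"                     # 1,"",2             (hypothetical line)

println(esca(failing))   # 104652,"Thanks  \\",a
println(esca(doubled))   # 1,"say \"hi\"",b
println(esca(empty))     # 1,\",2   <- the empty quoted field is gone
```

The last line is the kind of thing I am worried about: an empty quoted field is also two quotes in a row, so the hack rewrites it even though nothing needed escaping there.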