Unicode related error when reading a .csv


#1

Hi everyone,

I get the following error message when trying to run readcsv():

at row 2, column 5 : UnicodeError: invalid character index)
in readcsv at base\datafmt.jl:619
in #readcsv#15 at base\datafmt.jl:619 <inlined>
in readdlm at base\datafmt.jl:54
in #readdlm#4 at base\datafmt.jl:54 <inlined>
in readdlm at base\datafmt.jl:114
in #readdlm#8 at base\datafmt.jl:114 <inlined>
in #readdlm_auto#11 at base\datafmt.jl:134
in readdlm_string at base\datafmt.jl:343
in dlm_parse at base\datafmt.jl:610

Does anybody know what this is about and what I can do about it? I saved the file before as .xlsx and it was read fine without problems using the Package ExcelReaders. This solution does not work for me anymore, as I realized that I missed some rows as I am hitting the upper limit for xlsx files.


#2

Could you attach the first few rows of the file (since the error occurs in row 2), and also give the version information for Julia?
Have you tried the https://github.com/JuliaData/CSV.jl package? (@quinnj has done quite a bit of work on this, it should do a better job of handling Unicode issues - but you might need to do something to convert the data from whatever character set it is actually stored in to UTF-8)


#3

I tried also CSV.read(), but it gave me the following error message:


CSV.CSVError("Unexpected start of quote (34), use \"9234\" to type \"34\"")
in read at CSV\src\Source.jl:294
in #read#29 at CSV\src\Source.jl:294
in #Source#11 at CSV\src\Source.jl:25
in  at base\<missing>
in #Source#12 at CSV\src\Source.jl:137
in readsplitline! at CSV\src\io.jl:89

I am running Julia 0.6.1 and the first few lines of my file are:

,author_C,author_OP,date_C,datetime_OP,text_C,title_OP,views_OP,compound

0,hasmukh_rawal,hasmukh_rawal,β€œNovember 10, 2017, 01:15:58 PM”,β€œNovember 10, 2017, 01:15:58 PM”,My friend just sent me this and said that Sewit2x is continuing the fork. The website has an announcement on their page. Is this true or a scam ? https://bitcoin2x.org/,Segwit2x continuing the fork !,1327,0.4118

1,rohqit,hasmukh_rawal,β€œNovember 10, 2017, 01:26:53 PM”,β€œNovember 10, 2017, 01:15:58 PM”,β€œQuote from: hasmukh_rawal on November 10, 2017, 01:15:58 PM My friend just sent me this and said that Sewit2x is continuing the fork. The website has an announcement on their page. Is this true or a scam ? https://bitcoin2x.org/ yeah heard this too, at this point I don’t know who or what to believe. Group claims to have 30% of bitcoin hashrate Either it happens Nov 15 or some other fork will soon follow or maybe Bitcoin cash draws people away from bitcoin. This thing is a mess, personally staying away from bitcoin until the dust settles or a huge dip. Way too centralized.”,Segwit2x continuing the fork !,1327,0.5789

2,dissgo,hasmukh_rawal,β€œNovember 10, 2017, 01:46:08 PM”,β€œNovember 10, 2017, 01:15:58 PM”,β€œI still ain’t believe first. I’m still searching for the news about it. When it’s back or continue, it should be announced by the devs.”,Segwit2x continuing the fork !,1327,0.0

3,darkangel11,hasmukh_rawal,β€œNovember 10, 2017, 02:07:27 PM”,β€œNovember 10, 2017, 01:15:58 PM”,2 months ago: We have decided to fork! 1 day ago: we have decided not to fork! today: we have decided to fork! Make up your minds you assholes! I bet it’s a play to lower the price right now and it’s working. People hate uncertainty and the feeling that their coins are being played with by a bunch of nerds that can’t even agree whether they will or will not try to fuck it all up for everybody. ,Segwit2x continuing the fork !,1327,-0.8035


Problem related to CSV.read() and strings
[ANN]: WIP Strs.jl package ready for alpha review and testing
#4

You could also try https://github.com/davidanthoff/CSVFiles.jl. It uses https://github.com/JuliaComputing/TextParse.jl under the hood, and I have no idea whether it will be able to deal with that file, but probably worth a try.


#5

Sadly, something similar there:

UnicodeError: invalid character index
in DataFrames.DataFrame at IterableTables\src\integrations\dataframes-dataarray.jl:127
in _DataFrame at IterableTables\src\integrations\dataframes-dataarray.jl:98
in getiterator at CSVFiles\src\CSVFiles.jl:31
in open at base\iostream.jl:152
in  at TextParse\src\csv.jl:72
in  at base\<missing>
in #csvread#28 at TextParse\src\csv.jl:78
in  at base\<missing>
in #_csvread#30 at TextParse\src\csv.jl:87
in  at base\<missing>
in #_csvread_internal#35 at TextParse\src\csv.jl:194
in guesscolparsers at TextParse\src\csv.jl:484
in getlineend at TextParse\src\util.jl:130
in next at base\strings\string.jl:204 <inlined>
in slow_utf8_next at base\strings\string.jl:172

#6

Hm, maybe this just isn’t encoded as UTF-8?


#7

Is there a way I can check that or change how the file is encoded? I saved the file on my own computer using Python and want to do some more manipulations with Julia. I would be thankful for any information!

Edit: I found a way to encode my CSV in unicode in Python. I will see if Julia can work with that.


#8

Instead of cutting and pasting (which might do some transformations), could you upload the first few lines of the file, maybe as a gist? That way it would be possible to see the raw data


#9

It is likely it’s the fancy quotes that are being used, that are causing problems, i.e.:

β€˜β€œβ€™: Unicode U+201c (category Pi: Punctuation, initial quote)
’”’: Unicode U+201d (category Pf: Punctuation, final quote)

Those are not really normal to see in a plain CSV file (maybe something that Excel does?)


#10

I’ve run into this a few times: Readdlm: ignore_invalid_chars option does not exist anymore. There used to be the very helpful option ignore_invalid_chars in dlmread: https://docs.julialang.org/en/release-0.4/stdlib/io-network/ which magically handled my issues. I in my case CSV.jl did work though.

Also note that copy-paste into a gist likely removes the offending characters. I had to push to the gist linked in my post to get the offending character there.

Anyway, here for you to try at home:

/tmp >> git  clone git@gist.github.com:8bf1448cf05356458710c02572722c88.git
Cloning into '8bf1448cf05356458710c02572722c88'...
remote: Counting objects: 6, done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 6
Receiving objects: 100% (6/6), done.
Resolving deltas: 100% (1/1), done.
/tmp >> cd 8bf1448cf05356458710c02572722c88 
/tmp/8bf1448cf05356458710c02572722c88(master)  >> julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1 (2017-10-24 22:15 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-pc-linux-gnu

julia> readdlm("txt")
ERROR: at row 2, column 3 : UnicodeError: invalid character index)
Stacktrace:
 [1] dlm_parse(::String, ::Char, ::Char, ::Char, ::Char, ::Bool, ::Bool, ::Bool, ::Int64, ::Bool, ::Base.DataFmt.DLMOffsets) at ./datafmt.jl:610
 [2] readdlm_string(::String, ::Char, ::Type, ::Char, ::Bool, ::Dict{Symbol,Union{Char, Integer, Tuple{Integer,Integer}}}) at ./datafmt.jl:343
 [3] #readdlm_auto#11(::Array{Any,1}, ::Function, ::String, ::Char, ::Type{T} where T, ::Char, ::Bool) at ./datafmt.jl:132
 [4] #readdlm#7 at ./datafmt.jl:81 [inlined]
 [5] readdlm(::String, ::Char, ::Char) at ./datafmt.jl:81
 [6] #readdlm#5(::Array{Any,1}, ::Function, ::String) at ./datafmt.jl:64
 [7] readdlm(::String) at ./datafmt.jl:64

julia> using CSV
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/mauro/.julia/v0.6/NullableArrays/src/operators.jl:99.

julia> CSV.read("txt")
5Γ—1 DataFrames.DataFrame
β”‚ Row β”‚ Device;Device serial;ID;Date/Time;Value;Unit;Mode;Value2;Unit2;Mode2;Measurement;Calibration;Additional;Sensor;Sensor serial;User    β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ "Multi 3630; 16231200;2;30.08.2016 15:57:11;0.1;Β΅S/cm;Cond;22.6;Β°C;Temp;;;C = 0.475 1/cm   Tref25   nLF;TetraCon 925-P; 16201701;" β”‚
β”‚ 2   β”‚ "Multi 3630; 16231200;2;30.08.2016 15:57:11;0.1;Β΅S/cm;Cond;22.6;Β°C;Temp;;;C = 0.475 1/cm   Tref25   nLF;TetraCon 925-P; 16201701;" β”‚
β”‚ 3   β”‚ "Multi 3630; 16231200;2;30.08.2016 15:57:12;0.1;Β΅S/cm;Cond;22.6;Β°C;Temp;;;C = 0.475 1/cm   Tref25   nLF;TetraCon 925-P; 16201701;" β”‚
β”‚ 4   β”‚ "Multi 3630; 16231200;2;30.08.2016 15:57:13;0.1;Β΅S/cm;Cond;22.6;Β°C;Temp;;;C = 0.475 1/cm   Tref25   nLF;TetraCon 925-P; 16201701;" β”‚
β”‚ 5   β”‚ "Multi 3630; 16231200;2;30.08.2016 15:57:14;0.1;Β΅S/cm;Cond;22.6;Β°C;Temp;;;C = 0.475 1/cm   Tref25   nLF;TetraCon 925-P; 16201701;" β”‚

julia> 

Edit: there is also uCSV.jl to try.


#11

uCSV.jl seems to work well when I get rid of the text and only use the entries with the date and the numbers, but it seems to do nothing, when I try to load the big dataset with all variables. I’ll let it run for some time now and look if something happens.


#12

Can’t you use some other format to transfer the data? HDF5 maybe?


#13

I’ll give it a try, thanks!


#14

It would be good to get to the bottom of this, i.e. to figure out whether the various CSV readers should actually read that file or whether there is actually a problem with the file itself.


#15

It would also great to have an escape hatch to still read the file and just drop the offending field.


#16

I finally had this error: https://github.com/JuliaData/CSV.jl/issues/86 with CSV.read. I tried to checkout the master and see if that works, but then it throws a different error related to DataStreams. Now I went back to the current branch and employed the workaround described in the csv-issue.

Now I get the following error:

CSV.CSVError("error parsing a `Int64` value on column 8, row 13897; encountered 'B'")
in read at CSV\src\Source.jl:294
in #read#29 at CSV\src\Source.jl:299
in stream! at DataStreams\src\DataStreams.jl:145
in #stream!#5 at DataStreams\src\DataStreams.jl:151
in stream! at DataStreams\src\DataStreams.jl:187
in streamto! at DataStreams\src\DataStreams.jl:173
in streamfrom at CSV\src\Source.jl:195
in parsefield at CSV\src\parsefields.jl:107 <inlined>
in parsefield at CSV\src\parsefields.jl:127 <inlined>
in checknullend at CSV\src\parsefields.jl:56 <inlined>

There seems to be something happening with the escapechar being " in front of other " and then something happening with the delimiters in this row.

EDIT: After getting rid of double-quotation marks manually in the csv file, I am able to read the file using CSV.read(), but now it reads not all rows, but stops after 104000 of over a million rows.

I am done for today. Maybe I’ll look into it some more tomorrow.


#17

Is there any chance you could just put the file somewhere for download so that we could try to see what is going on?

I’m not entirely following what you are doing, i.e. did you change the file now and are now hitting a new error? Or something else?


#18

I did the following:

  1. Save csv in Python df.to_csv('output_bitcointalk.csv'). This brought up all kinds of issues above. I tried different packages, for example CSV, uCSV, readtables() through DataFrames etc., but it didn’t work. You brought up that the csv might not be encoded in unicode afterall.
  2. I went back to Python and saved it through df.to_csv('output_bitcointalk_unicode.csv',encoding='utf-8'). For test purposes, I eliminated all columns except for Date in a string and a Float-column that contains some measure of sentiment. I could read that .csv using uCSV. It seems to work also on the bigger file, but Julia just keeps consuming CPU power without actually loading anything into its cache. I don’t know if something would come of it, if I ran it any longer. CSV.read() then threw the error as in https://github.com/JuliaData/CSV.jl/issues/86, i.e. this one: ERROR: CSV.CSVError("Unexpected start of quote (34), use \"9234\" to type \"34\""). I tried the workaround in the thread, which is:

escape_double_quote(s::String) = replace(s, "\"\"", "\\\"")
cleaned_file = IOBuffer(escape_double_quote(readstring(open("Downloads/WellIndex_20160811_c.csv"))))
CSV.read(cleaned_file, DataTable)

This did not work as well. There was still an issue with escapechars and delimiters, i.e. it somehow counted an extra delimiter in one of the rows. Maybe the rule above can be adjusted to fix the problem.

  1. After doing replace(s, "\"\"", "\\\"") manually in Notepad, I was able to read the file using uCSV.read("output_bitcointalk_unicode.csv", quotes='"', header=1), but not all of it. This is where I stopped for today. I saved the file as hdf to see if I can read it in this format.

Here is a Dropbox-Link to the file. It has about 800 MB. If you find a way to make it work, please let me know. https://www.dropbox.com/s/5r81mnoki8mffg7/output_bitcointalk_unicode.csv?dl=0


#19

I also got error at 104000-th line.

If I simplify it then problem is next:

#this is good!
julia> CSV.readsplitline(IOBuffer("104652,\"Thanks  \\ \",a"))
3-element Array{CSV.RawField,1}:
 CSV.RawField("104652", false)    
 CSV.RawField("Thanks  \\ ", true)
 CSV.RawField("a", false) 

#this is suspicios (I think that it is wrong if we like to interpret python's output)
julia> CSV.readsplitline(IOBuffer("104652,Thanks  \\,a"))
2-element Array{CSV.RawField,1}:
 CSV.RawField("104652", false)      
 CSV.RawField("Thanks  \\,a", false)

# this one end with error
julia> CSV.readsplitline(IOBuffer("104652,\"Thanks  \\\",a"))
ERROR: CSV.CSVError("EOF while trying to read the closing quote")
Stacktrace:
 [1] readsplitline!(::Array{CSV.RawField,1}, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::UInt8, ::UInt8, ::UInt8, ::Base.AbstractIOBuffer{Array{UInt8,1}}) at /home/palo/.julia/v0.6/CSV/src/io.jl:114
 [2] readsplitline(::Base.AbstractIOBuffer{Array{UInt8,1}}) at /home/palo/.julia/v0.6/CSV/src/io.jl:124

So it seems that escaping hack escape_double_quote is not enough. We have to escape backspace before quote as well.

Next worked without error (I read and split all rows):

julia> escape_double_quote(s::String) = replace(s, "\"\"", "\\\"");

julia> escape_back_quote(s::String) = replace(s, "\\\"", "\\\\\"");

julia> esca(s::String) = escape_double_quote(escape_back_quote(s));

julia> f = open("output_bitcointalk_unicode.csv?dl=0");

julia> i=0;it="";spl=[];for i in 1:2_000_000 it=readline(f); spl=CSV.readsplitline(IOBuffer(esca(it))); eof(f) && break; end

julia> i
1063934

Warning! I am not sure that you will get good data using this escaping hack!


Problem related to CSV.read() and strings
#20

Thanks, the following code worked like a charm!

escape_double_quote(s::String) = replace(s, "\"\"", "\\\"")
escape_back_quote(s::String) = replace(s, "\\\"", "\\\\\"");
esca(s::String) = escape_double_quote(escape_back_quote(s));
f = open("output_bitcointalk_unicode.csv");
cleaned_file = IOBuffer(esca(readstring(f)))
test=CSV.read(cleaned_file, DataFrame)