Unicode related error when reading a .csv

IljaK91 · November 17, 2017, 4:04pm

Hi everyone,

I get the following error message when trying to run readcsv():

at row 2, column 5 : UnicodeError: invalid character index)
in readcsv at base\datafmt.jl:619
in #readcsv#15 at base\datafmt.jl:619 <inlined>
in readdlm at base\datafmt.jl:54
in #readdlm#4 at base\datafmt.jl:54 <inlined>
in readdlm at base\datafmt.jl:114
in #readdlm#8 at base\datafmt.jl:114 <inlined>
in #readdlm_auto#11 at base\datafmt.jl:134
in readdlm_string at base\datafmt.jl:343
in dlm_parse at base\datafmt.jl:610

Does anybody know what this is about and what I can do about it? I saved the file before as .xlsx and it was read fine without problems using the Package ExcelReaders. This solution does not work for me anymore, as I realized that I missed some rows as I am hitting the upper limit for xlsx files.

ScottPJones · November 17, 2017, 4:15pm

Could you attach the first few rows of the file (since the error occurs in row 2), and also give the version information for Julia?
Have you tried the GitHub - JuliaData/CSV.jl: Utility library for working with CSV and other delimited files in the Julia programming language package? (@quinnj has done quite a bit of work on this, it should do a better job of handling Unicode issues - but you might need to do something to convert the data from whatever character set it is actually stored in to UTF-8)

IljaK91 · November 17, 2017, 4:29pm

I tried also CSV.read(), but it gave me the following error message:


CSV.CSVError("Unexpected start of quote (34), use \"9234\" to type \"34\"")
in read at CSV\src\Source.jl:294
in #read#29 at CSV\src\Source.jl:294
in #Source#11 at CSV\src\Source.jl:25
in  at base\<missing>
in #Source#12 at CSV\src\Source.jl:137
in readsplitline! at CSV\src\io.jl:89

I am running Julia 0.6.1 and the first few lines of my file are:

,author_C,author_OP,date_C,datetime_OP,text_C,title_OP,views_OP,compound

0,hasmukh_rawal,hasmukh_rawal,“November 10, 2017, 01:15:58 PM”,“November 10, 2017, 01:15:58 PM”,My friend just sent me this and said that Sewit2x is continuing the fork. The website has an announcement on their page. Is this true or a scam ? https://bitcoin2x.org/,Segwit2x continuing the fork !,1327,0.4118

1,rohqit,hasmukh_rawal,“November 10, 2017, 01:26:53 PM”,“November 10, 2017, 01:15:58 PM”,“Quote from: hasmukh_rawal on November 10, 2017, 01:15:58 PM My friend just sent me this and said that Sewit2x is continuing the fork. The website has an announcement on their page. Is this true or a scam ? https://bitcoin2x.org/ yeah heard this too, at this point I don’t know who or what to believe. Group claims to have 30% of bitcoin hashrate Either it happens Nov 15 or some other fork will soon follow or maybe Bitcoin cash draws people away from bitcoin. This thing is a mess, personally staying away from bitcoin until the dust settles or a huge dip. Way too centralized.”,Segwit2x continuing the fork !,1327,0.5789

2,dissgo,hasmukh_rawal,“November 10, 2017, 01:46:08 PM”,“November 10, 2017, 01:15:58 PM”,“I still ain’t believe first. I’m still searching for the news about it. When it’s back or continue, it should be announced by the devs.”,Segwit2x continuing the fork !,1327,0.0

3,darkangel11,hasmukh_rawal,“November 10, 2017, 02:07:27 PM”,“November 10, 2017, 01:15:58 PM”,2 months ago: We have decided to fork! 1 day ago: we have decided not to fork! today: we have decided to fork! Make up your minds you assholes! I bet it’s a play to lower the price right now and it’s working. People hate uncertainty and the feeling that their coins are being played with by a bunch of nerds that can’t even agree whether they will or will not try to fuck it all up for everybody. ,Segwit2x continuing the fork !,1327,-0.8035

davidanthoff · November 17, 2017, 4:30pm

You could also try https://github.com/davidanthoff/CSVFiles.jl. It uses GitHub - queryverse/TextParse.jl: A bunch of fast text parsing tools under the hood, and I have no idea whether it will be able to deal with that file, but probably worth a try.

IljaK91 · November 17, 2017, 4:40pm

Sadly, something similar there:

UnicodeError: invalid character index
in DataFrames.DataFrame at IterableTables\src\integrations\dataframes-dataarray.jl:127
in _DataFrame at IterableTables\src\integrations\dataframes-dataarray.jl:98
in getiterator at CSVFiles\src\CSVFiles.jl:31
in open at base\iostream.jl:152
in  at TextParse\src\csv.jl:72
in  at base\<missing>
in #csvread#28 at TextParse\src\csv.jl:78
in  at base\<missing>
in #_csvread#30 at TextParse\src\csv.jl:87
in  at base\<missing>
in #_csvread_internal#35 at TextParse\src\csv.jl:194
in guesscolparsers at TextParse\src\csv.jl:484
in getlineend at TextParse\src\util.jl:130
in next at base\strings\string.jl:204 <inlined>
in slow_utf8_next at base\strings\string.jl:172

davidanthoff · November 17, 2017, 4:51pm

Hm, maybe this just isn’t encoded as UTF-8?

IljaK91 · November 17, 2017, 5:07pm

Is there a way I can check that or change how the file is encoded? I saved the file on my own computer using Python and want to do some more manipulations with Julia. I would be thankful for any information!

Edit: I found a way to encode my CSV in unicode in Python. I will see if Julia can work with that.

ScottPJones · November 17, 2017, 5:47pm

Instead of cutting and pasting (which might do some transformations), could you upload the first few lines of the file, maybe as a gist? That way it would be possible to see the raw data

ScottPJones · November 17, 2017, 5:51pm

It is likely it’s the fancy quotes that are being used, that are causing problems, i.e.:

‘“’: Unicode U+201c (category Pi: Punctuation, initial quote)
‘”’: Unicode U+201d (category Pf: Punctuation, final quote)

Those are not really normal to see in a plain CSV file (maybe something that Excel does?)

mauro3 · November 17, 2017, 5:57pm

I’ve run into this a few times: Readdlm: ignore_invalid_chars option does not exist anymore. There used to be the very helpful option ignore_invalid_chars in dlmread: https://docs.julialang.org/en/release-0.4/stdlib/io-network/ which magically handled my issues. I in my case CSV.jl did work though.

Also note that copy-paste into a gist likely removes the offending characters. I had to push to the gist linked in my post to get the offending character there.

Anyway, here for you to try at home:

/tmp >> git  clone git@gist.github.com:8bf1448cf05356458710c02572722c88.git
Cloning into '8bf1448cf05356458710c02572722c88'...
remote: Counting objects: 6, done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 6
Receiving objects: 100% (6/6), done.
Resolving deltas: 100% (1/1), done.
/tmp >> cd 8bf1448cf05356458710c02572722c88 
/tmp/8bf1448cf05356458710c02572722c88(master)  >> julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1 (2017-10-24 22:15 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-pc-linux-gnu

julia> readdlm("txt")
ERROR: at row 2, column 3 : UnicodeError: invalid character index)
Stacktrace:
 [1] dlm_parse(::String, ::Char, ::Char, ::Char, ::Char, ::Bool, ::Bool, ::Bool, ::Int64, ::Bool, ::Base.DataFmt.DLMOffsets) at ./datafmt.jl:610
 [2] readdlm_string(::String, ::Char, ::Type, ::Char, ::Bool, ::Dict{Symbol,Union{Char, Integer, Tuple{Integer,Integer}}}) at ./datafmt.jl:343
 [3] #readdlm_auto#11(::Array{Any,1}, ::Function, ::String, ::Char, ::Type{T} where T, ::Char, ::Bool) at ./datafmt.jl:132
 [4] #readdlm#7 at ./datafmt.jl:81 [inlined]
 [5] readdlm(::String, ::Char, ::Char) at ./datafmt.jl:81
 [6] #readdlm#5(::Array{Any,1}, ::Function, ::String) at ./datafmt.jl:64
 [7] readdlm(::String) at ./datafmt.jl:64

julia> using CSV
WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /home/mauro/.julia/v0.6/NullableArrays/src/operators.jl:99.

julia> CSV.read("txt")
5×1 DataFrames.DataFrame
│ Row │ Device;Device serial;ID;Date/Time;Value;Unit;Mode;Value2;Unit2;Mode2;Measurement;Calibration;Additional;Sensor;Sensor serial;User    │
├─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1   │ "Multi 3630; 16231200;2;30.08.2016 15:57:11;0.1;µS/cm;Cond;22.6;°C;Temp;;;C = 0.475 1/cm   Tref25   nLF;TetraCon 925-P; 16201701;" │
│ 2   │ "Multi 3630; 16231200;2;30.08.2016 15:57:11;0.1;µS/cm;Cond;22.6;°C;Temp;;;C = 0.475 1/cm   Tref25   nLF;TetraCon 925-P; 16201701;" │
│ 3   │ "Multi 3630; 16231200;2;30.08.2016 15:57:12;0.1;µS/cm;Cond;22.6;°C;Temp;;;C = 0.475 1/cm   Tref25   nLF;TetraCon 925-P; 16201701;" │
│ 4   │ "Multi 3630; 16231200;2;30.08.2016 15:57:13;0.1;µS/cm;Cond;22.6;°C;Temp;;;C = 0.475 1/cm   Tref25   nLF;TetraCon 925-P; 16201701;" │
│ 5   │ "Multi 3630; 16231200;2;30.08.2016 15:57:14;0.1;µS/cm;Cond;22.6;°C;Temp;;;C = 0.475 1/cm   Tref25   nLF;TetraCon 925-P; 16201701;" │

julia>

Edit: there is also uCSV.jl to try.

IljaK91 · November 17, 2017, 8:48pm

uCSV.jl seems to work well when I get rid of the text and only use the entries with the date and the numbers, but it seems to do nothing, when I try to load the big dataset with all variables. I’ll let it run for some time now and look if something happens.

Tero_Frondelius · November 17, 2017, 8:55pm

Can’t you use some other format to transfer the data? HDF5 maybe?

IljaK91 · November 17, 2017, 9:06pm

I’ll give it a try, thanks!

davidanthoff · November 17, 2017, 9:13pm

It would be good to get to the bottom of this, i.e. to figure out whether the various CSV readers should actually read that file or whether there is actually a problem with the file itself.

mauro3 · November 17, 2017, 9:17pm

It would also great to have an escape hatch to still read the file and just drop the offending field.

IljaK91 · November 17, 2017, 10:02pm

I finally had this error: https://github.com/JuliaData/CSV.jl/issues/86 with CSV.read. I tried to checkout the master and see if that works, but then it throws a different error related to DataStreams. Now I went back to the current branch and employed the workaround described in the csv-issue.

Now I get the following error:

CSV.CSVError("error parsing a `Int64` value on column 8, row 13897; encountered 'B'")
in read at CSV\src\Source.jl:294
in #read#29 at CSV\src\Source.jl:299
in stream! at DataStreams\src\DataStreams.jl:145
in #stream!#5 at DataStreams\src\DataStreams.jl:151
in stream! at DataStreams\src\DataStreams.jl:187
in streamto! at DataStreams\src\DataStreams.jl:173
in streamfrom at CSV\src\Source.jl:195
in parsefield at CSV\src\parsefields.jl:107 <inlined>
in parsefield at CSV\src\parsefields.jl:127 <inlined>
in checknullend at CSV\src\parsefields.jl:56 <inlined>

There seems to be something happening with the escapechar being " in front of other " and then something happening with the delimiters in this row.

EDIT: After getting rid of double-quotation marks manually in the csv file, I am able to read the file using CSV.read(), but now it reads not all rows, but stops after 104000 of over a million rows.

I am done for today. Maybe I’ll look into it some more tomorrow.

davidanthoff · November 17, 2017, 10:27pm

Is there any chance you could just put the file somewhere for download so that we could try to see what is going on?

I’m not entirely following what you are doing, i.e. did you change the file now and are now hitting a new error? Or something else?

IljaK91 · November 17, 2017, 11:04pm

I did the following:

Save csv in Python df.to_csv('output_bitcointalk.csv'). This brought up all kinds of issues above. I tried different packages, for example CSV, uCSV, readtables() through DataFrames etc., but it didn’t work. You brought up that the csv might not be encoded in unicode afterall.
I went back to Python and saved it through df.to_csv('output_bitcointalk_unicode.csv',encoding='utf-8'). For test purposes, I eliminated all columns except for Date in a string and a Float-column that contains some measure of sentiment. I could read that .csv using uCSV. It seems to work also on the bigger file, but Julia just keeps consuming CPU power without actually loading anything into its cache. I don’t know if something would come of it, if I ran it any longer. CSV.read() then threw the error as in https://github.com/JuliaData/CSV.jl/issues/86, i.e. this one: ERROR: CSV.CSVError("Unexpected start of quote (34), use \"9234\" to type \"34\""). I tried the workaround in the thread, which is:


escape_double_quote(s::String) = replace(s, "\"\"", "\\\"")
cleaned_file = IOBuffer(escape_double_quote(readstring(open("Downloads/WellIndex_20160811_c.csv"))))
CSV.read(cleaned_file, DataTable)

This did not work as well. There was still an issue with escapechars and delimiters, i.e. it somehow counted an extra delimiter in one of the rows. Maybe the rule above can be adjusted to fix the problem.

After doing replace(s, "\"\"", "\\\"") manually in Notepad, I was able to read the file using uCSV.read("output_bitcointalk_unicode.csv", quotes='"', header=1), but not all of it. This is where I stopped for today. I saved the file as hdf to see if I can read it in this format.

Here is a Dropbox-Link to the file. It has about 800 MB. If you find a way to make it work, please let me know. Dropbox - File Deleted

Liso · November 20, 2017, 12:12am

I also got error at 104000-th line.

If I simplify it then problem is next:

#this is good!
julia> CSV.readsplitline(IOBuffer("104652,\"Thanks  \\ \",a"))
3-element Array{CSV.RawField,1}:
 CSV.RawField("104652", false)    
 CSV.RawField("Thanks  \\ ", true)
 CSV.RawField("a", false) 

#this is suspicios (I think that it is wrong if we like to interpret python's output)
julia> CSV.readsplitline(IOBuffer("104652,Thanks  \\,a"))
2-element Array{CSV.RawField,1}:
 CSV.RawField("104652", false)      
 CSV.RawField("Thanks  \\,a", false)

# this one end with error
julia> CSV.readsplitline(IOBuffer("104652,\"Thanks  \\\",a"))
ERROR: CSV.CSVError("EOF while trying to read the closing quote")
Stacktrace:
 [1] readsplitline!(::Array{CSV.RawField,1}, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::UInt8, ::UInt8, ::UInt8, ::Base.AbstractIOBuffer{Array{UInt8,1}}) at /home/palo/.julia/v0.6/CSV/src/io.jl:114
 [2] readsplitline(::Base.AbstractIOBuffer{Array{UInt8,1}}) at /home/palo/.julia/v0.6/CSV/src/io.jl:124

So it seems that escaping hack escape_double_quote is not enough. We have to escape backspace before quote as well.

Next worked without error (I read and split all rows):

julia> escape_double_quote(s::String) = replace(s, "\"\"", "\\\"");

julia> escape_back_quote(s::String) = replace(s, "\\\"", "\\\\\"");

julia> esca(s::String) = escape_double_quote(escape_back_quote(s));

julia> f = open("output_bitcointalk_unicode.csv?dl=0");

julia> i=0;it="";spl=[];for i in 1:2_000_000 it=readline(f); spl=CSV.readsplitline(IOBuffer(esca(it))); eof(f) && break; end

julia> i
1063934

Warning! I am not sure that you will get good data using this escaping hack!

IljaK91 · November 23, 2017, 11:19am

Thanks, the following code worked like a charm!

escape_double_quote(s::String) = replace(s, "\"\"", "\\\"")
escape_back_quote(s::String) = replace(s, "\\\"", "\\\\\"");
esca(s::String) = escape_double_quote(escape_back_quote(s));
f = open("output_bitcointalk_unicode.csv");
cleaned_file = IOBuffer(esca(readstring(f)))
test=CSV.read(cleaned_file, DataFrame)

Topic		Replies	Views
Read special characters using CSV.read New to Julia csv	22	1055	October 11, 2023
Readcsv julia 0.6.0 not working properly General Usage question	3	1019	March 28, 2017
Encoding, CSV.read / write General Usage package	5	2275	August 17, 2018
CSV won't read tab separated file General Usage csv	23	651	March 4, 2024
Julia CSV.read stopped working Data csv	32	2804	April 30, 2022

Unicode related error when reading a .csv

Related topics