Read special characters using CSV.read

Hi, I’m trying to read a file (here) that contains the following (as it appears in Excel):

image

Reading the data file reads the special character incorrectly:

MP_names = CSV.read("All MPs 23.07.csv", DataFrame)
println(MP_names[130, [:Constituency]])
println("test ", lowercase("Ynys Môn"))
println("data", lowercase(MP_names[130, :Constituency]))

DataFrameRow
Row │ Constituency
│ String
─────┼──────────────
130 │ Ynys M\xf4n
test ynys môn
ERROR: LoadError: Base.InvalidCharError{Char}('\xf4')
Stacktrace:
[1] throw_invalid_char(c::Char)
@ Base .\char.jl:86
[2] UInt32
@ .\char.jl:133 [inlined]
[3] convert
@ .\char.jl:185 [inlined]
[4] cconvert
@ .\essentials.jl:492 [inlined]
[5] lowercase(c::Char)
@ Base.Unicode .\strings\unicode.jl:289
[6] map(f::typeof(lowercase), s::String)
@ Base .\strings\basic.jl:622
[7] lowercase(s::String)
@ Base.Unicode .\strings\unicode.jl:622
[8] macro expansion
@ c:\Users.…\MPs and Constituencies.jl:24 [inlined]
[9] top-level scope
@ .\timing.jl:273
in expression starting at c:\Users.…\MPs and Constituencies.jl:16

How can I read this character properly?

Thanks

That’s probably a non-UTF-8 encoding. Check the CSV docs for how to combine CSV.jl with StringEncodings:

https://csv.juliadata.org/stable/examples.html#stringencodings
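
A minimal sketch of what that docs section describes, assuming the file is ISO-8859-1 (Latin-1) encoded. The temporary file and its two columns are made up for illustration. `open(path, enc"ISO-8859-1")` comes from StringEncodings and decodes the bytes to UTF-8 on the fly before CSV.jl parses them.

```julia
using CSV, StringEncodings

# Build a small Latin-1 encoded CSV by hand (illustrative data only).
# Every character here has a code point below 256, so UInt8(c) is its Latin-1 byte.
bytes = UInt8.(collect("Name,Constituency\nThérèse,Ynys Môn\n"))
path = tempname() * ".csv"
write(path, bytes)

# Wrap the file in a decoding stream so CSV.jl receives valid UTF-8.
file = CSV.File(open(path, enc"ISO-8859-1"); delim=',')

name = String(file.Constituency[1])
println(name)
println(lowercase(name))   # works once the text has been decoded
```

The same decoding stream can be passed wherever CSV.jl accepts an IO source.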

How can I find out what encoding to use? I’m suddenly out of my depth here!

I can save the file from Excel to a Unicode txt file but CSV.read won’t open that at all.

MP_names = CSV.read("All MPs 23.07.txt", DataFrame)

ERROR: LoadError: ArgumentError: Symbol name may not contain \0
Stacktrace:
[1] _Symbol
@ .\boot.jl:509 [inlined]
[2] Symbol
@ .\boot.jl:515 [inlined]
[3] #10
@ .\none:0 [inlined]
[4] iterate
@ .\generator.jl:47 [inlined]
[5] collect(itr::Base.Generator{Vector{String}, CSV.var"#10#13"{Bool}})
@ Base .\array.jl:782
[6] detectcolumnnames(buf::Vector{UInt8}, headerpos::Int64, datapos::Int64, len::Int64, options::Parsers.Options, header::Any, normalizenames::Bool, oq::UInt8, eq::UInt8, cq::UInt8, cmt::Nothing, ignoreemptyrows::Bool)
@ CSV C:\Users\TGebbels.julia\packages\CSV\OnldF\src\detection.jl:185
[7] CSV.Context(source::CSV.Arg, header::CSV.Arg, normalizenames::CSV.Arg, datarow::CSV.Arg, skipto::CSV.Arg, footerskip::CSV.Arg, transpose::CSV.Arg, comment::CSV.Arg, ignoreemptyrows::CSV.Arg, ignoreemptylines::CSV.Arg, select::CSV.Arg, drop::CSV.Arg, limit::CSV.Arg, buffer_in_memory::CSV.Arg, threaded::CSV.Arg, ntasks::CSV.Arg, tasks::CSV.Arg, rows_to_check::CSV.Arg, lines_to_check::CSV.Arg, missingstrings::CSV.Arg, missingstring::CSV.Arg, delim::CSV.Arg, ignorerepeated::CSV.Arg, quoted::CSV.Arg, quotechar::CSV.Arg, openquotechar::CSV.Arg, closequotechar::CSV.Arg, escapechar::CSV.Arg, dateformat::CSV.Arg, dateformats::CSV.Arg, decimal::CSV.Arg, groupmark::CSV.Arg, truestrings::CSV.Arg, falsestrings::CSV.Arg, stripwhitespace::CSV.Arg, type::CSV.Arg, types::CSV.Arg, typemap::CSV.Arg, pool::CSV.Arg, downcast::CSV.Arg, lazystrings::CSV.Arg, stringtype::CSV.Arg, strict::CSV.Arg, silencewarnings::CSV.Arg, maxwarnings::CSV.Arg, debug::CSV.Arg, parsingdebug::CSV.Arg, validate::CSV.Arg, streaming::CSV.Arg)
@ CSV C:\Users\TGebbels.julia\packages\CSV\OnldF\src\context.jl:470
[8] file#32
@ C:\Users\TGebbels.julia\packages\CSV\OnldF\src\file.jl:222 [inlined]
[9] CSV.File(source::String)
@ CSV C:\Users\TGebbels.julia\packages\CSV\OnldF\src\file.jl:162
[10] read#118
@ C:\Users\TGebbels.julia\packages\CSV\OnldF\src\CSV.jl:117 [inlined]
[11] read
@ C:\Users\TGebbels.julia\packages\CSV\OnldF\src\CSV.jl:113 [inlined]
[12] macro expansion
@ c:\Users.…\MPs and Constituencies.jl:22 [inlined]
[13] top-level scope
@ .\timing.jl:273
in expression starting at c:\Users.…\MPs and Constituencies.jl:16

Could you try saving a file (the minimum possible) containing these particular characters and then applying the following functions to it to see what it contains?

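buf = Vector{UInt8}(undef, 100)                                 # room for the first 100 bytes
fnz = readbytes!(open("are-there-non-utf-code.txt", "r"), buf)  # number of bytes actually read
Char.(buf[1:fnz])    # each byte shown as the character with that code point
String(buf[1:fnz])   # the same bytes interpreted as UTF-8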

I tried saving a file containing only the following two lines in UTF-8 format (I used Word on a Windows system):
AEIÔU
aeiôu

getting this:

julia> fnz=readbytes!(open("non-utf-code.txt","r"),buf )   
20

julia> Char.(buf[1:fnz])
20-element Vector{Char}:
 'ï': Unicode U+00EF (category Ll: Letter, lowercase)      
 '»': Unicode U+00BB (category Pf: Punctuation, final quote)
 '¿': Unicode U+00BF (category Po: Punctuation, other)     
 'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
 'E': ASCII/Unicode U+0045 (category Lu: Letter, uppercase)
 'I': ASCII/Unicode U+0049 (category Lu: Letter, uppercase)
 'Ã': Unicode U+00C3 (category Lu: Letter, uppercase)      
 '\u94': Unicode U+0094 (category Cc: Other, control)      
 'U': ASCII/Unicode U+0055 (category Lu: Letter, uppercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space) 
 '\r': ASCII/Unicode U+000D (category Cc: Other, control)  
 '\n': ASCII/Unicode U+000A (category Cc: Other, control)  
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 'Ã': Unicode U+00C3 (category Lu: Letter, uppercase)      
 '´': Unicode U+00B4 (category Sk: Symbol, modifier)       
 'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
 '\r': ASCII/Unicode U+000D (category Cc: Other, control)  
 '\n': ASCII/Unicode U+000A (category Cc: Other, control)  

julia> String(buf[1:fnz])
"\ufeffAEIÔU \r\naeiôu\r\n"

and this:

julia> using CSV, DataFrames

julia> file = CSV.File(open("are-there-non-utf-code.txt")) 
1-element CSV.File:
 CSV.Row: (AEIÔU = String7("aeiôu"),)

julia> file |> DataFrame
1×1 DataFrame
 Row │ AEIÔU   
     │ String7
─────┼─────────
   1 │ aeiôu
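
Reading the byte dump above: the first three bytes (EF BB BF, displayed as `ï»¿`) are the UTF-8 byte order mark, and `Ô` and `ô` are each stored as two bytes (C3 94 and C3 B4), which is what UTF-8 does. A small sketch using only Base functions that contrasts this with the single byte `\xf4` seen in the original CSV file:

```julia
# "ô" (U+00F4) takes two bytes in UTF-8 but one byte in Latin-1 / Windows-1252.
utf8_bytes   = Vector{UInt8}("ô")   # UInt8[0xc3, 0xb4]
latin1_bytes = UInt8[0xf4]

# Julia assumes strings are UTF-8, so only the first byte sequence is valid.
println(isvalid(String, utf8_bytes))     # true
println(isvalid(String, latin1_bytes))   # false

# The UTF-8 byte order mark that Word wrote at the start of the file.
bom = UInt8[0xef, 0xbb, 0xbf]
println(String(copy(bom)) == "\ufeff")   # true
```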

['T', 'h', 'é', 'r', 'è', 's', 'e', ',', 'S', 'i', 'n', 'n', ' ', 'F', 'é', 'i', 'n', ',', 'Y', 'n', 'y', 's', ' ', 'M', 'ô', 'n', '\r', '\n']
Th�r�se,Sinn F�in,Ynys M�n

I did this in VSCode:

buf = Vector{UInt8}(undef, 100)
fnz = readbytes!(open("All MPs 23.07.csv", "r"), buf)
println(Char.(buf[1:fnz]))
println(String(buf[1:fnz]))

I cut down the csv file in Excel by deleting surrounding cells (not touching the offending cells at all). In notepad I can see:

Thérèse,Sinn Féin,Ynys Môn

but if I open the whole (original) CSV file in Notepad, the same cells show as:

Th r se
Sinn F in
Ynys M n

The apparently empty spaces are filled (in notepad) with a square with a ? inside. In Excel they display correctly.

Excel shows the CSV file like this:

image

If I try (in VScode):

file = CSV.File(open("All MPs 23.07.csv"))
println(file)

I get

CSV.File("<IOStream: 2980782706580401464>"):
Size: 0 x 3
Tables.Schema:
ERROR: LoadError: Base.InvalidCharError{Char}('\xe9')

and a very long stacktrace:

Stacktrace:
[1] throw_invalid_char(c::Char)
@ Base .\char.jl:86
[2] UInt32
@ .\char.jl:133 [inlined]
[3] convert
@ .\char.jl:185 [inlined]
[4] cconvert
@ .\essentials.jl:492 [inlined]
[5] is_id_char(c::Char)
@ Base .\show.jl:1414
[6] _all(f::typeof(Base.is_id_char), itr::Base.Iterators.Rest{String, Int64}, #unused#::Colon)
@ Base .\reduce.jl:1283
[7] all
@ .\reduce.jl:1278 [inlined]
[8] isidentifier(s::String)
@ Base .\show.jl:1442
[9] isidentifier
@ .\show.jl:1444 [inlined]
[10] show_unquoted_quote_expr(io::IOContext{IOBuffer}, value::Any, indent::Int64, prec::Int64, quote_level::Int64)
@ Base .\show.jl:1757
[11] show(io::IOContext{IOBuffer}, s::Symbol)
@ Base .\show.jl:1346
[12] sprint(f::Function, args::Symbol; context::IOContext{Base.TTY}, sizehint::Int64)
@ Base .\strings\io.jl:112
[13] sprint
@ .\strings\io.jl:107 [inlined]
[14] alignment_from_show
@ .\show.jl:2817 [inlined]
[15] alignment(io::Base.TTY, x::Symbol)
@ Base .\show.jl:2836
[16] alignment(io::Base.TTY, X::AbstractVecOrMat, rows::Vector{Int64}, cols::Vector{Int64}, cols_if_complete::Int64, cols_otherwise::Int64, sep::Int64, ncols::Int64)
@ Base .\arrayshow.jl:69
[17] _print_matrix(io::Base.TTY, X::AbstractVecOrMat, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64, rowsA::UnitRange{Int64}, colsA::UnitRange{Int64})
@ Base .\arrayshow.jl:207
[18] print_matrix(io::Base.TTY, X::Matrix{Any}, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64)
@ Base .\arrayshow.jl:171
[19] print_matrix
@ .\arrayshow.jl:171 [inlined]
[20] show(io::Base.TTY, sch::ERROR: Base.InvalidCharError{Char}('\xe9')
Stacktrace:
[1] throw_invalid_char(c::Char)
@ Base .\char.jl:86
[2] UInt32
@ .\char.jl:133 [inlined]
[3] convert
@ .\char.jl:185 [inlined]
[4] cconvert
@ .\essentials.jl:492 [inlined]
[5] is_id_char(c::Char)
@ Base .\show.jl:1414
[6] _all(f::typeof(Base.is_id_char), itr::Base.Iterators.Rest{String, Int64}, #unused#::Colon)
@ Base .\reduce.jl:1283
[7] all
@ .\reduce.jl:1278 [inlined]
[8] isidentifier(s::String)
@ Base .\show.jl:1442
[9] isidentifier
@ .\show.jl:1444 [inlined]
[10] show_unquoted_quote_expr(io::IOContext{IOBuffer}, value::Any, indent::Int64, prec::Int64, quote_level::Int64)
@ Base .\show.jl:1757
[11] show
@ .\show.jl:1346 [inlined]
[12] show_delim_array(io::IOContext{IOBuffer}, itr::Tuple{Symbol, Symbol, Symbol}, op::Char, delim::Char, cl::Char, delim_one::Bool, i1::Int64, n::Int64)
@ Base .\show.jl:1325
[13] show_delim_array
@ .\show.jl:1310 [inlined]
[14] show(io::IOContext{IOBuffer}, t::Tuple{Symbol, Symbol, Symbol})
@ Base .\show.jl:1343
[15] show_typeparams(io::IOContext{IOBuffer}, env::Core.SimpleVector, orig::Core.SimpleVector, wheres::Vector{TypeVar})
@ Base .\show.jl:707
[16] show_datatype(io::IOContext{IOBuffer}, x::DataType, wheres::Vector{TypeVar})
@ Base .\show.jl:1092
[17] show_datatype
@ .\show.jl:1058 [inlined]
[18] _show_type(io::IOContext{IOBuffer}, x::Type)
@ Base .\show.jl:958
[19] show(io::IOContext{IOBuffer}, x::Type)
@ Base .\show.jl:950
[20] sprint(f::Function, args::Type; context::IOContext{Base.TTY}, sizehint::Int64)
@ Base .\strings\io.jl:112
[21] sprint
@ .\strings\io.jl:107 [inlined]
[22] #print_type_bicolor#540
@ .\show.jl:2491 [inlined]
[23] show_tuple_as_call(io::IOContext{Base.TTY}, name::Symbol, sig::Type; demangle::Bool, kwargs::Nothing, argnames::Vector{Symbol}, qualified::Bool, hasfirst::Bool)
@ Base .\show.jl:2472
[24] show_tuple_as_call
@ .\show.jl:2441 [inlined]
[25] show_spec_linfo(io::IOContext{Base.TTY}, frame::Base.StackTraces.StackFrame)
@ Base.StackTraces .\stacktraces.jl:244
[26] print_stackframe(io::IOContext{Base.TTY}, i::Int64, frame::Base.StackTraces.StackFrame, n::Int64, ndigits_max::Int64, modulecolor::Symbol)
@ Base .\errorshow.jl:730
[27] print_stackframe(io::IOContext{Base.TTY}, i::Int64, frame::Base.StackTraces.StackFrame, n::Int64, ndigits_max::Int64, modulecolordict::IdDict{Module, Symbol}, modulecolorcycler::Base.Iterators.Stateful{Base.Iterators.Cycle{Vector{Symbol}}, Union{Nothing, Tuple{Symbol, Int64}}, Int64})
@ Base .\errorshow.jl:695
[28] show_full_backtrace(io::IOContext{Base.TTY}, trace::Vector{Any}; print_linebreaks::Bool)
@ Base .\errorshow.jl:594
[29] show_full_backtrace
@ .\errorshow.jl:587 [inlined]
[30] show_backtrace(io::IOContext{Base.TTY}, t::Vector{Base.StackTraces.StackFrame})
@ Base .\errorshow.jl:791
[31] showerror(io::IOContext{Base.TTY}, ex::Base.InvalidCharError{Char}, bt::Vector{Base.StackTraces.StackFrame}; backtrace::Bool)
@ Base .\errorshow.jl:90
[32] showerror(io::IOContext{Base.TTY}, ex::LoadError, bt::Vector{Base.StackTraces.StackFrame}; backtrace::Bool)
@ Base .\errorshow.jl:96
[33] show_exception_stack(io::IOContext{Base.TTY}, stack::Base.ExceptionStack)
@ Base .\errorshow.jl:895
[34] display_error(io::Base.TTY, stack::Base.ExceptionStack)
@ Base .\client.jl:111
[35] display_error(stack::Base.ExceptionStack)
@ Base .\client.jl:114
[36] #invokelatest#2
@ .\essentials.jl:819 [inlined]
[37] invokelatest
@ .\essentials.jl:816 [inlined]
[38] exec_options(opts::Base.JLOptions)
@ Base .\client.jl:310
[39] _start()
@ Base .\client.jl:522

caused by: LoadError: Base.InvalidCharError{Char}('\xe9')
Stacktrace:
[1] throw_invalid_char(c::Char)
@ Base .\char.jl:86
[2] UInt32
@ .\char.jl:133 [inlined]
[3] convert
@ .\char.jl:185 [inlined]
[4] cconvert
@ .\essentials.jl:492 [inlined]
[5] is_id_char(c::Char)
@ Base .\show.jl:1414
[6] _all(f::typeof(Base.is_id_char), itr::Base.Iterators.Rest{String, Int64}, #unused#::Colon)
@ Base .\reduce.jl:1283
[7] all
@ .\reduce.jl:1278 [inlined]
[8] isidentifier(s::String)
@ Base .\show.jl:1442
[9] isidentifier
@ .\show.jl:1444 [inlined]
[10] show_unquoted_quote_expr(io::IOContext{IOBuffer}, value::Any, indent::Int64, prec::Int64, quote_level::Int64)
@ Base .\show.jl:1757
[11] show(io::IOContext{IOBuffer}, s::Symbol)
@ Base .\show.jl:1346
[12] sprint(f::Function, args::Symbol; context::IOContext{Base.TTY}, sizehint::Int64)
@ Base .\strings\io.jl:112
[13] sprint
@ .\strings\io.jl:107 [inlined]
[14] alignment_from_show
@ .\show.jl:2817 [inlined]
[15] alignment(io::Base.TTY, x::Symbol)
@ Base .\show.jl:2836
[16] alignment(io::Base.TTY, X::AbstractVecOrMat, rows::Vector{Int64}, cols::Vector{Int64}, cols_if_complete::Int64, cols_otherwise::Int64, sep::Int64, ncols::Int64)
@ Base .\arrayshow.jl:69
[17] _print_matrix(io::Base.TTY, X::AbstractVecOrMat, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64, rowsA::UnitRange{Int64}, colsA::UnitRange{Int64})
@ Base .\arrayshow.jl:207
[18] print_matrix(io::Base.TTY, X::Matrix{Any}, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64)
@ Base .\arrayshow.jl:171
[19] print_matrix
@ .\arrayshow.jl:171 [inlined]
[20] show(io::Base.TTY, sch::fatal: error thrown and no exception handler available.
Base.InvalidCharError{Char}(char=Char(0xe9000000))
throw_invalid_char at .\char.jl:86
UInt32 at .\char.jl:133 [inlined]
convert at .\char.jl:185 [inlined]
cconvert at .\essentials.jl:492 [inlined]
is_id_char at .\show.jl:1414
_all at .\reduce.jl:1283
all at .\reduce.jl:1278 [inlined]
isidentifier at .\show.jl:1442
isidentifier at .\show.jl:1444 [inlined]
show_unquoted_quote_expr at .\show.jl:1757
show at .\show.jl:1346 [inlined]
show_delim_array at .\show.jl:1325
show_delim_array at .\show.jl:1310 [inlined]
show at .\show.jl:1343
unknown function (ip: 000002202952bb5a)
show_typeparams at .\show.jl:707
show_datatype at .\show.jl:1092
show_datatype at .\show.jl:1058 [inlined]
_show_type at .\show.jl:958
show at .\show.jl:950
jfptr_show_49738.clone_1 at C:\Users\TGebbels\AppData\Local\Programs\Julia-1.9.3\lib\julia\sys.dll (unknown line)
#sprint#484 at .\strings\io.jl:112
sprint at .\strings\io.jl:107 [inlined]
#print_type_bicolor#540 at .\show.jl:2491 [inlined]
print_type_bicolor at .\show.jl:2490
jfptr_print_type_bicolor_25682.clone_1 at C:\Users\TGebbels\AppData\Local\Programs\Julia-1.9.3\lib\julia\sys.dll (unknown line)
#show_tuple_as_call#539 at .\show.jl:2472
show_tuple_as_call at .\show.jl:2441 [inlined]
show_spec_linfo at .\stacktraces.jl:244
print_stackframe at .\errorshow.jl:730
print_stackframe at .\errorshow.jl:695
#show_full_backtrace#921 at .\errorshow.jl:594
show_full_backtrace at .\errorshow.jl:587 [inlined]
show_backtrace at .\errorshow.jl:791
#showerror#898 at .\errorshow.jl:90
showerror at .\errorshow.jl:86
unknown function (ip: 000002202952ae26)
#showerror#899 at .\errorshow.jl:96
showerror at .\errorshow.jl:94
unknown function (ip: 00000220295286a6)
show_exception_stack at .\errorshow.jl:895
display_error at .\client.jl:111
unknown function (ip: 000002202952807a)
display_error at .\client.jl:114
jfptr_display_error_32250.clone_1 at C:\Users\TGebbels\AppData\Local\Programs\Julia-1.9.3\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1880 [inlined]
jl_f__call_latest at C:/workdir/src\builtins.c:774
#invokelatest#2 at .\essentials.jl:819 [inlined]
invokelatest at .\essentials.jl:816 [inlined]
_start at .\client.jl:524
jfptr__start_29544.clone_1 at C:\Users\TGebbels\AppData\Local\Programs\Julia-1.9.3\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1880 [inlined]
true_main at C:/workdir/src\jlapi.c:573
jl_repl_entrypoint at C:/workdir/src\jlapi.c:717
mainCRTStartup at C:/workdir/cli\loader_exe.c:59
BaseThreadInitThunk at C:\WINDOWS\System32\KERNEL32.DLL (unknown line)
RtlUserThreadStart at C:\WINDOWS\SYSTEM32\ntdll.dll (unknown line)

The ISO-8859-1 example from the docs works fine for me?

Yes, this works for me, too! Thanks!
How would I know a priori that my file is ISO-8859-1 encoded? I’ve looked it up and it seems to be a somewhat out-of-date scheme.

Another possibility here is to coax Excel into saving the file in UTF-8, which must be possible somehow. That’s pretty much the encoding everyone uses today.

It’s probably Windows-1252, actually, which is almost the same as ISO-8859-1.
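
For bytes such as `0xe9` (é) and `0xf4` (ô) the two encodings agree, and in ISO-8859-1 each byte value equals the Unicode code point. A rough sketch of decoding by hand with Base only, assuming the data really is ISO-8859-1. The helper name `latin1_to_string` is made up here. Windows-1252 differs in the range 0x80 to 0x9F (curly quotes, the euro sign), which this helper does not handle.

```julia
# Hypothetical helper: decode ISO-8859-1 bytes into a UTF-8 Julia String.
# Char(b) for a UInt8 gives the character with code point b, which is
# exactly the ISO-8859-1 mapping.
latin1_to_string(bytes::AbstractVector{UInt8}) = String(Char.(bytes))

raw = UInt8[0x59, 0x6e, 0x79, 0x73, 0x20, 0x4d, 0xf4, 0x6e]   # "Ynys M\xf4n"
decoded = latin1_to_string(raw)

println(decoded)
println(isvalid(decoded))
println(lowercase(decoded))
```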

It’s simple: Windows-1252 seems to be the only non-UTF-8 extension of ASCII still in widespread use, at least in Western countries. If you see text that looks mostly okay in UTF-8 (because the ASCII characters are fine), but non-ASCII characters are garbled mojibake, then 99 times out of 100 it’s Windows-1252.

(Thanks to Microsoft for keeping this precious historical artifact alive.)

More generally, there are various heuristics for charset detection. These days, however, you mostly only need to check for UTF-8, Windows-1252, and UTF-16LE (which will look like complete garbage in UTF-8 because it’s not an ASCII superset).
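
A rough sketch of that short checklist using only Base functions. The function name `guess_encoding` and the returned symbols are invented for illustration, and the final branch is a guess, not a detection. Real charset detectors weigh far more evidence than this.

```julia
# Hypothetical heuristic: look for byte order marks first, then test for
# valid UTF-8, and otherwise fall back to guessing Windows-1252.
function guess_encoding(bytes::Vector{UInt8})
    if length(bytes) >= 3 && bytes[1:3] == [0xef, 0xbb, 0xbf]
        return :utf8_with_bom
    elseif length(bytes) >= 2 && bytes[1:2] == [0xff, 0xfe]
        return :utf16le
    elseif isvalid(String, bytes)
        return :utf8            # plain ASCII also lands here
    else
        return :windows1252     # a guess: the usual culprit for Western text
    end
end

println(guess_encoding(Vector{UInt8}("Ynys Môn")))                   # utf8
println(guess_encoding(UInt8[0x4d, 0xf4, 0x6e]))                     # windows1252
println(guess_encoding(UInt8[0xff, 0xfe, 0x4d, 0x00, 0xf4, 0x00]))   # utf16le
```

UTF-16 text contains a zero byte for every ASCII character, which likely explains the "Symbol name may not contain \0" error earlier in the thread when CSV.read was given Excel's "Unicode txt" file.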

I don’t know the topic well, but it seems reasonable to assume that logic of the kind mentioned here is used.

I used the first 3 bytes of

Char.(buf[1:fnz])

to do the search

Interesting, @oheil
I saved the file to a new name using the Unicode (utf8) option you highlighted.
I assumed the more conventional approach to CSV.read would then just work (as usual):

MP_names = CSV.read("All MPs 23.07 - utf-8.csv", DataFrame)
println(MP_names[130, :Constituency])
println(lowercase(MP_names[130, :Constituency]))

But no!

Ynys M�n
ERROR: LoadError: Base.InvalidCharError{Char}('\xf4')
Stacktrace:…

Even though I’m reasonably sure I did save it using utf-8 encoding (I did it several times, and the option persists if I reopen the file), reading it using ISO-8859-1 is still necessary.

Is it me or is it Microsoft?

I have no idea, I just wanted to show in an explicit way, how Excel provides the possibility. I didn’t test it with Julia to be sure. Perhaps it’s even just an option which is meant for something else because “Weboption” isn’t really what we would expect or what we want to do by just saving the file.

All I know is that character encoding and decoding is quite complex and tricky, and always a source of surprises.

By the way, LibreOffice Calc asks explicitly for the encoding when you use “Save as…” with the .CSV file format. Perhaps this is a good option for you: open the Excel sheet with LO Calc, “Save as…”, xxx.csv, choose the encoding, and, with some luck, the default CSV.read may work (I haven’t tried it).

I think it’s mostly working for me with UTF-8?
image

julia> file = CSV.File(open("Test.csv"), header=false)
1-element CSV.File:
 CSV.Row: (Column1 = String15("Thérèse"), Column2 = String7("Si’nn"), Column3 = String7("Féin"), Column4 = String7("Ynys"), Column5 = String7("Môn"))

My excel doesn’t show this file type, “CSV UTF-8”. It’s just “CSV” here. But I would like to have this too.

Yes @Nathan_Boyer This works for me, too. Thanks.

I’m just a beginner but I’d like to restate the problem here:

CSV.read fails to read special characters in some circumstances but does not throw an error. The error is only generated when subsequent code tries to process the incorrect data (in this instance, lowercase())

CSV.read obviously tries to read the file encoded as ISO-8859-1 (or Windows-1252) and it almost succeeds, but it fails on some valid special characters.

This seems to me to be an issue with CSV.read. It should either succeed or throw an error.

Further, since Excel is globally ubiquitous and therefore likely to be a very common source of csv files but does not use UTF-8 by default, I think CSV.read should read these files without any problem.

This is by design. Even when they are encoded as UTF-8, CSV files can and often do include invalid string data. If CSV.read refused to load such files, there’d be no way to work with them short of editing them in some external tool, which would really not be great. Moreover, there’s no reason (in Julia at least) to refuse to work with such files: you can read and work with invalid string data so long as you don’t try to do something with it that isn’t well-defined for that data. In this case, you’ve asked to change the case of a byte that isn’t valid UTF-8, which doesn’t have a meaningful answer, so it throws an error. (Although, if we wanted to be really permissive, we could just leave invalid data alone while changing the case of a string.) A potential design change would be to add an option to CSV.read to validate strings as UTF-8 and turn that on by default, only letting you read invalid data if you disable that option.
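
Independent of any such option, a check along these lines can be run after loading. This is a sketch with made-up data standing in for a column read from a file that was not decoded; `isvalid` and `findall` are from Base.

```julia
# Stand-in for a column of strings read from a Latin-1 file without decoding.
# String(UInt8[...]) builds a string from raw bytes without validating them.
constituencies = [
    "Aberavon",
    String(UInt8[0x59, 0x6e, 0x79, 0x73, 0x20, 0x4d, 0xf4, 0x6e]),   # "Ynys M\xf4n"
]

# Indices of entries that are not valid UTF-8.
bad = findall(!isvalid, constituencies)
println(bad)   # [2]

# Only call string functions that need valid Unicode on the valid entries.
lowered = [isvalid(s) ? lowercase(s) : s for s in constituencies]
```

A non-empty `bad` is a strong hint that the file should be re-read with an explicit encoding.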

From my point of view, your suggested design change seems like an excellent idea.

The characters I was trying to read were invalid under UTF8 but not under all encodings. Any error message might mention this, too:

Error - invalid enc_scheme_used characters found. Consider specifying a different encoding using enc"scheme_name". Alternatively, to force reading of invalid characters, use allow_invalid_chars = true

where enc_scheme_used states the encoding scheme used for this attempted CSV.read operation.