Julia 0.6 Unicode Parsing Problem

strings

#1

Julia 0.6.0-rc1.0 can’t properly load a Unicode character in a settings file I have, specifically “³” (U+00b3). Julia doesn’t have a problem if it’s a UTF-8 Unicode file but the settings files we have are generated in ISO-8859 format.

Julia 0.6 does load the line but the character that should be a ³ shows up as a symbol with a question mark in it (see below). Using readline to load the line returns a 30-byte String of invalid UTF-8 data.

Specific Problem Line

CO2, 10 , 40 , mmol/m³ , 0 , E

Error Message:

CO2, 10 , 40 , mmol/m� , 0 , E
ERROR: LoadError: at row 1, column 3 : UnicodeError: invalid character index)
Stacktrace:
[1] dlm_parse(::String, ::Char, ::Char, ::Char, ::Char, ::Bool, ::Bool, ::Bool, ::Int64, ::Bool, ::Base.DataFmt.DLMOffsets) at ./datafmt.jl:610
[2] readdlm_string(::String, ::Char, ::Type, ::Char, ::Bool, ::Dict{Symbol,Union{Char, Integer, Tuple{Integer,Integer}}}) at ./datafmt.jl:343
[3] readdlm_auto(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Char, ::Type{T} where T, ::Char, ::Bool) at ./datafmt.jl:119
[4] #readdlm#7 at ./datafmt.jl:81 [inlined]
[5] readdlm(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Char, ::Char) at ./datafmt.jl:81
[6] #readdlm#6 at ./datafmt.jl:73 [inlined]
[7] readdlm(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Char) at ./datafmt.jl:73
[8] #readcsv#14 at ./datafmt.jl:618 [inlined]
[9] readcsv(::Base.AbstractIOBuffer{Array{UInt8,1}}) at ./datafmt.jl:618
[10] slt_configload(::Array{String,1}) at /home/user/.julia/v0.6/MyModule/src/slt_configload.jl:67
[11] slt_config(::Array{String,1}) at /home/user/.julia/v0.6/MyModule/src/slt_config.jl:28
[12] #slt_load#21(::Bool, ::Bool, ::Float64, ::Function, ::String, ::DateTime, ::DateTime) at /home/user/.julia/v0.6/MyModule/src/slt_load.jl:115
[13] (::MyModule.#kw##slt_load)(::Array{Any,1}, ::MyModule.#slt_load, ::String, ::DateTime, ::DateTime) at ./:0
[14] include_from_node1(::String) at ./loading.jl:552
[15] include(::String) at ./sysimg.jl:14
[16] process_options(::Base.JLOptions) at ./client.jl:305
[17] _start() at ./client.jl:371
while loading /home/user/.julia/v0.6/MyModule/test/runtests.jl, in expression starting on line 11

Julia 0.5 didn’t have this problem.

Is there a setting I’ve missed that will allow Julia to parse it properly?


#2

I can’t reproduce your problem.

julia> readcsv(IOBuffer("CO2, 10 , 40 , mmol/m³ , 0 , E"))
1×6 Array{Any,2}:
 "CO2"  10  40  " mmol/m³ "  0  " E"

works fine for me in Julia 0.6.

Are you sure that your file is UTF-8 encoded? You didn’t accidentally save it in some other encoding? Could it be a terminal issue on Windows?


#3

Oh, I didn’t notice your comment: “but the settings files we have are generated in ISO-8859 format.”

If your data is not valid UTF-8, it’s not too surprising that we throw a Unicode error when trying to parse it as a string. I’m surprised that Julia 0.5 didn’t complain, though.

(You can probably use https://github.com/nalimilan/StringEncodings.jl to convert ISO-8859 streams into UTF-8 streams.)
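
For plain ISO-8859-1 (Latin-1) specifically, the conversion can also be sketched without a package, because each byte value in that encoding equals the Unicode code point of the character it represents. The following is a minimal sketch, assuming the file really is ISO-8859-1 and not another part of ISO-8859; `latin1_to_utf8` is a hypothetical helper name, not a Base function.

```julia
# Sketch: decode ISO-8859-1 (Latin-1) bytes into a valid UTF-8 String.
# Assumption: the input really is ISO-8859-1, where every byte value is
# also the Unicode code point of the character it encodes.
# `latin1_to_utf8` is a hypothetical helper, not part of Base.
latin1_to_utf8(bytes::Vector{UInt8}) = join(Char.(bytes))

# The problem line as raw ISO-8859-1 bytes; 0xb3 is the superscript three.
raw = vcat(Vector{UInt8}("CO2, 10 , 40 , mmol/m"),
           UInt8[0xb3],
           Vector{UInt8}(" , 0 , E"))

line = latin1_to_utf8(raw)
println(line)
```

The resulting string is valid UTF-8, so it can be wrapped in an IOBuffer and passed to readcsv as in the earlier reply. For other ISO-8859 parts or other encodings, a package such as StringEncodings.jl is the more general route.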


#4

It worked in Julia 0.5 so I figured it would work in Julia 0.6. I tried looking through the Julia repo for relevant changes but my Git-fu is lacking.

Is there a good way of programmatically determining the file encoding (preferably without adding new packages)?


#5

Some things have changed in Julia 0.6 regarding string parsing, which could have made this more visible. See, for example, this closed issue. In any case, as far as I know, invalid input is not guaranteed to produce an error (we could provide a way to enable input validation, but that’s a different issue).

As for determining the file encoding programmatically: no. Even with external libraries, this is at best a fragile operation. See the detect function of JuliaStrings/ICU.jl (not the registered ICU.jl).
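
Short of real detection, one thing that can be checked with Base alone is whether the bytes are valid UTF-8. The sketch below does only that and then falls back to an assumed encoding. It is not encoding detection: the ISO-8859-1 fallback is an assumption about the files, and `read_text_guess` is a hypothetical helper name.

```julia
# Sketch: treat the bytes as UTF-8 when they validate, otherwise assume
# ISO-8859-1. This is a heuristic, not encoding detection.
# `read_text_guess` is a hypothetical helper, not part of Base.
function read_text_guess(bytes::Vector{UInt8})
    if isvalid(String, bytes)
        # Already valid UTF-8 (pure ASCII also lands here).
        return String(copy(bytes))
    end
    # Assumed fallback: in ISO-8859-1 each byte equals its code point.
    return join(Char.(bytes))
end

utf8_bytes   = Vector{UInt8}("mmol/m³")                      # ³ stored as 0xc2 0xb3
latin1_bytes = vcat(Vector{UInt8}("mmol/m"), UInt8[0xb3])    # ³ stored as 0xb3

println(read_text_guess(utf8_bytes))
println(read_text_guess(latin1_bytes))
```

The caveats are the fragile part mentioned above: Latin-1 text whose bytes happen to form valid UTF-8 will be misread, and any other 8-bit encoding is indistinguishable from ISO-8859-1 with this check.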