#=
badtext.jl: Julia 1.0.3 crash on bad text data under Mac OS 10.14.3
This snippet is from a corrupted text file used in a course exercise I'm writing.
=#
text = "detract . The world will
little note, nor long remember what we say here, but it can never forget what
they did here. It is for us the living, rather, to be dedicated here to the
unfinished work which they who fought here have thus far so nobly advanced. It
is rather for us to be here dedicated to the great task remaining before us—that
from "
for ic in 1 : length(text)
    c = text[ic]
    if Int(c) >= 128
        println("char $ic of text, '$c', has value $(Int(c))")
    end
end
#=
The result:
julia> include("desktop/badtext.jl")
char 338 of text, '—', has value 8212
ERROR: LoadError: StringIndexError("detract . The world will\n little note, nor long remember what we say here, but it can never forget what\n they did here. It is for us the living, rather, to be dedicated here to the\n unfinished work which they who fought here have thus far so nobly advanced. It\n is rather for us to be here dedicated to the great task remaining before us—that\n from ", 339)
Stacktrace:
[1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
[2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:216
[3] getindex(::String, ::Int64) at ./strings/string.jl:209
[4] top-level scope at /Users/robinverdier/desktop/badtext.jl:9 [inlined]
[5] top-level scope at ./none:0
[6] include at ./boot.jl:317 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1044
[8] include(::Module, ::String) at ./sysimg.jl:29
[9] include(::String) at ./client.jl:392
[10] top-level scope at none:0
in expression starting at /Users/robinverdier/desktop/badtext.jl:8
The actual failing text is shown in this hexdump:
00000150 73 e2 80 94 74 68 61 74 0a 20 66 72 6f 6d 20 |s...that. from |
________
My question is not how to get around it by using try or filtering the input, but why
simply examining a character in the extended ASCII range causes this indexing error.
=#
Try this:
for ic in eachindex(text)
You are indexing through a string that contains Unicode (>1 byte) characters.
Hitting “—” works, but the next index (ic+1) is not at a character start; it is in the middle of the multibyte character “—”.
You can find more info here:
https://docs.julialang.org/en/v1/manual/strings/index.html#Unicode-and-UTF-8-1
If you don’t need the index, you can also iterate over the characters directly with a loop of the form: for c in text
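As a minimal sketch of the two suggestions above, here is the same check run on a short made-up string (the string and the names valid and nonascii are illustrative, not part of the original file):

```julia
# Hypothetical sample: the dash takes 3 bytes in UTF-8, so it spans byte indices 3 to 5.
text = "us—that"

# eachindex yields only the byte indices at which a character starts.
valid = collect(eachindex(text))

# Same test as in the original loop, but only valid indices are visited.
for ic in eachindex(text)
    c = text[ic]
    if Int(c) >= 128
        println("char at index $ic of text, '$c', has value $(Int(c))")
    end
end

# If the index is not needed, iterate over the characters directly.
nonascii = [c for c in text if Int(c) >= 128]
```

Note that the index reported here is a byte index, not a character count, so it can skip values.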
Thank you both for your explanation of the crash on corrupted
text. If I understand correctly, when Julia finds a byte with a
value between 128 and 255, even though it’s a legitimate extended
ASCII character, it treats it and other similar bytes as belonging
to a multi-byte character, and will generate an error and crash if
the program addresses a byte inside that character. The only safe
way to access text data is then to use a special set of
sequential-only access methods, including eachindex(), thisind(),
prevind(), and nextind().
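
For reference, a small sketch of what the index functions named above do, using a made-up sample string (the variable names are illustrative only). They can also step backward or round an arbitrary byte position to a character start, so access need not be strictly sequential:

```julia
# Hypothetical sample: 4 characters, 6 bytes; the dash occupies bytes 3 to 5.
t = "us—t"

starts_char_3 = isvalid(t, 3)   # byte 3 begins the dash
starts_char_4 = isvalid(t, 4)   # byte 4 lies inside the dash

# thisind rounds a byte position down to the start of the character containing it.
start_of_4 = thisind(t, 4)

# nextind and prevind step to the neighboring character starts.
after_dash = nextind(t, 3)
before_t = prevind(t, 6)

# Indexing at a rounded position is safe, whereas t[4] would raise StringIndexError.
c = t[thisind(t, 4)]
```
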
Unfortunately, the indexing is important. The snippet I
presented is actually from a venerable exercise for my course:
translation into “pig-latin”. I still haven’t made a completely
satisfactory Julia version, and the problems increase for a
subsequent useful exercise in encoding and decoding data using
variable shifts. The versions from my Python course simply treat
bytes as bytes and consequently work perfectly.
Since the course is for novices and tries to present the
simplest possible coding examples, I have no idea how to present
this counter-intuitive (to me, but evidently not to the Julia
creators) behavior.
You can convert your string to an array of bytes quite easily.
As this is an exercise I just give some hints instead of the solution.
The crucial part of your string is:
julia> t="us—t"
"us—t"
So this is a string of 4 characters (unicode):
julia> length(t)
4
But you already know that some characters have more than 1 byte as storage:
julia> sizeof(t)
6
So, what you want is a conversion from your string with variable character size to an array of bytes with, in the case of t, 6 elements.
The conversion is up to you now, and it is very straightforward in Julia.
And, another hint: a byte is not an Int, a byte has no sign.
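One possible way to follow these hints, offered as a sketch rather than as the intended solution, is to copy the string's UTF-8 code units into a vector. The names bytes and roundtrip are made up for this example:

```julia
t = "us—t"

# codeunits(t) is a read-only view of the UTF-8 bytes; collect copies it into a Vector{UInt8}.
bytes = collect(codeunits(t))

# Each element is an unsigned byte (UInt8), not an Int.
println(bytes)

# Converting back: String takes ownership of its argument, so pass a copy
# if the byte vector is still needed afterwards.
roundtrip = String(copy(bytes))
```

Every index from 1 to length(bytes) is valid in the byte vector, so plain integer indexing works there.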
To clarify a couple of things:
- this text isn’t corrupted, it’s valid UTF-8
- Julia doesn’t crash, it raises an error indicating fairly precisely what the usage error was
If the data was actually corrupted and invalid, then Python cannot read it at all—it will crash and not allow you to process the file. Julia handles invalid text data just fine, although asking for the code point of an invalid character will cause an error since that doesn’t make sense.
Several solutions have been presented:
- Iterate characters instead of using indices from 1 to length(text)
- Use for i in eachindex(text) to iterate only valid indices
- Several more solutions were presented on StackOverflow
If you want to iterate code units instead of characters, you can use for i = 1:ncodeunits(text) and then access code units with codeunit(text, i).
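
A minimal sketch of that code-unit loop, assuming a short made-up string and a hypothetical highbytes vector that collects the results:

```julia
text = "us—that"

# Collect (index, byte) pairs for every code unit outside the ASCII range.
highbytes = Tuple{Int,UInt8}[]
for i = 1:ncodeunits(text)
    b = codeunit(text, i)   # always valid for 1 <= i <= ncodeunits(text)
    if b >= 0x80
        push!(highbytes, (i, b))
    end
end

println(highbytes)
```

Here every integer index is usable, because the loop addresses bytes rather than characters.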