StringDocument() of the TextAnalysis package

Hi there!

Has anybody been using the StringDocument() function of the TextAnalysis package with Unicode (non-ASCII) strings? Passing a Greek string literal directly throws the error shown further down.
However, if I call StringDocument() on a (Greek) string read in from a file with the read() function, everything works:

mydata = open("/Users/atantos/Desktop/test.txt") do file
    read(file, String)
end

StringDocument(mydata) # works fine

Here is the code line with the error message:

StringDocument("είμαι οκ")
Error showing value of type StringDocument{String}:
ERROR: StringIndexError: invalid index [8], valid nearby indices [7]=>'α', [9]=>'ι'
Stacktrace:
[1] string_index_err(s::String, i::Int64)
@ Base ./strings/string.jl:12
[2] getindex
@ ./strings/string.jl:263 [inlined]
[3] summary(d::StringDocument{String})
@ TextAnalysis ~/.julia/packages/TextAnalysis/B0QxG/src/show.jl:16
[4] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, d::StringDocument{String})
@ TextAnalysis ~/.julia/packages/TextAnalysis/B0QxG/src/show.jl:45
[5] (::REPL.var"#38#39"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
@ REPL /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:220
[6] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
@ REPL /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:462
[7] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
@ REPL /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:213
[8] display(d::REPL.REPLDisplay, x::Any)
@ REPL /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:225
[9] display(x::Any)
@ Base.Multimedia ./multimedia.jl:328
[10] #invokelatest#2
@ ./essentials.jl:708 [inlined]
[11] invokelatest
@ ./essentials.jl:706 [inlined]
[12] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
@ REPL /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:247
[13] (::REPL.var"#40#41"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
@ REPL /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:231
[14] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
@ REPL /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:462
[15] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
@ REPL /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:229
[16] (::REPL.var"#do_respond#61"{Bool, Bool, REPL.var"#72#82"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
@ REPL /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:798
[17] #invokelatest#2
@ ./essentials.jl:708 [inlined]
[18] invokelatest
@ ./essentials.jl:706 [inlined]
[19] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
@ REPL.LineEdit /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/LineEdit.jl:2441
[20] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
@ REPL /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:1126
[21] (::REPL.var"#44#49"{REPL.LineEditREPL, REPL.REPLBackendRef})()
@ REPL ./task.jl:411

It looks like that’s occurring in the show method, i.e. when it’s trying to print a summary, so the document itself is constructed fine and only displaying it at the REPL fails. I filed Pull Request #257 on JuliaText/TextAnalysis.jl ("fix string indexing in `summary`") to try to help fix it.
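For anyone who wants to see the same error without TextAnalysis involved, here is a minimal sketch using only Base. It assumes (I have not confirmed this against the package source) that `summary` slices the text using a character count as if it were a byte index; the slice `s[1:length(s)]` below is just a stand-in for that.

```julia
s = "είμαι οκ"

# Greek letters take 2 bytes each in UTF-8, so the character count
# and the byte count differ.
nchars = length(s)       # 8 characters
nbytes = ncodeunits(s)   # 15 bytes

# Byte index 8 falls in the middle of 'α' (which starts at byte 7).
mid_char = isvalid(s, 8) # false

# Slicing with a character count as a byte index throws.
# (Stand-in for what summary is assumed to do.)
err = try
    s[1:length(s)]
catch e
    e
end
println(typeof(err))     # StringIndexError

# first(s, n) counts characters, so it never lands mid-character.
snippet = first(s, 5)
println(snippet)         # είμαι
```

Until a fixed release is available, ending the REPL line with a semicolon (e.g. `d = StringDocument("είμαι οκ");`) should avoid triggering the display code, since the stack trace shows the error is raised while showing the value.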


@ericphanson

using TextAnalysis
StringDocument("είμαι οκ")

is still not working. I suspect it has to do with the fact that the string contains multi-byte characters: Julia string indices are byte offsets, so nextind() has to be used to move to the starting byte of the next character instead of landing in between.
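To illustrate the nextind() point, here is a small sketch in plain Julia. The helper name `safe_prefix` is made up for this example and is not part of TextAnalysis or Base.

```julia
s = "είμαι οκ"

# Walk the string character by character, collecting each starting byte index.
char_starts = Int[]
i = firstindex(s)
while i <= lastindex(s)
    push!(char_starts, i)
    i = nextind(s, i)   # jump to the start of the next character
end
println(char_starts)    # [1, 3, 5, 7, 9, 11, 12, 14]

# Hypothetical helper: return at most n characters of s
# without ever slicing in the middle of a character.
function safe_prefix(s::AbstractString, n::Integer)
    n <= 0 && return ""
    n >= length(s) && return s
    stop = nextind(s, 0, n)   # byte index where the n-th character starts
    return s[1:stop]          # a valid end index includes the whole character
end

println(safe_prefix(s, 3))    # είμ
```

`eachindex(s)` yields the same valid indices as the loop, and `first(s, n)` does the same job as the helper; the explicit versions are only there to show what nextind() is doing.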

It looks like a new release hasn’t been tagged since my fix was merged, cc @avik. (It’s also possible there are other, unrelated Unicode-handling bugs.)