"écru": how do I work with this? ERROR: UnicodeError: invalid character index 2 (0xa9 is a continuation byte)

programista · February 14, 2018, 2:30pm

I read the data with a = readdlm("file.txt", ';', String); the file is UTF-8 encoded.

julia> (slow[4743516,1])
"écru"

julia> length(slow[4743516,1])
4

julia> slow[4743516,1][1]
'é': Unicode U+00e9 (category Ll: Letter, lowercase)

julia> slow[4743516,1][2]
ERROR: UnicodeError: invalid character index 2 (0xa9 is a continuation byte)
Stacktrace:
[1] slow_utf8_next(::Ptr{UInt8}, ::UInt8, ::Int64, ::Int64) at .\strings\string.jl:172
[2] next at .\strings\string.jl:204 [inlined]
[3] getindex(::String, ::Int64) at .\strings\basic.jl:32

julia> slow[4743516,1][3]
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> slow[4743516,1][4]
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)

julia> slow[4743516,1][5]
'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)

How can I read this data?
Paul

bkamins · February 14, 2018, 3:12pm

Please format code fragments in your question as code. Julia strings use byte indexing, not character indexing.
To get the characters of a string, use a for loop, e.g.:

julia> x = "écru"
"écru"

julia> for c in x
           println(c)
       end
é
c
r
u

Not all indices are valid for indexing in this string, e.g.:

julia> for i in 1:sizeof(x)
           println("\t$i\t", isvalid(x, i) ? x[i] : "invalid index")
       end
        1       é
        2       invalid index
        3       c
        4       r
        5       u

EDIT: The topic is covered in detail in the Julia Manual
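
To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes Julia 1.x (where ncodeunits is available) rather than the 0.6 release used in this thread, and it writes the é as an escape so that it is certainly the single code point U+00E9:

```julia
s = "\u00e9cru"                  # "écru", with é as the single code point U+00E9

println(length(s))               # 4: number of characters
println(ncodeunits(s))           # 5: number of bytes, since é takes two bytes in UTF-8
println(collect(eachindex(s)))   # [1, 3, 4, 5]: the valid character start indices
println(isvalid(s, 2))           # false: byte 2 is the continuation byte of é

# If positional access by character is really needed, one option is to
# collect the characters into a Vector{Char} first (this copies the data).
chars = collect(s)
println(chars[2])                # c

# Alternatively, step from one valid index to the next with nextind.
second = s[nextind(s, 1)]
println(second)                  # c
```

Collecting into a Vector{Char} is convenient for small strings; for large data, iterating or stepping with nextind avoids the extra copy.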

programista · February 14, 2018, 7:36pm

Yes, I get the same result on my machine:

julia> slow[4743516,1]
"écru"

julia> x = slow[4743516,1]
"écru"

julia> for c in x
           println(c)
       end
é
c
r
u

julia> for i in 1:sizeof(x)
           println("\t$i\t", isvalid(x, i) ? x[i] : "invalid index")
       end
        1       é
        2       invalid index
        3       c
        4       r
        5       u

What now?

Paul

bkamins · February 14, 2018, 8:23pm

It is not clear what else you want to achieve. All needed information is covered in detail in the Julia Manual I sent you the link to.

PS. You can quote code by putting ``` at the start and end of the code block, like:

x = 1

programista · February 15, 2018, 7:45am

I read it, thanks for the info, but I still do not know how to change the
first letter of each entry in my array (Slow) from lowercase to uppercase:

Slow = copy(slow)
for i = 1:k
    if islower(slow[i,2][1])
        Slow[i,2] = string(uppercase(slow[i,2][1]), slow[i,2][2:end])
    end
end
Paul


bkamins · February 15, 2018, 8:01am

If you want to change the whole array to uppercase use

uppercase.(slow)

If you want to uppercase only the second column of the array, use

uppercase.(slow[:, 2])

Both those commands use broadcasting and create a new array.

programista · February 15, 2018, 8:13am

Thanks,
I need to change only the FIRST character …

Something like this:
uppercase.(map(x->x[1:min(length(x),1)], slow[:,2]))

(map(x->x[2:end], slow[:,2]))
ERROR: UnicodeError: invalid character index 2 (0x86 is a continuation byte)
Stacktrace:

Unfortunately, there is a problem with 'special' characters.
Paul


Tamas_Papp · February 15, 2018, 8:42am

julia> ucfirst("écru")
"Écru"

Reading the manual would help.

bkamins · February 15, 2018, 8:58am

https://docs.julialang.org/en/latest/base/strings/ for Julia 0.7 and https://docs.julialang.org/en/release-0.6/stdlib/strings/ for Julia 0.6.2
as API has changed a bit.
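
One instance of that API change: from Julia 0.7 onward, ucfirst is named uppercasefirst. A minimal sketch for Julia 1.x follows, applying it to one column by broadcasting. The small matrix is a hypothetical stand-in for the slow array from the question, not the real data:

```julia
# Hypothetical 2x2 stand-in for the `slow` matrix from the question.
slow = ["1" "\u00e9cru"; "2" "bleu"]     # "\u00e9cru" is "écru"

# Copy the matrix, then overwrite column 2 with first-letter-capitalized strings.
Slow = copy(slow)
Slow[:, 2] .= uppercasefirst.(slow[:, 2])

println(Slow[:, 2])
```

Because uppercasefirst handles the multi-byte first character internally, no manual index arithmetic such as x[2:end] is needed.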

stevengj · February 15, 2018, 1:21pm

Although there is a built-in function ucfirst for this, it is also useful to learn how to do it yourself. A good way to do this is to look at the Julia source code for ucfirst. (The “built-in” Julia functions are, for the most part, just plain Julia code that is no better than code that you could write yourself, except that it is written by experienced Julia programmers.)

To convert the first character of a string to uppercase, it essentially just does string(uppercase(s[1]), SubString(s, nextind(s,1))). Note in particular the use of SubString to create a view of the rest of the string without making a copy, and nextind to get the next valid index of the string.

The ucfirst implementation has a couple of additional wrinkles: it checks for the case of an empty string (in which there is no s[1]), it avoids making a new string when the first character is already upper case, and it uses titlecase (for Unicode titlecase) which is a slightly more correct thing to do if you want to capitalize the first letter of a word in Unicode.
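
As a rough illustration of the approach just described, here is a minimal sketch assuming Julia 1.x. The function name capitalize_first is made up for this example, and it is not the actual Base implementation: it uses uppercase rather than titlecase and does not skip the allocation when the first character is already upper case.

```julia
# Hypothetical helper, not the Base implementation of ucfirst/uppercasefirst.
function capitalize_first(s::AbstractString)
    isempty(s) && return String(s)          # there is no first character to change
    first_char = s[firstindex(s)]
    # nextind gives the next valid index after the first character, however
    # many bytes that character occupies; SubString is a view, not a copy.
    rest = SubString(s, nextind(s, firstindex(s)))
    return string(uppercase(first_char), rest)
end

println(capitalize_first("\u00e9cru"))      # "\u00e9cru" is "écru"
```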

For more complicated string transformations, typically you will use an IOBuffer to build up the string piece-by-piece. See, for example, the implementation of the titlecase function for strings.
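
A minimal sketch of the IOBuffer pattern, assuming Julia 1.x. The function capitalize_words is a made-up example (it upper-cases the first character after any whitespace) and is not how Base implements titlecase:

```julia
# Hypothetical example of building a string piece by piece with an IOBuffer.
function capitalize_words(s::AbstractString)
    buf = IOBuffer()
    at_word_start = true
    for c in s                       # iterates characters, never raw byte indices
        if isspace(c)
            at_word_start = true
            print(buf, c)
        else
            print(buf, at_word_start ? uppercase(c) : c)
            at_word_start = false
        end
    end
    return String(take!(buf))        # convert the accumulated bytes to a String
end

println(capitalize_words("\u00e9cru et bleu"))
```

Writing into the buffer character by character avoids creating a new intermediate string at every step.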

programista · February 15, 2018, 4:07pm

Big thanks for the great lesson!
Paul


abcsds · April 25, 2018, 12:19pm

Hi guys,
With this in mind, I have run into a similar problem while implementing character n-grams: I need to access the index of the next letter in the string within a for loop.

Here’s an example of the problem:

s = "erzählen"
for (i, l) in enumerate(s)
    print(l," ")
    print(s[i+1]," ") # the error occurs here at the ä character
    println(s[i])
end

The problem is that there may be an arbitrary number of contiguous invalid indices, any number of times. How could I then access the next character in the string, if not with a for loop?

The code for the edit distance is the following:

function ngram(str,n)
    isempty(str) && return []
    l = []
    max_ind = length(str[1:end-n])
    for p in 1:max_ind
        push!(l,str[p:p+n]) # the error occurs here when p+n falls inside the ä character
    end
    return l
end

There might be a way to do it easily, but I can’t find it. What do you guys think?

stevengj · April 25, 2018, 12:38pm

Use eachindex to get the set of valid indices in a string, and nextind to get the next index. See the manual on string indexing.

Something like the following seems to be what you want:

s = "erzählen"
for i in eachindex(s)
    i == endof(s) && break
    print(s[i]," ")
    print(s[nextind(s,i)]," ")
    println(s[i])
end

or, somewhat more efficiently:

i, j, e = 1, nextind(s,1), endof(s)
while j ≤ e
    ci, cj = s[i], s[j]
    println(ci, ' ', cj, ' ', ci)
    i = j
    j = nextind(s, j)
end

and for ngram:

function ngram(str,n)
    l = String[]  # don't use [] since that is an untyped Any[] array
    i, j, e = 1, chr2ind(str, n), endof(str)
    while j ≤ e
        push!(l, str[i:j])
        i = nextind(str, i)
        j = nextind(str, j)
    end
    return l
end

Note that you can make l a SubString{String}[] list and push SubString(str, i, j) rather than str[i:j], to avoid copying the substrings.
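
For readers on Julia 1.x, the same loop can be sketched as follows. This is a port made under assumptions, not code from the thread: endof became lastindex in 0.7, chr2ind(str, n) is written here as nextind(str, 0, n), and the function requires n ≥ 1. It uses the SubString variant so that no substrings are copied.

```julia
# Julia 1.x sketch of the n-gram loop; assumes n >= 1.
function ngram(str::String, n::Integer)
    grams = SubString{String}[]
    # i: start index of the current n-gram; j: index of its n-th (last) character.
    i, j, e = firstindex(str), nextind(str, 0, n), lastindex(str)
    while j <= e
        push!(grams, SubString(str, i, j))   # a view into str, no copy
        i = nextind(str, i)                  # slide both ends one character forward
        j = nextind(str, j)
    end
    return grams
end

println(ngram("erz\u00e4hlen", 2))           # "erz\u00e4hlen" is "erzählen"
```

Each step advances both indices by one character, so the whole pass stays linear in the length of the string.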

stevengj · April 25, 2018, 1:09pm

Note also that length is the wrong thing here: that returns the number of characters in the string, which is not the same thing as the last index.

Also, if you are dealing with Unicode strings, you might want to normalize them before doing comparisons, via normalize_string(s, :NFC).
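
A small sketch of why normalization matters, assuming Julia 1.x, where the function lives in the Unicode standard library as Unicode.normalize (the 0.6 name is normalize_string, as above). The two strings below both display as "écru" but are encoded differently:

```julia
using Unicode

composed   = "\u00e9cru"      # é as the single code point U+00E9
decomposed = "e\u0301cru"     # 'e' followed by the combining acute accent U+0301

println(composed == decomposed)                             # false: different code points
println(length(composed), " vs ", length(decomposed))       # 4 vs 5 characters
println(Unicode.normalize(decomposed, :NFC) == composed)    # true after NFC normalization
```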

(To compute edit distance, I’m skeptical that materializing an array of ngrams will lead to the most efficient approach, as opposed to directly comparing the contents of two strings. There are lots of articles on dynamic-programming algorithms for this that could easily be ported to Julia, for example.)

abcsds · May 2, 2018, 10:08pm

Thank you so much!
Yes, I was working on edit distance and n-grams at the same time and got confused. This looks like a lot of work in exchange for character-level granularity with UTF-8. I also see that in v0.7 we’ll have Unicode.normalize instead of normalize_string, and nextind(str, i, n), which advances n characters past the current index i.
That should make this solution possible:

function ngram(str,n)
    str = Unicode.normalize(str, :NFC)
    isempty(str) && return []
    return [str[ind2char(str, p):nextind(str,p,n)] for p in 1:lastindex(str)]
end

Your solution with a while loop feels somewhat strange, but it’s amazing! I couldn’t have thought of it myself, thank you.

stevengj · May 2, 2018, 10:41pm

That solution is O(length^2), whereas mine is O(length). The reason is that skipping to the k-th index is O(k). (You could do the same thing in 0.6 with chr2ind, BTW.)

You really want to think about string processing in terms of consecutive iteration through the string. And allocating arrays of substrings is rarely going to be competitive with analyzing the string data in place.
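
One way to express consecutive iteration without touching indices at all, sketched here for Julia 1.x, is to zip the string with itself shifted by one character. This is offered as an alternative idiom, not as something proposed earlier in the thread:

```julia
s = "erz\u00e4hlen"                      # "erzählen"

# Pair each character with its successor; Iterators.drop skips the first one.
for (c, next_c) in zip(s, Iterators.drop(s, 1))
    println(c, ' ', next_c)
end

# The same pairs collected into a vector of (Char, Char) tuples.
bigrams = collect(zip(s, Iterators.drop(s, 1)))
```

Both iterators walk the string once from the front, so this stays a single linear pass.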

stevengj · May 2, 2018, 10:42pm

Your ind2char call is totally wrong, by the way. That doesn’t produce a valid string index in general.
