"écru": how do I work with this? ERROR: UnicodeError: invalid character index 2 (0xa9 is a continuation byte)


#1

The data was read with a = readdlm("file.txt", ';', String); the file is Unicode (UTF-8).

julia> (slow[4743516,1])
"écru"

julia> length(slow[4743516,1])
4

julia> slow[4743516,1][1]
'é': Unicode U+00e9 (category Ll: Letter, lowercase)

julia> slow[4743516,1][2]
ERROR: UnicodeError: invalid character index 2 (0xa9 is a continuation byte)
Stacktrace:
[1] slow_utf8_next(::Ptr{UInt8}, ::UInt8, ::Int64, ::Int64) at .\strings\string.jl:172
[2] next at .\strings\string.jl:204 [inlined]
[3] getindex(::String, ::Int64) at .\strings\basic.jl:32

julia> slow[4743516,1][3]
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> slow[4743516,1][4]
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)

julia> slow[4743516,1][5]
'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)

How can I read this data?
Paul


#2

Please quote code fragments in your question. Julia strings are indexed by byte offset, not by character number, so not every integer is a valid index.
To get the characters of a string, iterate over it with a for loop, e.g.:

julia> x = "écru"
"écru"

julia> for c in x
           println(c)
       end
é
c
r
u

Not every byte index is a valid character index in this string, e.g.:

julia> for i in 1:sizeof(x)
           println("\t$i\t", isvalid(x, i) ? x[i] : "invalid index")
       end
        1       é
        2       invalid index
        3       c
        4       r
        5       u
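
This also explains why length reports 4 while index 5 is still valid. A minimal sketch comparing the character count, the byte count, and the valid indices, assuming Julia 0.7 or later (the variable names are only for illustration):

```julia
x = "écru"

n_chars = length(x)              # number of characters
n_bytes = sizeof(x)              # number of bytes in the UTF-8 encoding
valid   = collect(eachindex(x))  # byte offsets at which a character starts
second  = x[nextind(x, 1)]       # the character that follows 'é'

println(n_chars)   # 4
println(n_bytes)   # 5
println(valid)     # [1, 3, 4, 5]
println(second)    # c
```

The 'é' occupies bytes 1 and 2, which is why index 2 is rejected and the remaining characters sit at indices 3, 4 and 5.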

EDIT: The topic is covered in detail in the Julia Manual


#3

Yes, I get the same on my machine:

julia> slow[4743516,1]
"écru"

julia> x = slow[4743516,1]
"écru"

julia> for c in x
           println(c)
       end
é
c
r
u

julia> for i in 1:sizeof(x)
           println("\t$i\t", isvalid(x, i) ? x[i] : "invalid index")
       end
        1       é
        2       invalid index
        3       c
        4       r
        5       u

What now?

Paul
On 2018-02-14 at 16:17, Bogumił Kamiński wrote:


#4

It is not clear what else you want to achieve. All the information you need is covered in detail in the Julia Manual section I linked to.

PS. You can quote code by putting ``` at the start and end of the code block, like:

x = 1

#5

I have read it, thanks for the info, but I still do not know how to change the
first letters in my array (Slow) from lowercase to uppercase:

Slow = copy(slow)
for i = 1:k
    if islower(slow[i,2][1])
        Slow[i,2] = string(uppercase(slow[i,2][1]), slow[i,2][2:end])
    end
end
Paul

On 2018-02-14 at 21:28, Bogumił Kamiński wrote:


#6

If you want to change the whole array to uppercase use

uppercase.(slow)

if you want to uppercase only the second column of the array use

uppercase.(slow[:, 2])

Both of these commands use broadcasting and create a new array.


#7

Thanks,
but I need to change only the FIRST character,

something like this:
uppercase.(map(x->x[1:min(length(x),1)], slow[:,2]))

(map(x->x[2:end], slow[:,2]))
ERROR: UnicodeError: invalid character index 2 (0x86 is a continuation byte)
Stacktrace:

Unfortunately, there is a problem with 'special' characters.
Paul

On 2018-02-15 at 09:06, Bogumił Kamiński wrote:


#8
julia> ucfirst("écru")
"Écru"

Reading the manual would help.


#9

:+1:
https://docs.julialang.org/en/latest/base/strings/ for Julia 0.7 and https://docs.julialang.org/en/release-0.6/stdlib/strings/ for Julia 0.6.2
as API has changed a bit.
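
For the original question, the function can be broadcast over one column of the matrix. A rough sketch, assuming Julia 0.7 or later, where ucfirst is named uppercasefirst; the small matrix below is made-up stand-in data for the result of readdlm:

```julia
# Stand-in for the matrix returned by readdlm (hypothetical data)
slow = ["1" "écru"; "2" "łódź"; "3" "Julia"]

Slow = copy(slow)
# Capitalize the first character of every entry in column 2;
# no manual byte indexing is involved, so multi-byte characters are safe.
Slow[:, 2] = uppercasefirst.(slow[:, 2])

println(Slow[:, 2])
```

On Julia 0.6 the same line would use ucfirst instead of uppercasefirst.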


#10

Although there is a built-in function ucfirst for this, it is also useful to learn how to do it yourself. A good way to do this is to look at the Julia source code for ucfirst. (The “built-in” Julia functions are, for the most part, just plain Julia code that is no better than code that you could write yourself, except that it is written by experienced Julia programmers.)

To convert the first character of a string to uppercase, it essentially just does string(uppercase(s[1]), SubString(s, nextind(s,1))). Note in particular the use of SubString to create a view of the rest of the string without making a copy, and nextind to get the next valid index of the string.

The ucfirst implementation has a couple of additional wrinkles: it checks for the case of an empty string (in which there is no s[1]), it avoids making a new string when the first character is already upper case, and it uses titlecase (for Unicode titlecase) which is a slightly more correct thing to do if you want to capitalize the first letter of a word in Unicode.
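
Put together, the steps described above might look roughly like the following. This is a sketch written from that description, not a copy of the Base source; the name capitalize_first is made up, and Julia 0.7 or later is assumed:

```julia
# Hypothetical helper that mirrors the steps described above.
function capitalize_first(s::AbstractString)
    isempty(s) && return s      # an empty string has no s[1]
    c = s[1]                    # index 1 is always valid in a non-empty string
    t = titlecase(c)            # Unicode titlecase of the first character
    c == t && return s          # already capitalized: no new string needed
    # nextind gives the index of the second character, however many bytes
    # the first one occupies; SubString is a view, not a copy.
    return string(t, SubString(s, nextind(s, 1)))
end

println(capitalize_first("écru"))
```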

For more complicated string transformations, typically you will use an IOBuffer to build up the string piece-by-piece. See, for example, the implementation of the titlecase function for strings.
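
As an illustration of the IOBuffer pattern, here is a small sketch that capitalizes the first letter of every whitespace-separated word. It is not the Base titlecase implementation; capitalize_words is a made-up name and Julia 0.7 or later is assumed:

```julia
# Hypothetical helper: build the result piece by piece in an IOBuffer.
function capitalize_words(s::AbstractString)
    buf = IOBuffer()
    start_of_word = true
    for c in s                        # iterates characters, so no invalid index can occur
        print(buf, start_of_word ? titlecase(c) : c)
        start_of_word = isspace(c)    # the next character starts a word after whitespace
    end
    return String(take!(buf))         # turn the accumulated bytes into a String
end

println(capitalize_words("écru ärger x"))
```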


#11

Big thanks for the great lesson! :wink:
Paul

On 2018-02-15 at 14:26, Steven G. Johnson wrote:


#12

Hi guys,
With this in mind, I have run into a similar problem when implementing character n-grams: one needs to access the index of the next letter in the string within a for loop.

Here’s an example of the problem:

s = "erzählen"
for (i, l) in enumerate(s)
    print(l," ")
    print(s[i+1]," ") # Here happens the error for the ¨ character
    println(s[i])
end

The problem is that I might have an arbitrary number of contiguous invalid indices, an arbitrary number of times. How, then, could I access the next character in the string if not with a for loop?

The code for the edit distance is the following:

function ngram(str,n)
    isempty(str) && return []
    l = []
    max_ind = length(str[1:end-n])
    for p in 1:max_ind
        push!(l,str[p:p+n]) # Here happens the error when p+n is the index of the ¨ character
    end
    return l
end

There might be a way to do it easily, but I can’t find it. What do you guys think?


#13

Use eachindex to get the set of valid indices in a string, and nextind to get the next valid index after a given one. See the manual on string indexing.

Something like the following seems to be what you want:

s = "erzählen"
for i in eachindex(s)
    i == endof(s) && break
    print(s[i]," ")
    print(s[nextind(s,i)]," ")
    println(s[i])
end

or, somewhat more efficiently:

i, j, e = 1, nextind(s,1), endof(s)
while j ≤ e
    ci, cj = s[i], s[j]
    println(ci, ' ', cj, ' ', ci)
    i = j
    j = nextind(s, j)
end

and for ngram:

function ngram(str,n)
    l = String[]  # don't use [] since that is an untyped Any[] array
    i, j, e = 1, chr2ind(str, n), endof(str)  # start index, index of the n-th character, last valid index
    while j ≤ e
        push!(l, str[i:j])
        i = nextind(str, i)
        j = nextind(str, j)
    end
    return l
end

Note that you can instead use a list l = SubString{String}[] with SubString(str, i, j) instead of str[i:j], to avoid copying the substrings.
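
A sketch of that SubString variant, written for Julia 0.7 or later, where nextind(str, 0, n) plays the role of chr2ind(str, n) and lastindex replaces endof. The name ngram_views is made up, and n ≥ 1 is assumed:

```julia
# Hypothetical variant of ngram that returns views into str instead of copies.
function ngram_views(str::String, n::Integer)
    l = SubString{String}[]
    length(str) < n && return l        # fewer than n characters: no n-grams
    i = 1                              # index of the first character of the window
    j = nextind(str, 0, n)             # index of the n-th character
    e = lastindex(str)                 # last valid index (not the character count)
    while j <= e
        push!(l, SubString(str, i, j)) # view of characters i through j
        i = nextind(str, i)            # slide both ends one character forward
        j = nextind(str, j)
    end
    return l
end

println(ngram_views("erzählen", 3))
```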


#14

Note also that length is the wrong thing here: that returns the number of characters in the string, which is not the same thing as the last index.

Also, if you are dealing with Unicode strings, you might want to normalize them before doing comparisons, via normalize_string(s, :NFC).
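
A short sketch of why normalization matters, assuming Julia 0.7 or later, where the function lives in the Unicode standard library as Unicode.normalize (on 0.6 it is normalize_string):

```julia
using Unicode

composed   = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"    # 'e' followed by a combining acute accent

# The two strings render identically but differ in characters and bytes.
same_raw = composed == decomposed
same_nfc = Unicode.normalize(decomposed, :NFC) == composed

println(same_raw)            # false
println(same_nfc)            # true
println(length(decomposed))  # 2
```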

(To compute edit distance, I’m skeptical that materializing an array of ngrams will lead to the most efficient approach, as opposed to directly comparing the contents of two strings. There are lots of articles on dynamic-programming algorithms for this that could easily be ported to Julia, for example.)
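
For illustration, one such dynamic-programming approach is the classic two-row Levenshtein algorithm. The sketch below is only one possible port, with a made-up function name; it collects the second string into a Vector{Char} for simplicity and assumes both inputs are already normalized:

```julia
# Hypothetical helper: Levenshtein distance counted in characters.
function edit_distance(a::AbstractString, b::AbstractString)
    bchars = collect(b)                  # Vector{Char}, indexable by character position
    prev = collect(0:length(bchars))     # distances from the empty prefix of a
    cur = similar(prev)
    for (i, ca) in enumerate(a)          # i counts characters; it is never used as a string index
        cur[1] = i
        for (j, cb) in enumerate(bchars)
            cost = ca == cb ? 0 : 1
            cur[j+1] = min(prev[j+1] + 1,    # deletion
                           cur[j] + 1,       # insertion
                           prev[j] + cost)   # substitution or match
        end
        prev, cur = cur, prev            # reuse the two rows
    end
    return prev[end]
end

println(edit_distance("erzählen", "erzahlen"))
```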


#15

Thank you so much!
Yes, I was working on edit distance and n-grams at the same time and got confused. This looks like a lot of work in exchange for character-level granularity with UTF-8. I also see that in v0.7 we'll have Unicode.normalize instead of normalize_string, and nextind(str, i, n), which advances n characters past index i.
That should make this solution possible:

function ngram(str,n)
    str = Unicode.normalize(str, :NFC)
    isempty(str) && return []
    return [str[ind2char(str, p):nextind(str,p,n)] for p in 1:lastindex(str)]
end

Your solution with a while loop feels somewhat strange, but it's amazing! I couldn't have thought of it myself, thank you.


#16

That solution is O(length^2), whereas mine is O(length). The reason is that skipping to the k-th index is O(k). (You could do the same thing in 0.6 with chr2ind, BTW.)

You really want to think about string processing in terms of consecutive iteration through the string. And allocating arrays of substrings is rarely going to be competitive with analyzing the string data in place.
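
As a small illustration of consecutive iteration, adjacent character pairs can be produced in a single pass without computing any index. A sketch assuming Julia 0.7 or later; the variable name bigrams is only for illustration:

```julia
s = "erzählen"

# Pair each character with its successor by iterating the string alongside
# a copy of itself that skips the first character. No indexing is involved,
# so multi-byte characters need no special handling.
bigrams = collect(zip(s, Iterators.drop(s, 1)))

for (a, b) in bigrams
    println(a, ' ', b)
end
```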


#17

Your ind2char call is totally wrong, by the way. That doesn’t produce a valid string index in general.
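
To make the distinction concrete, here is a sketch of converting between character numbers and string indices in Julia 0.7 or later, where nextind(s, 0, k) and length(s, 1, i) take over from the 0.6 functions chr2ind and ind2chr:

```julia
s = "erzählen"

k = 6                            # the 6th character is 'l'
idx = nextind(s, 0, k)           # string index of the k-th character
back = length(s, 1, idx)         # character number of the character at idx

println(idx)                     # 7
println(s[idx])                  # l
println(back)                    # 6
println(isvalid(s, 5))           # false: index 5 is the second byte of 'ä'
```

A character number is only a valid string index by coincidence, as long as every preceding character is a single byte.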