Substring function?

I found first(str), last(str), and chop(str), but couldn’t find anything for getting a substring where multibyte unicode characters are involved. Something like:

substring(str, start, stop) = str[nextind(str, 0, start):nextind(str, 0, stop)]

Not sure if anyone else found this surprising. I suppose you could use chop, but my guess is substring is more common.

You may wanna check this related thread.

Yeh, been there. Just thought a substring function would be useful out of the box, with documentation about how it differs from the string[start:stop] form in terms of unicode and performance.

You can use the SubString constructor directly.

Which btw, is in one of Steve’s responses in the thread linked - here.

Unless I’m reading the docs incorrectly, SubString uses byte indexes.
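For anyone following along, here is a small sketch of what byte (code unit) indexing means for the SubString constructor. The sample string is only an illustration; every character in it occupies more than one byte.

```julia
s = "αβγł€"

# SubString takes code unit (byte) indices, not character counts.
# α, β and γ are two bytes each, so they start at indices 1, 3 and 5.
sub = SubString(s, 1, 5)      # a view of "αβγ", not the first five characters

# Index 2 falls in the middle of α, so it is not a valid string index.
isvalid(s, 1)                 # true
isvalid(s, 2)                 # false
# SubString(s, 1, 2) would therefore throw a StringIndexError.
```

So the constructor avoids the copy, but the indices still have to come from somewhere that knows about character boundaries, such as nextind.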

Are you looking for a built-in function that outputs as in example below?

substring(s,n) = join([s[c] for (i,c) in enumerate(eachindex(s)) if i ∈ n])
Results:
s = "αβγ"
substring(s,1)      # "α"
substring(s,2)      # "β"
substring(s,3)      # "γ"
substring(s,1:2)    # "αβ"

Your implementation is interesting, but a little inefficient. Compare:

substring(s,n) = join([s[c] for (i,c) in enumerate(eachindex(s)) if i ∈ n])
substring(str, start, stop) = str[nextind(str, 0, start):nextind(str, 0, stop)]

and after they’ve both been warmed up…

julia> s = "αβγł€đŧŧŋ"
"αβγł€đŧŧŋ"
julia> @time substring(s,1:5)
  0.000025 seconds (7 allocations: 400 bytes)
"αβγł€"

julia> @time substring(s,1,5)
  0.000007 seconds (1 allocation: 32 bytes)
"αβγł€"

I’m mostly concerned about memory usage here. Still, pretty cool.


You can get 0-allocations by using a view:

substring(str, start, stop) = view(str, nextind(str, 0, start):nextind(str, 0, stop))

Just thought a substring function would be useful out of the box, with documentation about how it differs from the string[start:stop] form in terms of unicode and performance.

Slicing a[m:n] always makes a copy in Julia (at least, with the built-in types), whether for arrays or strings. If you want to use a view (i.e. create a SubString object), the easiest way is to use @views on a block of code, e.g.

julia> s = "αβγł€đŧŧŋ"
"αβγł€đŧŧŋ"

julia> @views s[1:5]
"αβγ"

julia> typeof(ans)
SubString{String}

Slicing with @views works just fine for this.

The real question is, where are you getting these character indices that you want to pass to your substring function? Usually you get indices to a substring from some previous iteration over the string, either from your own loop or from something like a findnext call, and these give you codeunit indices that you can pass to s[m:n] directly.
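A minimal sketch of that workflow, assuming the same sample string as earlier in the thread: the search functions return code unit indices, which can be fed straight into slicing without any character counting.

```julia
s = "αβγł€đŧŧŋ"

# findfirst on a substring returns a range of code unit indices.
r = findfirst("γł", s)
match = s[r]                              # "γł"

# Continue searching after the match; nextind steps to the next valid index.
i = findnext('ŧ', s, nextind(s, last(r)))
tail = s[first(r):i]                      # from γ through the first ŧ
```

At no point does the code need to know that γ is "the third character".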

If you are counting codepoints as “characters”, e.g. you want the “first 3 characters” in a string, then the odds are high that you are making a mistake. For example, "ü" is two codepoints (length("ü") == 2) because it is u followed by a combining character U+0308. See also this explanation: Myth: Counting coded characters or code points is important.
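To make the codepoint issue concrete, here is a short sketch using the Unicode standard library. The strings are written with explicit escapes so that it is unambiguous which form is meant.

```julia
using Unicode

decomposed = "u\u0308"                            # 'u' followed by combining diaeresis U+0308
composed   = Unicode.normalize(decomposed, :NFC)  # the single code point U+00FC

length(decomposed)              # 2 code points
length(composed)                # 1 code point
length(graphemes(decomposed))   # 1 user-perceived character

# Taking the "first character" by code point splits the letter from its accent.
first(decomposed, 1)            # "u"
```

Both strings render the same way, yet code point counting gives different answers for them.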

Because of Unicode’s complexity, wanting a substring from the m-th codepoint (“character”) to the n-th codepoint, as opposed to between two string indices (= code units), is actually an extremely uncommon operation (in non-buggy code). This is why it’s not built-in.


Maybe you will find this useful: "Subsetting strings in Julia using character indexing" (blog post by Bogumił Kamiński).


It’s fun to write macros like this, but I would add a warning that probably 99% of the time people do character indexing they are making a mistake in their Unicode handling.


So at the end of the races, what would be a simple function to perform character indexing of a unicode string?

Say a string like: s = "αβüγ", where I see 4 characters, but I am not sure anymore!

The solution in my blog does this. What @stevengj says, if I understand him correctly, is that doing character indexing is not a safe practice in general.

Thanks Bogumil, but I only saw a macro. Is there a function too?

Regarding:

Smoking neither, but there are 1 billion people who have chosen to do so…


You can write a similar version as function. I used macro as it then can take advantage of indexing syntax.
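As a rough illustration of what a function version might look like (this is not the code from the blog post; the name charindex is made up here), one could dispatch on the index type and count code points with nextind:

```julia
# Hypothetical helper: index a string by code point position instead of byte index.
# A single position returns a Char; a range returns a SubString view.
charindex(s::AbstractString, i::Integer) = s[nextind(s, 0, i)]
charindex(s::AbstractString, r::AbstractUnitRange) =
    SubString(s, nextind(s, 0, first(r)), nextind(s, 0, last(r)))

s = "αβγł€"
charindex(s, 2)       # 'β'
charindex(s, 2:4)     # "βγł"
```

The caveats raised above still apply: this counts code points, so combining characters are treated as separate positions.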


It depends on what you mean by character! Do you mean code points? Or do you mean grapheme clusters?


Honestly, I have no clue and had to search.
I guess it is graphemes (user-perceived characters in unicode) as per solution posted here.
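If graphemes are what is wanted, a simple (and not particularly efficient) sketch could use graphemes from the Unicode standard library. The helper name graphemesub is hypothetical, and the sample string spells the ü as u plus a combining diaeresis on purpose.

```julia
using Unicode

# Hypothetical helper: select user-perceived characters (grapheme clusters) by position.
# It collects all graphemes first, so it favours clarity over speed.
graphemesub(s::AbstractString, r::AbstractUnitRange) = join(collect(graphemes(s))[r])

s = "αβu\u0308γ"          # displays as four characters

length(s)                 # 5 code points
length(graphemes(s))      # 4 graphemes
graphemesub(s, 3:4)       # the ü (both code points) followed by γ
```

As far as I know, Julia 1.9 and later also accept a range directly, as in graphemes(s, 3:4), which avoids the intermediate vector.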

All of these arguments about byte indexing apply equally to first, last, and chop, which are part of the standard library and index on unicode codepoints.

From the julia source:

first(s::AbstractString, n::Integer) = @inbounds s[1:min(end, nextind(s, 0, n))]
last(s::AbstractString, n::Integer) = @inbounds s[max(1, prevind(s, ncodeunits(s)+1, n)):end]
function chop(s::AbstractString; head::Integer = 0, tail::Integer = 1)
    if isempty(s)
        return SubString(s)
    end
    SubString(s, nextind(s, firstindex(s), head), prevind(s, lastindex(s), tail))
end

Guess I’ll just have to start my own util package like you do in Java :rofl:

first and last are defined for any iterator. Since string iteration is over codepoints, they have to be consistent, but I agree that they need to be used with care.
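A quick sketch of that consistency, using an arbitrary sample string: first(s, n) agrees with taking n items from the string's iterator, and each item is one code point.

```julia
s = "αβγł€"

# Iterating a string yields Chars (code points), and first/last count the same units.
from_iter = join(Iterators.take(s, 2))
first(s, 2) == from_iter      # true

first(s)                      # 'α', a Char, as for any iterator
last(s, 2)                    # "ł€"
```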

As for chop, as far as I can tell it’s used to chop off a known suffix (usually an ASCII suffix so there are no issues with Unicode normalization), like a file extension, which is safe enough. (However, starting in Julia 1.8 it will often be better to use the new chopprefix and chopsuffix functions, which only remove the prefix/suffix if it is present and which may be more efficient because they can avoid decoding the UTF-8.)
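For comparison, a small sketch of the difference (requires Julia 1.8 or later for chopprefix and chopsuffix; the file name is just an example):

```julia
name = "report.csv"

# chop removes a fixed number of characters, whatever they happen to be.
chop(name, tail = 4)          # "report"

# chopsuffix only removes the suffix when it is actually there.
chopsuffix(name, ".csv")      # "report"
chopsuffix(name, ".txt")      # "report.csv", left unchanged

# chopprefix works the same way at the start of the string.
chopprefix(name, "rep")       # "ort.csv"
```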
