Substring function?

rogerkeays · February 18, 2022, 7:41am

I found first(str), last(str), and chop(str), but couldn’t find anything for getting a substring where multibyte unicode characters are involved. Something like:

substring(str, start, stop) = str[nextind(str, 0, start):nextind(str, 0, stop)]

Not sure if anyone else found this surprising. I suppose you could use chop, but my guess is substring is more common.

rafael.guerra · February 18, 2022, 7:58am

You may wanna check this related thread.

rogerkeays · February 18, 2022, 8:04am

Yeh, been there. Just thought a substring function would be useful out of the box, with documentation about how it differs from the string[start:stop] form in terms of unicode and performance.

fredrikekre · February 18, 2022, 8:21am

You can use the SubString constructor directly.

rafael.guerra · February 18, 2022, 8:37am

Which btw, is in one of Steve’s responses in the thread linked - here.

rogerkeays · February 18, 2022, 8:39am

Unless I’m reading the docs incorrectly, SubString uses byte indexes.
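
A quick way to check that reading, as a minimal sketch: SubString(s, i, j) takes code unit (byte) indices, and both ends have to land on the start of a character. The charsub helper below is a hypothetical name, not something in Base; it just converts character positions with nextind first.

```julia
s = "αβγ"                      # each of these characters takes 2 bytes in UTF-8

# Byte indices 1 and 3 are the starts of 'α' and 'β'.
a = SubString(s, 1, 3)         # "αβ"

# Byte index 2 falls in the middle of 'α', so the constructor throws.
err = try
    SubString(s, 1, 2)
catch e
    e
end

# Hypothetical helper (not in Base): character positions -> byte indices.
charsub(str, start, stop) =
    SubString(str, nextind(str, 0, start), nextind(str, 0, stop))

b = charsub(s, 2, 3)           # "βγ", still a view into s
```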

rafael.guerra · February 18, 2022, 9:05am

Are you looking for a built-in function that outputs as in example below?

substring(s,n) = join([s[c] for (i,c) in enumerate(eachindex(s)) if i ∈ n])

Results

s = "αβγ"
substring(s,1)      # "α"
substring(s,2)      # "β"
substring(s,3)      # "γ"
substring(s,1:2)    # "αβ"

rogerkeays · February 18, 2022, 9:42am

Your implementation is interesting, but a little inefficient. Compare:

substring(s,n) = join([s[c] for (i,c) in enumerate(eachindex(s)) if i ∈ n])
substring(str, start, stop) = str[nextind(str, 0, start):nextind(str, 0, stop)]

and after they’ve both been warmed up…

julia> s = "αβγł€đŧŧŋ"
"αβγł€đŧŧŋ"
julia> @time substring(s,1:5)
  0.000025 seconds (7 allocations: 400 bytes)
"αβγł€"

julia> @time substring(s,1,5)
  0.000007 seconds (1 allocation: 32 bytes)
"αβγł€"

I’m mostly concerned about memory usage here. Still, pretty cool.

rafael.guerra · February 18, 2022, 9:55am

You can get 0-allocations by using a view:

substring(str, start, stop) = view(str, nextind(str, 0, start):nextind(str, 0, stop))

stevengj · February 18, 2022, 2:42pm

rogerkeays: “Just thought a substring function would be useful out of the box, with documentation about how it differs from the string[start:stop] form in terms of unicode and performance.”

Slicing a[m:n] always makes a copy in Julia (at least, with the built-in types), whether for arrays or strings. If you want to use a view (i.e. create a SubString object), the easiest way is to use @views on a block of code, e.g.

julia> s = "αβγł€đŧŧŋ"
"αβγł€đŧŧŋ"

julia> @views s[1:5]
"αβγ"

julia> typeof(ans)
SubString{String}

Note that 1:5 here are code unit (byte) indices, not character counts, which is why the result is "αβγ" rather than the first five characters.

The real question is, where are you getting these character indices that you want to pass to your substring function? Usually you get indices to a substring from some previous iteration over the string, either from your own loop or from something like a findnext call, and these give you codeunit indices that you can pass to s[m:n] directly.
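
As a rough sketch of that workflow (the sample string and the variable names are made up for illustration), a search returns code unit indices, and prevind/nextind step to neighbouring valid indices without any character counting:

```julia
s = "αβγ = łđŧ"

# findfirst with a string pattern returns a range of code unit indices.
r = findfirst(" = ", s)

# Step to the valid index just before and just after the match.
key   = s[1:prevind(s, first(r))]
value = s[nextind(s, last(r)):end]
```

Nothing here depends on how many bytes each character occupies, so the same code works for ASCII and non-ASCII input.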

If you are counting codepoints as “characters”, e.g. you want the “first 3 characters” in a string, then the odds are high that you are making a mistake. For example, "ü" can be two codepoints (length("u\u0308") == 2) when it is encoded as u followed by the combining character U+0308. See also this explanation: Myth: Counting coded characters or code points is important.
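
A small sketch of the difference, using the Unicode standard library (the variable names are just for illustration):

```julia
using Unicode

decomposed  = "u\u0308"                            # 'u' + combining diaeresis
precomposed = Unicode.normalize(decomposed, :NFC)  # single codepoint U+00FC

n_decomposed  = length(decomposed)             # 2 codepoints
n_precomposed = length(precomposed)            # 1 codepoint
n_graphemes   = length(graphemes(decomposed))  # 1 user-perceived character
same          = decomposed == precomposed      # false: different codepoints
```

Both strings render as the same letter, yet "the first character" by codepoint count is different in each.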

Because of Unicode’s complexity, wanting a substring from the m-th codepoint (“character”) to the n-th codepoint, as opposed to between two string indices (= code units), is actually an extremely uncommon operation (in non-buggy code). This is why it’s not built-in.

bkamins · February 18, 2022, 3:41pm

Maybe you will find this useful: Subsetting strings in Julia using character indexing | Blog by Bogumił Kamiński

stevengj · February 18, 2022, 3:51pm

It’s fun to write macros like this, but I would add a warning that probably 99% of the time people do character indexing they are making a mistake in their Unicode handling.

rafael.guerra · February 18, 2022, 4:01pm

So at the end of the races, what would be a simple function to perform character indexing of a unicode string?

Say a string like: s = "αβüγ", where I see 4 characters, but I am not sure anymore!

bkamins · February 18, 2022, 4:12pm

The solution in my blog does this. What @stevengj says, if I understand him correctly, is that doing character indexing is not a safe practice in general.

rafael.guerra · February 18, 2022, 4:31pm

Thanks Bogumil, but I only saw a macro. Is there a function too?

Regarding “doing character indexing is not a safe practice in general”:

Neither is smoking, but there are 1 billion people who have chosen to do so…

bkamins · February 18, 2022, 4:42pm

You can write a similar version as a function. I used a macro because it can then take advantage of indexing syntax.

StefanKarpinski · February 18, 2022, 4:51pm

It depends on what you mean by character! Do you mean code points? Or do you mean grapheme clusters?

rafael.guerra · February 18, 2022, 5:11pm

Honestly, I have no clue and had to search.
I guess it is graphemes (user-perceived characters in unicode) as per solution posted here.
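
If graphemes are what is wanted, one possible sketch looks like this. grapheme_substring is a hypothetical helper, not a Base or Unicode function, and it assumes 1 <= m. It walks the string from the start, so it is O(n) per call.

```julia
using Unicode

# Hypothetical helper (not in Base): the m-th through n-th grapheme clusters.
function grapheme_substring(s::AbstractString, m::Integer, n::Integer)
    count = max(n - m + 1, 0)                  # empty result when n < m
    gs = Iterators.take(Iterators.drop(graphemes(s), m - 1), count)
    return join(gs)
end

s = "αβu\u0308γ"                     # 5 codepoints, 4 graphemes
part = grapheme_substring(s, 2, 3)   # 'β' plus the decomposed 'ü'
```

With this input, codepoint-based slicing of positions 2 to 3 would cut between the u and its combining mark, whereas the grapheme version keeps them together.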

rogerkeays · February 19, 2022, 4:48am

All of these arguments against codepoint indexing apply equally to first, last, and chop, which are part of the standard library and index on unicode codepoints.

From the julia source:

first(s::AbstractString, n::Integer) = @inbounds s[1:min(end, nextind(s, 0, n))]
last(s::AbstractString, n::Integer) = @inbounds s[max(1, prevind(s, ncodeunits(s)+1, n)):end]
function chop(s::AbstractString; head::Integer = 0, tail::Integer = 1)
    if isempty(s)
        return SubString(s)
    end
    SubString(s, nextind(s, firstindex(s), head), prevind(s, lastindex(s), tail))
end

Guess I’ll just have to start my own util package like you do in Java

stevengj · February 19, 2022, 4:59am

first and last are defined for any iterator. Since string iteration is over codepoints, they have to be consistent, but I agree that they need to be used with care.
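
A minimal sketch of that consistency (the sample string is arbitrary): first(s, n) and last(s, n) count the same units that iterating over s produces, which are codepoints rather than bytes.

```julia
s = "αβγł€"

head      = first(s, 2)                  # "αβ"
from_iter = join(Iterators.take(s, 2))   # same two codepoints via iteration
tail      = last(s, 2)                   # "ł€"

n_chars = length(s)       # 5 codepoints
n_bytes = ncodeunits(s)   # 11 bytes in UTF-8
```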

As for chop, as far as I can tell it’s used to chop off a known suffix (usually an ASCII suffix so there are no issues with Unicode normalization), like a file extension, which is safe enough. (However, starting in Julia 1.8 it will often be better to use the new chopprefix and chopsuffix functions, which only remove the prefix/suffix if it is present and which may be more efficient because they can avoid decoding the UTF-8.)
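
A short sketch of the difference, assuming Julia 1.8 or later for chopprefix and chopsuffix (the file name is made up):

```julia
# Requires Julia 1.8+ for chopprefix/chopsuffix.
name = "r\u00e9sum\u00e9.txt"               # "résumé.txt"

by_count   = chop(name; tail = 4)           # drops the last 4 characters
by_suffix  = chopsuffix(name, ".txt")       # dropped only because it is present
unchanged  = chopsuffix(name, ".csv")       # no such suffix, nothing removed
by_prefix  = chopprefix(name, "r\u00e9s")   # prefix removed
```

chop(name; tail = 4) would also strip four characters from a name that does not end in ".txt", which is the case the suffix-aware functions guard against.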
