I found first(str), last(str), and chop(str), but couldn’t find anything for getting a substring where multibyte unicode characters are involved. Something like:
Yeah, been there. I just thought a substring function would be useful out of the box, with documentation about how it differs from the string[start:stop] form in terms of Unicode and performance.
Your implementation is interesting, but a little inefficient. Compare:
substring(s,n) = join([s[c] for (i,c) in enumerate(eachindex(s)) if i ∈ n])
substring(str, start, stop) = str[nextind(str, 0, start):nextind(str, 0, stop)]
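To make the comparison concrete, here is a small sketch that runs both versions on a string of two-byte characters. The names substring_slow and substring_fast are just labels I made up to tell the two definitions apart.

```julia
# Comprehension-based version: visits every index and allocates a Vector{Char}.
substring_slow(s, r) = join([s[c] for (i, c) in enumerate(eachindex(s)) if i ∈ r])

# nextind-based version: finds the two code unit indices, then slices once.
substring_fast(s, start, stop) = s[nextind(s, 0, start):nextind(s, 0, stop)]

s = "αβγδε"                      # each Greek letter is 2 code units in UTF-8

a = substring_slow(s, 2:4)       # characters 2 through 4
b = substring_fast(s, 2, 4)      # same characters, without the intermediate array

println(a, " ", b)
# Note: s[2:4] would throw a StringIndexError here, because 2 is not a valid
# index into s (it falls in the middle of 'α').
```

Both return the same text; the second one just avoids building the temporary array.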
Slicing a[m:n] always makes a copy in Julia (at least, with the built-in types), whether for arrays or strings. If you want to use a view instead (i.e. create a SubString object), the easiest way is to use @views on a block of code.
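A minimal sketch of the difference between the copying slice and the view (the variable names are arbitrary):

```julia
s = "hello, world"

c = s[1:5]                 # plain slice: allocates a new String
v = @views s[1:5]          # same indices, but a SubString that shares s's data
w = SubString(s, 1, 5)     # explicit constructor, equivalent to the line above

println(typeof(c))         # String
println(typeof(v))         # SubString{String}
println(c == v == w)       # contents compare equal
```

@views can also wrap a whole function body or begin ... end block, in which case every slice expression inside it becomes a view.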
The real question is, where are you getting these character indices that you want to pass to your substring function? Usually you get indices to a substring from some previous iteration over the string, either from your own loop or from something like a findnext call, and these give you codeunit indices that you can pass to s[m:n] directly.
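As a rough illustration of that workflow (the example string is made up), the indices returned by the search functions can be fed straight back into s[m:n], even with multibyte characters around:

```julia
s = "naïve café: crème"

# findfirst with a string pattern returns a range of code unit indices.
r = findfirst("café", s)
word = s[r]                      # valid slice, no character counting needed

# findnext with a Char returns a single code unit index.
i = findnext(':', s, 1)
before = s[1:prevind(s, i)]      # everything before the colon
after  = s[nextind(s, i):end]    # everything after the colon

println(word)
println(before)
println(after)
```

prevind and nextind step to the neighbouring valid index, so you never have to know how many bytes a character occupies.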
If you are counting codepoints as “characters”, e.g. you want the “first 3 characters” in a string, then the odds are high that you are making a mistake. For example, "ü" in its decomposed form is two codepoints (length("ü") == 2) because it is u followed by the combining character U+0308. See also this explanation: Myth: Counting coded characters or code points is important.
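Here is a short sketch of that pitfall using the Unicode standard library. The string is written with an explicit escape so that the decomposed form survives copy and paste.

```julia
using Unicode

decomposed = "u\u0308"                           # 'u' followed by combining diaeresis
composed   = Unicode.normalize(decomposed, :NFC) # single precomposed codepoint U+00FC

println(length(decomposed))                      # number of codepoints: 2
println(length(composed))                        # number of codepoints: 1
println(length(graphemes(decomposed)))           # user-perceived characters: 1

# Taking the "first character" by codepoint silently drops the diaeresis.
println(first(decomposed, 1))
```

So the same visible text can have different codepoint counts depending on normalization, which is why codepoint-based substrings are fragile.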
Because of Unicode’s complexity, wanting a substring from the m-th codepoint (“character”) to the n-th codepoint, as opposed to between two string indices (= code units), is actually an extremely uncommon operation (in non-buggy code). This is why it’s not built-in.
It’s fun to write macros like this, but I would add a warning that probably 99% of the time people do character indexing they are making a mistake in their Unicode handling.
The solution in my blog does this. What @stevengj says, if I understand him correctly, is that doing character indexing is not a safe practice in general.
All of these arguments about byte indexing apply equally to first, last, and chop, which are part of the standard library and index on Unicode codepoints.
From the Julia source:
first(s::AbstractString, n::Integer) = @inbounds s[1:min(end, nextind(s, 0, n))]
last(s::AbstractString, n::Integer) = @inbounds s[max(1, prevind(s, ncodeunits(s)+1, n)):end]
function chop(s::AbstractString; head::Integer = 0, tail::Integer = 1)
    if isempty(s)
        return SubString(s)
    end
    SubString(s, nextind(s, firstindex(s), head), prevind(s, lastindex(s), tail))
end
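For reference, a quick sketch of how these three behave on a string of two-byte characters (the example string is arbitrary); all of them count codepoints, not code units:

```julia
s = "αβγδε"                       # 5 codepoints, 10 code units

f = first(s, 2)                   # first two codepoints
l = last(s, 2)                    # last two codepoints
c = chop(s)                       # drops one codepoint from the end (tail = 1)
m = chop(s, head = 1, tail = 2)   # drops one from the front, two from the end

println((f, l, c, m))
println((length(s), ncodeunits(s)))
```

Note that chop returns a SubString rather than a new String, as the source above shows.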
Guess I’ll just have to start my own util package like you do in Java
first and last are defined for any iterator. Since string iteration is over codepoints, they have to be consistent, but I agree that they need to be used with care.
As for chop, as far as I can tell it’s used to chop off a known suffix (usually an ASCII suffix so there are no issues with Unicode normalization), like a file extension, which is safe enough. (However, starting in Julia 1.8 it will often be better to use the new chopprefix and chopsuffix functions, which only remove the prefix/suffix if it is present and which may be more efficient because they can avoid decoding the UTF-8.)
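A small sketch of the difference, assuming Julia 1.8 or later (the file names are made up):

```julia
name  = "report.tar.gz"
other = "notes.txt"

# chop removes a fixed number of characters, whatever they are.
a = chop(name, tail = 3)          # strips ".gz"
b = chop(other, tail = 3)         # strips "txt", probably not what was meant

# chopsuffix / chopprefix only remove the text if it is actually there.
c = chopsuffix(name, ".gz")       # suffix present, so it is removed
d = chopsuffix(other, ".gz")      # suffix absent, string returned unchanged
e = chopprefix("tmp_data", "tmp_")

println((a, b, c, d, e))
```

With chopsuffix there is no need to count characters at all, so the codepoint question does not come up.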