Truncate String

Hi! I’m having trouble with a simple problem: truncating strings to, say, 10 characters. Since strings are not indexed by character position, solutions like s[1:10] (as in Python) or s[1:min(length(s), 10)] don’t work. They throw an indexing exception, as explained in the docs. I couldn’t find a solution in the documentation either, so how do I truncate strings properly?

You can use first(str, n) for this:

julia> first("aαbβ", 3)
"aαb"

julia> first("aαbβ", 4)
"aαbβ"

julia> first("aαbβ", 5)
"aαbβ"

Internally this is implemented in terms of nextind, so you can see how it would be possible to do similar things yourself:

first(s::AbstractString, n::Integer) = s[1:min(end, nextind(s, 0, n))]
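
To make the role of nextind concrete, here is a minimal sketch of the same idea written out as a standalone function. The name truncate_chars is made up for illustration and is not part of Base; the sketch assumes Julia 1.x string semantics.

```julia
# Hypothetical helper (not part of Base): keep at most n characters of s.
function truncate_chars(s::AbstractString, n::Integer)
    n <= 0 && return ""
    # nextind(s, 0, n) is the code-unit index of the n-th character;
    # it runs past lastindex(s) when s has fewer than n characters.
    stop = min(lastindex(s), nextind(s, 0, n))
    return s[1:stop]
end

s = "aαbβ"
@assert truncate_chars(s, 3) == first(s, 3)   # both give "aαb"
@assert truncate_chars(s, 5) == s             # too short: whole string is returned
```

For everyday use, first(s, n) is the simpler choice; the sketch only spells out what happens underneath.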

Replace length(s) with end.

If that’s a fix for my simple solution, it doesn’t work - s[1:min(end, 10)] still throws an error.

Thank you, it works! Does this mean that the easiest way to get a general slice of a string from the nth to the mth character is the rather verbose first(last(s, length(s) - n), m - n)? Direct indexing is even worse: s[max(1, prevind(s, ncodeunits(s)+1, n)):min(end, nextind(s, 0, n))]. So there is no shortcut as compact as s[n:m].

You might need nextind and prevind when dealing with Unicode strings.

julia> a = "aαsdfβ"
"aαsdfβ"

julia> sizeof(a)
8

julia> nextind(a, 1)
2

julia> nextind(a, 2)
4

julia> nextind(a, 4)
5

julia> nextind(a, 5)
6

julia> nextind(a, 6)
7

julia> nextind(a, 7)
9

julia> nextind(a, 9)
ERROR: BoundsError: attempt to access "aαsdfβ"
  at index [9]

julia> prevind(a, 9)
7

julia> prevind(a, 7)
6

julia> prevind(a, 6)
5

julia> prevind(a, 5)
4

julia> prevind(a, 4)
2

julia> prevind(a, 2)
1

julia> prevind(a, 1)
0

julia> prevind(a, 0)
ERROR: BoundsError: attempt to access "aαsdfβ"

It works if 10 is a valid string index. If you want 10 characters you need to find the index of the tenth character.
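
A small sketch of what "valid index" means here. The example string is my own choice (12 Greek letters, each two code units in UTF-8), picked so that index 10 lands in the middle of a character.

```julia
s = "αβγδεζηθικλμ"          # 12 characters, 24 code units

# Index 10 is the second code unit of the 5th character, so it is not valid.
@assert !isvalid(s, 10)

# s[1:10] therefore throws instead of returning 10 characters.
threw = try
    s[1:10]
    false
catch err
    err isa StringIndexError
end
@assert threw

# The 10th character starts at code-unit index nextind(s, 0, 10).
i = nextind(s, 0, 10)
@assert i == 19
@assert s[1:i] == first(s, 10)
```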

Sure, the index needs to be valid. As I understand it, the only way to get these indices is through nextind/prevind, and these functions are needed for basically any kind of string manipulation not covered by other library functions. The vast majority (I think) of string users think in terms of characters and not their byte representation, so this makes even simple slicing very verbose and error-prone. E.g. my code for slicing from a previous post turns out to be not completely correct; a better version is
slice(s, n, m) = first(last(s, max(0, length(s) - (n - 1))), m - (n - 1))
Note the off-by-1 fix and the explicit max for strings shorter than n characters. And I’m still not sure if it covers all cases.
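
One way to gain some confidence is to compare the one-liner against an independent reference on a handful of inputs. The sketch below does that; slice_ref is a hypothetical name, built from Iterators.drop and Iterators.take, and the check only covers 1 <= n <= m, so it says nothing about other argument combinations.

```julia
# The one-liner from this post, copied as is.
slice(s, n, m) = first(last(s, max(0, length(s) - (n - 1))), m - (n - 1))

# Hypothetical reference: skip n-1 characters, take the next m-n+1, join them.
slice_ref(s, n, m) = join(Iterators.take(Iterators.drop(s, n - 1), m - n + 1))

# Compare both versions on short ASCII and non-ASCII strings.
for s in ("", "abc", "aαsdfβ"), n in 1:8, m in n:10
    @assert slice(s, n, m) == slice_ref(s, n, m)
end
```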

The string API is like this because indexing with character indices is not an O(1) operation for encodings where a character and a code unit don’t coincide (including UTF-8). This makes slicing a long string in terms of character indices a major performance trap. Hence the nextind style interface where people are encouraged to think in terms of iteration.
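
To illustrate the cost, here is a rough sketch of what finding the nth character involves. nth_char_index is a made-up helper, not a Base function; it walks the valid indices from the start, which is why the lookup takes time proportional to n rather than constant time.

```julia
# Hypothetical helper: code-unit index of the n-th character of s,
# or nothing if s has fewer than n characters (or n < 1).
function nth_char_index(s::AbstractString, n::Integer)
    # eachindex(s) yields only valid character start indices, in order.
    for (k, i) in enumerate(eachindex(s))
        k == n && return i
    end
    return nothing
end

a = "aαsdfβ"
@assert nth_char_index(a, 3) == nextind(a, 0, 3)   # same index, found by scanning
@assert nth_char_index(a, 7) === nothing           # only 6 characters
```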

We have several convenience functions first, last, chop, startswith, endswith which refer to the beginning and end of the string, but I can’t see a function for pulling a substring of N chars out of the middle, in the spirit of your slice. I do think there could be a convenience function for this; something like

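function slice(s, r::UnitRange)
    isempty(r) && return ""            # nextind rejects a negative count
    i = nextind(s, 0, first(r))        # index of character number first(r)
    i <= lastindex(s) || return ""     # range starts past the end of the string
    # index of the last requested character, clamped to the end of the string
    j = min(lastindex(s), nextind(s, i, length(r)-1))
    s[i:j]
end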

I’m not sure about the best name for this or the most convenient/robust method of handling errors. (For example, first just returns fewer chars than requested if the string is too short. This version of slice does the same.) It should also come with documentation noting that it’s probably not a good idea to use for long strings.

[edit - fixed a problem with indexing off the end]

slice(s, n, m) = s[nextind(s, 0, n):nextind(s, 0, m)]

Thanks again, the code and explanation of the API is definitely helpful!

With this implementation of slice we have:

s = "abcdef"
first(s, 10) == "abcdef"  # OK
slice(s, 1, 10)  # throws a BoundsError instead of giving the same result as the previous line

Sure, if you want to ignore out of bounds character indices, you’ll need some min/max calls.

slice(s, n, m) = s[max(1, nextind(s, 0, n)):min(end, nextind(s, 0, m))]

It seems debatable to me that this is the preferred behavior for a function intended to take slices, but if that’s what you want, that’s how you do it.
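
For reference, a quick check of how the clamped version behaves at the edges. The test string is the one used earlier in the thread; the specific index pairs are my own picks.

```julia
# The clamped one-liner from the post above, copied as is.
slice(s, n, m) = s[max(1, nextind(s, 0, n)):min(end, nextind(s, 0, m))]

s = "aαsdfβ"                              # 6 characters, 8 code units
@assert slice(s, 2, 4) == "αsd"           # fully inside the string
@assert slice(s, 1, 10) == first(s, 10)   # end is clamped, like first
@assert slice(s, 8, 10) == ""             # start past the end: empty range
```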

Glad to hear it. Truth be told, I was eyeing Stefan’s concise version and wondering if I’d written far too much 🙂