Truncate String

aplavin · August 26, 2019, 12:52am

Hi! I’m having trouble with a simple problem: truncating strings to, say, 10 characters. Because strings are not indexed by character position, solutions such as s[1:10] (as in Python) or s[1:min(length(s), 10)] don’t work. They throw an indexing exception, as explained in the docs. I couldn’t find a solution in the documentation either, so how do I truncate strings properly?

c42f · August 26, 2019, 1:02am

You can use first(str, n) for this:

julia> first("aαbβ", 3)
"aαb"

julia> first("aαbβ", 4)
"aαbβ"

julia> first("aαbβ", 5)
"aαbβ"

Internally this is implemented in terms of nextind, so you can see how it would be possible to do similar things yourself:

first(s::AbstractString, n::Integer) = s[1:min(end, nextind(s, 0, n))]
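
To make the "do similar things yourself" part concrete, here is a minimal sketch of the same idea. The name truncate_chars is made up for illustration and is not part of Base; it only relies on nextind and lastindex.

```julia
# Hypothetical helper, not part of Base: keep at most `n` characters of `s`.
function truncate_chars(s::AbstractString, n::Integer)
    n <= 0 && return ""
    # nextind(s, 0, n) is the byte index of the n-th character. If `s` has
    # fewer than n characters it lands past the end, so clamp with lastindex.
    stop = min(lastindex(s), nextind(s, 0, n))
    return s[1:stop]
end

truncate_chars("aαbβ", 3)    # "aαb"
truncate_chars("aαbβ", 10)   # "aαbβ"
```

In practice first(s, n) already does this, so the helper is only useful as a starting point for variations.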

StefanKarpinski · August 26, 2019, 1:03am

Replace length(s) with end.

aplavin · August 26, 2019, 1:07am

If that’s a fix for my simple solution, it doesn’t work - s[1:min(end, 10)] still throws an error.

aplavin · August 26, 2019, 1:10am

Thank you, it works! Does this mean that the easiest way to get a general slice of a string from the nth to the mth character is the pretty verbose first(last(s, length(s) - n), m - n)? Direct indexing is even worse - s[max(1, prevind(s, ncodeunits(s)+1, n)):min(end, nextind(s, 0, n))], so is there no shortcut as compact as s[n:m]?

thautwarm · August 26, 2019, 1:37am

You might need nextind and prevind when dealing with Unicode strings.

julia> a = "aαsdfβ"
"aαsdfβ"

julia> sizeof(a)
8

julia> nextind(a, 1)
2

julia> nextind(a, 2)
4

julia> nextind(a, 4)
5

julia> nextind(a, 5)
6

julia> nextind(a, 6)
7

julia> nextind(a, 7)
9

julia> nextind(a, 9)
ERROR: BoundsError: attempt to access "aαsdfβ"
  at index [9]

julia> prevind(a, 9)
7

julia> prevind(a, 7)
6

julia> prevind(a, 6)
5

julia> prevind(a, 5)
4

julia> prevind(a, 4)
2

julia> prevind(a, 2)
1

julia> prevind(a, 1)
0

julia> prevind(a, 0)
ERROR: BoundsError: attempt to access "aαsdfβ"

StefanKarpinski · August 26, 2019, 2:03am

It works if 10 is a valid string index. If you want 10 characters you need to find the index of the tenth character.
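
A small illustration of that point, using an example string chosen here (not from the thread) in which every character takes two bytes:

```julia
s = "αβγδεζηθικλμ"        # 12 characters, 2 bytes each in UTF-8

isvalid(s, 10)             # false: byte 10 falls in the middle of a character
i10 = nextind(s, 0, 10)    # 19, the byte index where the 10th character starts
s[1:i10]                   # "αβγδεζηθικ"
```

This assumes the string has at least 10 characters; for shorter strings the result of nextind runs past the end and has to be clamped, for example with min(end, ...).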

aplavin · August 26, 2019, 2:41am

Sure, the index needs to be valid. As I understand it, the only way to get these indices is through nextind/prevind, and these functions are needed for basically any kind of string manipulation not covered by other library functions. The vast majority (I think) of string users think in terms of characters and not their byte representation, so this makes even simple slicing very verbose and error-prone. E.g. my code for slicing from a previous post turns out to be not quite correct; a better version is
slice(s, n, m) = first(last(s, max(0, length(s) - (n - 1))), m - (n - 1)).
Note the off-by-1 fix and explicit max for strings shorter than n symbols. And I’m still not sure if it covers all cases.

c42f · August 26, 2019, 3:48am

The string API is like this because indexing with character indices is not an O(1) operation for encodings where a character and a code unit don’t coincide (including UTF-8). This makes slicing a long string in terms of character indices a major performance trap. Hence the nextind-style interface, where people are encouraged to think in terms of iteration.
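
As a rough illustration of thinking in terms of iteration (the sample string is just an example), the valid indices of a UTF-8 string are byte offsets, and the only way to discover them is to walk the string:

```julia
s = "aαbβ"

length(s)               # 4 characters (counting them scans the string)
ncodeunits(s)           # 6 bytes
collect(eachindex(s))   # [1, 2, 4, 5], the byte index of each character

# Visit every character together with its byte index in a single pass.
for (i, c) in pairs(s)
    println(i, " => ", c)
end
```

Finding the byte index of the nth character therefore costs time proportional to n, which is why repeated character-based slicing of a long string adds up quickly.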

We have several convenience functions first, last, chop, startswith, endswith which refer to the beginning and end of the string, but I can’t see a function for pulling a substring of N chars out of the middle, in the spirit of your slice. I do think there could be a convenience function for this; something like

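function slice(s, r::UnitRange)
    isempty(r) && return ""            # nextind below rejects a negative count
    i = nextind(s, 0, first(r))        # byte index of the first requested character
    i <= lastindex(s) || return ""     # range starts past the end of the string
    # byte index of the last requested character, clamped to the end of the string
    j = min(lastindex(s), nextind(s, i, length(r)-1))
    s[i:j]
end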

I’m not sure about the best name for this or the most convenient/robust method of handling errors. (For example, first just returns fewer characters than requested if the string is too short. This version of slice does the same.) It should also come with documentation noting that it’s probably not a good idea to use it on long strings.

[edit - fixed a problem with indexing off the end]
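
One possible variation, sketched here under a made-up name (slice_view is not part of Base): return a SubString so the selected characters are not copied. It assumes the range starts at 1 or higher, and it still has to scan the string to find the byte indices, so the caveat about long strings applies unchanged.

```julia
# Hypothetical variant, not part of Base: select characters by position and
# return a SubString (a view into `s`) instead of a new String.
# Assumes first(r) >= 1.
function slice_view(s::AbstractString, r::UnitRange{<:Integer})
    isempty(r) && return SubString(s, 1, 0)          # empty view
    i = nextind(s, 0, first(r))                      # start of the first requested character
    i <= lastindex(s) || return SubString(s, 1, 0)   # range starts past the end
    # start of the last requested character, clamped to the end of the string
    j = min(lastindex(s), nextind(s, i, length(r) - 1))
    return SubString(s, i, j)
end

slice_view("aαsdfβ", 2:4)    # "αsd"
slice_view("abcdef", 1:10)   # "abcdef"
slice_view("abc", 5:7)       # ""
```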

StefanKarpinski · August 26, 2019, 3:48am

slice(s, n, m) = s[nextind(s, 0, n):nextind(s, 0, m)]

aplavin · August 26, 2019, 4:00am

Thanks again, the code and explanation of the API is definitely helpful!

aplavin · August 26, 2019, 4:01am

With this implementation of slice we have:

s = "abcdef"
first(s, 10) == "abcdef"  # OK
slice(s, 1, 10)  # error - instead of giving the same as prev line

StefanKarpinski · August 26, 2019, 4:06am

Sure, if you want to ignore out of bounds character indices, you’ll need some min/max calls.

slice(s, n, m) = s[max(1, nextind(s, 0, n)):min(end, nextind(s, 0, m))]

It seems debatable to me that this is the preferred behavior for a function intended to take slices, but if that’s what you want, that’s how you do it.
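
For reference, a quick check of how the clamped one-liner behaves on a few inputs. The definition is repeated unchanged so the snippet runs on its own; the sample strings are only examples.

```julia
# Same definition as above, repeated so the snippet is self-contained.
slice(s, n, m) = s[max(1, nextind(s, 0, n)):min(end, nextind(s, 0, m))]

s = "abcdef"
slice(s, 1, 10) == first(s, 10)   # true: the out-of-range end is clamped
slice("aαsdfβ", 2, 4)             # "αsd"
slice("abc", 5, 7)                # "": the start is past the end, so the range is empty
```

Negative character positions are not handled here: nextind throws an ArgumentError for a negative count.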

c42f · August 26, 2019, 4:09am

Glad to hear it. Truth be told, I was eyeing Stefan’s concise version and wondering if I’d written far too much.
