Performance of length(::String)

Hi!

I’ve noticed that length(::String) isn’t a cheap operation in Julia, since it requires iterating over the string. In fact, it’s more expensive than Python’s len(str), even for Unicode strings. Also, length(::String) is slower in Julia 0.7 than in Julia 0.6:

Julia 0.6.4
julia> const unicode_string = repeat("ñ", 100_000);

julia> length(unicode_string)
100000

julia> quantile([ @elapsed length(unicode_string) for i in 1:100 ], [0.25, 0.5, 0.75])
3-element Array{Float64,1}:
 0.00015227 
 0.000152272
 0.000152298
Julia 0.7.0-beta2.81
julia> const unicode_string = repeat("ñ", 100_000);

julia> length(unicode_string)
100000

julia> using Statistics

julia> quantile([ @elapsed length(unicode_string) for i in 1:100 ], [0.25, 0.5, 0.75])
3-element Array{Float64,1}:
 0.00046879075
 0.000469071  
 0.00050228975
Julia 0.7.0-beta2.126
julia> const unicode_string = repeat("ñ", 100_000);

julia> length(unicode_string)
100000

julia> using Statistics

julia> quantile([ @elapsed length(unicode_string) for i in 1:100 ], [0.25, 0.5, 0.75])
3-element Array{Float64,1}:
 0.00051588 
 0.000515881
 0.000515884
Python 2.7.12
In [1]: unicode_string = u'ñ' * 100000

In [2]: %time len(unicode_string)
CPU times: user 6 µs, sys: 1 µs, total: 7 µs
Wall time: 26 µs
Out[2]: 100000
Python 3.5.2
In [1]: unicode_string = 'ñ' * 100000

In [2]: %time len(unicode_string)
CPU times: user 7 µs, sys: 1 µs, total: 8 µs
Wall time: 13.6 µs
Out[2]: 100000

So, Python took 0.000013 seconds while the latest Julia took 0.000516 seconds.

Sometimes I’m using length(::String) to decide what to do with a parsed line from eachline(...).

In the particular case of testing for an empty string, isempty(::String) is faster than length(::String) == 0:

julia> quantile([ @elapsed length(unicode_string) == 0 for i in 1:100 ], [0.25, 0.5, 0.75])
3-element Array{Float64,1}:
 0.000515878
 0.00051588 
 0.0005159  

julia> quantile([ @elapsed isempty(unicode_string) for i in 1:100 ], [0.25, 0.5, 0.75])
3-element Array{Float64,1}:
 4.1e-8
 4.1e-8
 4.2e-8

But cases like length(::String) <= 3 are trickier.
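One way to handle bounds like length(s) <= 3 without a full scan is to stop counting as soon as the bound is exceeded. Here is a rough sketch of that idea in Python over raw UTF-8 bytes (the helper name at_most_codepoints is invented for illustration; in Julia the analogous trick would be to iterate the string and bail out after limit + 1 characters instead of calling length):

```python
# Sketch (not a real API): check whether UTF-8 data holds at most `limit`
# codepoints, stopping early instead of counting every character the way
# evaluating length(s) <= 3 via a full length() would.
def at_most_codepoints(data: bytes, limit: int) -> bool:
    count = 0
    for b in data:
        # UTF-8 continuation bytes match 0b10xxxxxx; any other byte
        # starts a new codepoint.
        if (b & 0xC0) != 0x80:
            count += 1
            if count > limit:
                return False  # early exit: bound already exceeded
    return True

s = "ñ" * 100  # 100 codepoints, 200 UTF-8 bytes
print(at_most_codepoints(s.encode("utf-8"), 3))     # stops after a few bytes
print(at_most_codepoints("ab".encode("utf-8"), 3))
```

For short bounds this reads only a handful of bytes regardless of how long the string is.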

Cheers,

Python 3 uses a Unicode representation specifically optimized for this operation — internally, it uses “UTF-8” (for ASCII data), “UTF-16” (for UCS-2 data), or UTF-32 depending on what codepoints the string contains, so that it never needs to use a variable-width encoding. (That being said, it’s possible that Julia’s length(::String) could be further optimized.)
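This is observable from CPython itself: under PEP 393 (the “flexible string representation”, CPython 3.3+), a string is stored with 1, 2, or 4 bytes per character depending on its widest codepoint, so len stays O(1). A small illustration — exact sys.getsizeof values vary by CPython version, so only the ordering is checked here:

```python
import sys

n = 1000
ascii_s  = "a" * n           # all codepoints < 128   -> 1 byte/char
latin_s  = "ñ" * n           # all codepoints < 256   -> 1 byte/char (Latin-1 range)
bmp_s    = "\u2192" * n      # codepoints < 0x10000   -> 2 bytes/char
astral_s = "\U0001F600" * n  # above the BMP          -> 4 bytes/char

# len is the stored character count, identical and O(1) for all of them:
print([len(s) for s in (ascii_s, latin_s, bmp_s, astral_s)])

# The in-memory footprint grows with the per-character width:
print(sys.getsizeof(latin_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s))
```

The trade-off is that CPython pays for this at construction time (it must scan the string to pick a width), whereas Julia’s String pays at length() time.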

The question here is why do you care? The number of codepoints in a string is required relatively rarely.

Or, to put it another way, maybe sizeof() will serve your needs?

That’s not correct.
They store strings with codepoints all < 256 as Latin-1 (same as I do in Strs.jl), and have a flag to indicate if all of the codepoints are < 128 (i.e. ASCII) (which can be used directly when UTF-8 is requested). They can also cache a UTF-8 representation of strings that aren’t ASCII.

I’ve read that “UTF-8 Everywhere” manifesto before - they seem rather myopic about many issues.

Do you have a similar type in Strs.jl?

I think the best solution to the problem in this topic is not trying to improve the performance of String (although that is important too) but having a richer string ecosystem.

julia> const unicode_string = repeat("ñ", 100_000);
julia> length(unicode_string)
100000
julia> sizeof(unicode_string)
200000
julia> sizeof(unicode_string[1])
4

Is there something wrong?
The last statement should return 2 and not 4, shouldn’t it?

unicode_string[1] returns a Char, whose size is always 4 bytes independent of what it holds (sizeof('A') is also 4); you should use sizeof(unicode_string[1:1]) to get 2.

Yes, the UniStr type (although I’ll be revamping it, because even with the Union optimizations it was not as fast as I wanted, so I’m going to change it from a Union of the 4 types (ASCII, Latin1, UCS2, UTF32) to a concrete type that does its own dispatching between the 4 internal “types”).

Does Python cache the number of characters?

In [1]: str1 = 'a' * 10**6

In [2]: %timeit len(str1)
The slowest run took 40.29 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 47.3 ns per loop

In [3]: str2 = 'é' * 10**6

In [4]: %timeit len(str2)
10000000 loops, best of 3: 50.4 ns per loop

In [5]: print(u'\u2192')
→

In [6]: str3 = u'\u2192' * 10**6

In [7]: %timeit len(str3)
The slowest run took 82.55 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 49.1 ns per loop

In Julia, length(str2) takes twice as long as length(str1), and length(str3) takes three times as long, no matter how many times they are called. Of course, code that repeatedly asks for the length of a big string is not well designed. But it’s worth understanding how Python and Julia work with strings.

Python (post 3.3) always stores strings in a fixed-width encoding, so the number of code units == the number of code points and len(str) is therefore O(1). Julia’s String, by contrast, uses a variable-width encoding, so length(str) is an O(n) operation, where n is sizeof(str), which can be as much as 4x the number of code points.
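The gap between the two quantities is easy to see from Python by round-tripping through UTF-8, which is (roughly) what Julia’s String stores:

```python
a = "a" * 1000   # ASCII: 1 UTF-8 byte per character
b = "ñ" * 1000   # 2 UTF-8 bytes per character

# Codepoint count: identical, and O(1) in CPython (fixed-width storage).
print(len(a), len(b))                                  # 1000 1000

# UTF-8 byte count: what a variable-width representation has to walk
# through to count codepoints (Julia's sizeof vs length distinction).
print(len(a.encode("utf-8")), len(b.encode("utf-8")))  # 1000 2000
```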

Yes, it’s clear that Julia always uses a single variable-width encoding. The example I posted above was for Python 2.7.14+. For Python 3.6.4 the three times are all about 75 ns. If these are all fixed-width encodings, then it’s not clear what %timeit is measuring.

Most code shouldn’t need to ask for the length of a string (in codepoints) at all.

That was implicitly part of my point. If the need to get the number of codepoints is rare, then the user code should store the value if necessary. It would add unneeded complexity to String to do this by default and transparently (and would mutate the state of String, etc…)

Is it due to the variable width nature of UTF-8?

While this can be explained, I think this is very confusing and potentially error prone for end users.

If you want the index of the last character in the string, use lastindex(s); length (in any language where it returns the number of codepoints) is the wrong tool for that in any variable-width encoding. Nor is length the size of the string in memory — that’s sizeof.

The number of codepoints is not typically a useful quantity because of the nature of Unicode. It is not the width of the string on the screen, nor is it the number of user-perceived “characters”, nor is it generally the size in memory.

(I agree that String indexing is confusing, but length is the least of it — the most confusing thing for new users is that string indices are not in general consecutive. This is necessary for performance in a variable-width encoding. However, if you are mucking around with individual characters in a string, as opposed to just passing around opaque cookies like filenames, you probably need to learn something about Unicode.)

This is not unique to Julia - Haskell has the same approach with Text (the difference is that UTF-16 is used for encoding), see e.g. Haskell String Types for a basic discussion of the options in Haskell, so whether it is confusing or not depends on your background 🙂.

In Julia the default mode of indexing strings is byte-based not character-based in Base.

In practice, I would recommend not to use direct indexing (unless someone really knows the internals in detail - e.g. how String handles invalid UTF-8, which is tricky), but rather functions that work on strings as a whole (like first, last or regular expressions) or use iteration.

Here is a detailed specification of how String indexing works.

The confusingness of how strings work seems a bit overstated here. If you use the index values that string functions give you then everything “just works” and there’s nothing confusing about it. The only thing that’s at all tricky is that if you want to do string index arithmetic you can’t just increment or decrement index values to get to the next/previous character index, you have to use the nextind and prevind functions. The length function tells you how many character values a string will produce if you iterate it. That’s not a terribly useful number most of the time, but it’s also not all that confusing.
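For readers curious what nextind has to do under the hood: in UTF-8, advancing one character means skipping continuation bytes until the next codepoint start. A rough Python sketch over raw bytes (next_index is an invented name for illustration; Julia’s actual nextind also handles invalid data and other encodings):

```python
def next_index(data: bytes, i: int) -> int:
    # Move from the start of one UTF-8-encoded codepoint to the start of
    # the next by skipping continuation bytes (pattern 0b10xxxxxx).
    i += 1
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

data = "añb".encode("utf-8")  # b'a\xc3\xb1b': 4 bytes, 3 codepoints
print(next_index(data, 0))    # 1: 'a' occupies one byte
print(next_index(data, 1))    # 3: 'ñ' spans bytes 1-2
```

This is also why valid string indices are not consecutive: 2 is not the start of any character in this data.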

So is it possible to index into a string at a specific character location without iterating by nextind?

The confusing part comes from the difference between this

sizeof(unicode_string[1])

and

sizeof(unicode_string[1:1])

At a glance, it would seem that they should produce the same result.

unicode_string[1] returns a Char, which is always 4 bytes, while unicode_string[1:1] returns a String, which is UTF-8 encoded and thus can occupy 1, 2, 3, or 4 bytes depending on the character it contains.
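The same contrast can be reproduced in Python, where a single codepoint can be encoded either fixed-width (like Julia’s Char, a 32-bit value) or as UTF-8 (like the one-character String unicode_string[1:1]):

```python
ch = "ñ"
# Fixed-width: every codepoint takes 4 bytes, like Julia's Char.
print(len(ch.encode("utf-32-le")))  # 4
# Variable-width UTF-8: this particular character needs only 2 bytes,
# like the one-character String produced by slicing.
print(len(ch.encode("utf-8")))      # 2
```

(utf-32-le is used rather than plain utf-32 to avoid the 4-byte BOM the latter prepends.)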