Performance of length(::String)

Hi!

I’ve noticed that length(::String) isn’t a cheap operation in Julia, since it requires iterating over the string. In fact, it’s more expensive than Python’s len(str), even for Unicode strings. Also, length(::String) is slower in Julia 0.7 than in Julia 0.6:

Julia 0.6.4
julia> const unicode_string = repeat("ñ", 100_000);

julia> length(unicode_string)
100000

julia> quantile([ @elapsed length(unicode_string) for i in 1:100 ], [0.25, 0.5, 0.75])
3-element Array{Float64,1}:
 0.00015227 
 0.000152272
 0.000152298
Julia 0.7.0-beta2.81
julia> const unicode_string = repeat("ñ", 100_000);

julia> length(unicode_string)
100000

julia> using Statistics

julia> quantile([ @elapsed length(unicode_string) for i in 1:100 ], [0.25, 0.5, 0.75])
3-element Array{Float64,1}:
 0.00046879075
 0.000469071  
 0.00050228975
Julia 0.7.0-beta2.126
julia> const unicode_string = repeat("ñ", 100_000);

julia> length(unicode_string)
100000

julia> using Statistics

julia> quantile([ @elapsed length(unicode_string) for i in 1:100 ], [0.25, 0.5, 0.75])
3-element Array{Float64,1}:
 0.00051588 
 0.000515881
 0.000515884
Python 2.7.12
In [1]: unicode_string = u'ñ' * 100000

In [2]: %time len(unicode_string)
CPU times: user 6 µs, sys: 1 µs, total: 7 µs
Wall time: 26 µs
Out[2]: 100000
Python 3.5.2
In [1]: unicode_string = 'ñ' * 100000

In [2]: %time len(unicode_string)
CPU times: user 7 µs, sys: 1 µs, total: 8 µs
Wall time: 13.6 µs
Out[2]: 100000

So, Python took 0.000013 seconds while the latest Julia took 0.000516 seconds.

Sometimes I’m using length(::String) to decide what to do with a parsed line from eachline(...).

In the particular case of testing for an empty string, isempty(::String) is faster than length(::String) == 0:

julia> quantile([ @elapsed length(unicode_string) == 0 for i in 1:100 ], [0.25, 0.5, 0.75])
3-element Array{Float64,1}:
 0.000515878
 0.00051588 
 0.0005159  

julia> quantile([ @elapsed isempty(unicode_string) for i in 1:100 ], [0.25, 0.5, 0.75])
3-element Array{Float64,1}:
 4.1e-8
 4.1e-8
 4.2e-8

But cases like length(::String) <= 3 are trickier.
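One way to handle bounds like length(s) <= 3 without a full scan is to stop counting as soon as the bound is exceeded. Here is a rough sketch of that idea in Python over raw UTF-8 bytes (the helper name at_most_codepoints is invented for illustration; in Julia the analogous trick would be to iterate the string and bail out after limit + 1 characters instead of calling length):

```python
# Sketch (not a real API): check whether UTF-8 data holds at most `limit`
# codepoints, stopping early instead of counting every character the way
# evaluating length(s) <= 3 via a full length() would.
def at_most_codepoints(data: bytes, limit: int) -> bool:
    count = 0
    for b in data:
        # UTF-8 continuation bytes match 0b10xxxxxx; any other byte
        # starts a new codepoint.
        if (b & 0xC0) != 0x80:
            count += 1
            if count > limit:
                return False  # early exit: bound already exceeded
    return True

s = "ñ" * 100  # 100 codepoints, 200 UTF-8 bytes
print(at_most_codepoints(s.encode("utf-8"), 3))     # stops after a few bytes
print(at_most_codepoints("ab".encode("utf-8"), 3))
```

For short bounds this reads only a handful of bytes regardless of how long the string is.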

Cheers,

Python 3 uses a Unicode representation specifically optimized for this operation — internally, it uses “UTF-8” (for ASCII data), “UTF-16” (for UCS-2 data), or UTF-32 depending on what codepoints the string contains, so that it never needs to use a variable-width encoding. (That being said, it’s possible that Julia’s length(::String) could be further optimized.)
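This is observable from CPython itself: under PEP 393 (the “flexible string representation”, CPython 3.3+), a string is stored with 1, 2, or 4 bytes per character depending on its widest codepoint, so len stays O(1). A small illustration — exact sys.getsizeof values vary by CPython version, so only the ordering is checked here:

```python
import sys

n = 1000
ascii_s  = "a" * n           # all codepoints < 128   -> 1 byte/char
latin_s  = "ñ" * n           # all codepoints < 256   -> 1 byte/char (Latin-1 range)
bmp_s    = "\u2192" * n      # codepoints < 0x10000   -> 2 bytes/char
astral_s = "\U0001F600" * n  # above the BMP          -> 4 bytes/char

# len is the stored character count, identical and O(1) for all of them:
print([len(s) for s in (ascii_s, latin_s, bmp_s, astral_s)])

# The in-memory footprint grows with the per-character width:
print(sys.getsizeof(latin_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s))
```

The trade-off is that CPython pays for this at construction time (it must scan the string to pick a width), whereas Julia’s String pays at length() time.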

The question here is why do you care? The number of codepoints in a string is required relatively rarely.

Or, to put it another way, maybe sizeof() will serve your needs?

That’s not correct.
They store strings with codepoints all < 256 as Latin-1 (same as I do in Strs.jl), and have a flag to indicate if all of the codepoints are < 128 (i.e. ASCII) (which can be used directly when UTF-8 is requested). They can also cache a UTF-8 representation of strings that aren’t ASCII.

I’ve read that “UTF-8 Everywhere” manifesto before - they seem rather myopic about many issues.

Do you have a similar type in Strs.jl?

I think the best solution to the problem in this topic is not trying to improve the performance of String (although that is important too) but having a richer string ecosystem.

julia> const unicode_string = repeat("ñ", 100_000);
julia> length(unicode_string)
100000
julia> sizeof(unicode_string)
200000
julia> sizeof(unicode_string[1])
4

Is there something wrong?
The last statement should return 2 and not 4, shouldn’t it?

unicode_string[1] returns a Char, whose size is always 4 bytes independent of what it holds (sizeof('A') is also 4); you should use sizeof(unicode_string[1:1]) to get 2.

Yes, the UniStr type (although I’ll be revamping it, because even with the Union optimizations it was not as fast as I wanted, so I’m going to change it from a Union of the 4 types (ASCII, Latin1, UCS2, UTF32) to a concrete type that does its own dispatching between the 4 internal “types”).

Does Python cache the number of characters?

In [1]: str1 = 'a' * 10**6

In [2]: %timeit len(str1)
The slowest run took 40.29 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 47.3 ns per loop

In [3]: str2 = 'é' * 10**6

In [4]: %timeit len(str2)
10000000 loops, best of 3: 50.4 ns per loop

In [5]: print(u'\u2192')
→

In [6]: str3 = u'\u2192' * 10**6

In [7]: %timeit len(str3)
The slowest run took 82.55 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 49.1 ns per loop

In Julia, length(str2) takes twice as long as length(str1), and length(str3) takes three times as long, no matter how many times they are called. Of course, code that repeatedly asks for the length of a big string is not well designed. But it’s worth understanding how Python and Julia work with strings.

Python (post 3.3) always stores strings in a fixed-width encoding, so the number of code units == the number of code points and len(str) is therefore O(1). Julia’s String, by contrast, uses a variable-width encoding, so length(str) is an O(n) operation, where n is sizeof(str), which can be as much as 4x the number of code points.
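The gap between the two quantities is easy to see from Python by round-tripping through UTF-8, which is (roughly) what Julia’s String stores:

```python
a = "a" * 1000   # ASCII: 1 UTF-8 byte per character
b = "ñ" * 1000   # 2 UTF-8 bytes per character

# Codepoint count: identical, and O(1) in CPython (fixed-width storage).
print(len(a), len(b))                                  # 1000 1000

# UTF-8 byte count: what a variable-width representation has to walk
# through to count codepoints (Julia's sizeof vs length distinction).
print(len(a.encode("utf-8")), len(b.encode("utf-8")))  # 1000 2000
```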

Yes, it’s clear that Julia always uses a single variable-width encoding. The example I posted above was for Python 2.7.14+. For Python 3.6.4 the three times are all about 75 ns. If these are all fixed-width encodings, then it’s not clear what %timeit is measuring.

Most code shouldn’t need to ask for the length of a string (in codepoints) at all.

That was implicitly part of my point. If the need to get the number of codepoints is rare, then the user code should store the value if necessary. It would add unneeded complexity to String to do this by default and transparently (and would mutate the state of String, etc…)

Is it due to the variable width nature of UTF-8?

While this can be explained, I think this is very confusing and potentially error prone for end users.

If you want the index of the last character in the string, use lastindex(s); length (in any language where it returns the number of codepoints) is the wrong tool for that in any variable-width encoding. Nor is length the size of the string in memory — that’s sizeof.

The number of codepoints is not typically a useful quantity because of the nature of Unicode. It is not the width of the string on the screen, nor is it the number of user-perceived “characters”, nor is it generally the size in memory.

(I agree that String indexing is confusing, but length is the least of it — the most confusing thing for new users is that string indices are not in general consecutive. This is necessary for performance in a variable-width encoding. However, if you are mucking around with individual characters in a string, as opposed to just passing around opaque cookies like filenames, you probably need to learn something about Unicode.)

This is not unique to Julia - Haskell has the same approach with Text (the difference is that UTF-16 is used for encoding), see e.g. Haskell String Types for a basic discussion of the options in Haskell, so whether it is confusing or not depends on your background 🙂.

In Julia the default mode of indexing strings is byte-based not character-based in Base.

In practice, I would recommend not to use direct indexing (unless someone really knows the internals in detail - e.g. how String handles invalid UTF-8, which is tricky), but rather functions that work on strings as a whole (like first, last or regular expressions) or use iteration.

Here is a detailed specification of how String indexing works.

The confusingness of how strings work seems a bit overstated here. If you use the index values that string functions give you then everything “just works” and there’s nothing confusing about it. The only thing that’s at all tricky is that if you want to do string index arithmetic you can’t just increment or decrement index values to get to the next/previous character index, you have to use the nextind and prevind functions. The length function tells you how many character values a string will produce if you iterate it. That’s not a terribly useful number most of the time, but it’s also not all that confusing.
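For readers curious what nextind has to do under the hood: in UTF-8, advancing one character means skipping continuation bytes until the next codepoint start. A rough Python sketch over raw bytes (next_index is an invented name for illustration; Julia’s actual nextind also handles invalid data and other encodings):

```python
def next_index(data: bytes, i: int) -> int:
    # Move from the start of one UTF-8-encoded codepoint to the start of
    # the next by skipping continuation bytes (pattern 0b10xxxxxx).
    i += 1
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

data = "añb".encode("utf-8")  # b'a\xc3\xb1b': 4 bytes, 3 codepoints
print(next_index(data, 0))    # 1: 'a' occupies one byte
print(next_index(data, 1))    # 3: 'ñ' spans bytes 1-2
```

This is also why valid string indices are not consecutive: 2 is not the start of any character in this data.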

So is it possible to index into a string at a specific character location without iterating by nextind?

The confusing part comes from the difference between this

sizeof(unicode_string[1])

and

sizeof(unicode_string[1:1])

At a glance, it would seem that they should produce the same result.

unicode_string[1] returns a Char, which is always 4 bytes, while unicode_string[1:1] returns a String, which is UTF-8 encoded and thus can occupy 1, 2, 3, or 4 bytes depending on the character it contains.
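The same contrast can be reproduced in Python, where a single codepoint can be encoded either fixed-width (like Julia’s Char, a 32-bit value) or as UTF-8 (like the one-character String unicode_string[1:1]):

```python
ch = "ñ"
# Fixed-width: every codepoint takes 4 bytes, like Julia's Char.
print(len(ch.encode("utf-32-le")))  # 4
# Variable-width UTF-8: this particular character needs only 2 bytes,
# like the one-character String produced by slicing.
print(len(ch.encode("utf-8")))      # 2
```

(utf-32-le is used rather than plain utf-32 to avoid the 4-byte BOM the latter prepends.)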