Dot calls / broadcasts for strings

fedario · May 15, 2021, 9:56am

I encountered some unexpected behavior regarding broadcasted functions for strings.

Let’s say we have two strings:

a = "abc"
b = "abx"

When checking for equality, the result was as I expected:

a == b
# result: false

When I wanted to check for the equality for each individual glyph using a broadcasted or dotted call, I was expecting to get a boolean vector giving the equality for each glyph.
That was not the case, however:

a .== b
# result: false 
# expected: [true, true, false]

When doing the same for arrays, the result is (close to) what I would expect:

x = [1, 2, 3]
y = [1, 2, 4]

x == y
# result: false

x .== y
# result: 3-element BitVector:
#          1
#          1
#          0

I ended up using this workaround which gave me the desired result:

split(a, "") .== split(b, "")
# result: 3-element BitVector:
#          1
#          1
#          0

I know that in Julia strings aren’t arrays of ASCII chars, but instead are more complex due to Unicode encodings. However, indexing into strings does mirror the array indexing interface, giving individual glyphs as a result, even though they may correspond to multiple bytes behind the scene.

My questions are:

What is the reason for strings behaving this way for dotted (equality) calls?
Wouldn’t it be advantageous if == and .== did behave the same for strings as they do for arrays?

julia version 1.6.1

oheil · May 15, 2021, 10:58am

I didn’t found a reasons, except that String is considered to be a scalar type.
See e.g. How to check if a variable is scalar in julia - Stack Overflow
or https://syl1.gitbook.io/julia-language-a-concise-tutorial/language-core/data-types#scalar-types
But by the definition of isscalar using Broadcast it is a circular argument. I would say, defining .== for Strings would be doable, except that I can’t foresee the side effects.

split(a, "") .== split(b, "")

This is probably not what you want to do, because split returns an array of Substring.
You are looking for:

getindex.(a,1:length(a)) .== getindex.(b,1:length(b))

or

collect(a) .== collect(b)

which both compare Char and not Substring and which are 10x faster:

julia> @btime getindex.($a,1:length($a))
  28.715 ns (1 allocation: 96 bytes)

julia> @btime split($a, "")
  288.176 ns (10 allocations: 704 bytes)

nilshg · May 15, 2021, 11:08am

It seems to me that this would make working with vectors of strings painful - mainly in DataFrames, where subsetting data by doing something like df[df.col1 .== "some string", :] would presumably error if the string is considered an iterator.

rafael.guerra · May 15, 2021, 11:11am

Should strings be vectors of chars?

a = "abcd"
b = "abxd"
Vector{Char}(a) .== Vector{Char}(b)

4-element BitVector:
 1
 1
 0
 1

oheil · May 15, 2021, 11:17am

julia> @code_lowered Vector{Char}(a)
CodeInfo(
1 ─ %1 = Base.collect($(Expr(:static_parameter, 1)), s)
└──      return %1
)

identical to collect(Char,a).

antoine-levitt · May 15, 2021, 12:08pm

There is always this tension of what types are considered “scalar” vs “containers” for the purposes of broadcasting. For instance, Strings are “scalar” and structs are “containers” by default, even though sometimes you want structs to be scalar - in which case you use Ref - and sometimes you want strings to be containers. The weird part is that strings are iterable and structs are not (by default). That’s just life I’m afraid… maybe something to keep in mind for 2.0?

fedario · May 15, 2021, 12:55pm

Thanks @oheil for the excellent performance tip! I actually thought collect would be slower because it didn’t return a view. I obviously should have measured instead of guessed.
Could you explain why the SubString view (it is a view, right?) makes it slower? I was expecting == to just perform read operations on that view and hence prevent the intermediate allocation of an array (which collect probably does, right?).

I’m not sure this was the reasoning behind the the current implementation, since DataFrames, unlike strings is not part of the standard library. But it is still a good point!

You are probably right, that this is not an easy to solve tradeoff, I still wonder why the decision was made to treat strings as “scalars” instead of containers for broadcasting.
Are you sure structs are treated as containers for broadcasts? Or did you mean arrays?
Also, Ref probably has no performance overhead, I’m not sure the same is true for collecting strings.

stevengj · May 15, 2021, 12:57pm

Every new type is assumed to be a “container” by default. If you want to opt out of this for a new type that you define, you define Base.broadcastable(x::MyType) = Ref(x). This has been done for strings and a bunch of other types, and packages should use it too.

Treating Strings as “atoms” seems to be much more common than treating them as containers of characters (e.g. consider strings as filenames, as labels in a dataframe, as keys in a dictionary…). Also, keep in mind that “characters” are often an ambiguous way to think about the contents of strings. For example, "Raphaël" and "Raphaël" are canonically equivalent strings, but they don’t contain the same Chars (or even the same number of Chars).

(And the non-consecutive indexing of String—which is crucial for efficient access with a variable-length encoding like Julia’s UTF-8—also makes them technically difficult to treat as an AbstractVector{Char} in broadcast or anywhere else.)

oheil · May 15, 2021, 1:11pm

The SubString isn’t that what makes it slower.
But split fills up a Vector with chunks of SubString-views by traversing through the complete String searching for the next occurence of the split character. Thats how it looks like:

julia> @less split(a,"")

...
function _split(str::AbstractString, splitter::F, limit::Integer, keepempty::Bool, strs::Vector) where F
    # Forcing specialization on `splitter` improves performance (roughly 30% decrease in runtime)
    # and prevents a major invalidation risk (1550 MethodInstances)
    i = 1 # firstindex(str)
    n = lastindex(str)::Int
    r = findfirst(splitter,str)::Union{Nothing,Int,UnitRange{Int}}
    if !isnothing(r)
        j, k = first(r), nextind(str,last(r))::Int
        while 0 < j <= n && length(strs) != limit-1
            if i < k
                if keepempty || i < j
                    push!(strs, @inbounds SubString(str,i,prevind(str,j)::Int))
                end
                i = k
            end
            (k <= j) && (k = nextind(str,j)::Int)
            r = findnext(splitter,str,k)::Union{Nothing,Int,UnitRange{Int}}
            isnothing(r) && break
            j, k = first(r), nextind(str,last(r))::Int
        end
    end
    if keepempty || i <= ncodeunits(str)::Int
        push!(strs, @inbounds SubString(str,i))
    end
    return strs
end

Doesn’t look so fast, isn’t it?

antoine-levitt · May 15, 2021, 1:11pm

What I find jarring is the fact that strings are iterable but scalar-like for broadcast, and structs by default are not iterable but container-like for broadcast.

I don’t use strings much so have no opinion on them, but I’ve often wished broadcast on structs was scalar by default. It seems to me that broadcasting really only makes sense for homogeneous structs (containing objects of the same type), which is the exception rather the norm.

rafael.guerra · May 15, 2021, 1:14pm

Scratching my head to understand this:

Is there a typo? Thank you for shedding some more light.

Later edit:
Your other post here helps understanding this a bit further.

oheil · May 15, 2021, 1:15pm

I had the same question marks in head, but copy&paste did it:

julia> "Raphaël"
"Raphaël"

julia> "Raphaël"
"Raphaël"

julia> collect("Raphaël")
8-element Vector{Char}:
 'R': ASCII/Unicode U+0052 (category Lu: Letter, uppercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 '̈': Unicode U+0308 (category Mn: Mark, nonspacing)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)

julia> collect("Raphaël")
7-element Vector{Char}:
 'R': ASCII/Unicode U+0052 (category Lu: Letter, uppercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'ë': Unicode U+00EB (category Ll: Letter, lowercase)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)

rafael.guerra · May 15, 2021, 1:17pm

OMG!

oheil · May 15, 2021, 1:18pm

Character encoding is a nightmare.

fedario · May 15, 2021, 6:03pm

Thanks for all the great answers!

In conclusion, do you think it’s fair to say that, while the behavior mentioned in my OP may be unexpected to some, treating Strings as containers for the purpose of broadcasting would probably lead to confusion in much more cases?

jzr · May 15, 2021, 6:18pm

I hope that in 2.0 strings can become non-iterable. There are too many things I could plausibly mean by “elements of a string”, but most often I mean nothing at all – when I iterate over a string it’s usually just a mistake and I meant to iterate over a collection of strings.

stevengj · May 16, 2021, 1:01am

4 posts were split to a new topic: Char and broadcastable

StefanKarpinski · May 19, 2021, 7:31pm

This is a good idea: writing chars(str) and getting a StringChars object wrapping a string would be better. Note, however that until Julia 1.5 this would have forced a heap allocation just to iterate the characters of a string, so while this would be a good change for 2.0, there are reasons it was this way in 1.0. That also opens the question of what one indexes a StringChars object with: code units would be the efficient answer but seems not to be what people expect.

Mason · May 19, 2021, 7:47pm

Having an inefficient StringChars to complement our nice efficient CodeUnits would be really nice!

stillyslalom · May 19, 2021, 8:21pm

Unicode.graphemes provides the closest to what we interpret as the iterable elements of a string, and it works for the “Raphaël” example. Working with graphemes is much slower than fixed-size characters, but I don’t think this particular wheel needs reinvention!

julia> graphemes("Raphaël") .== graphemes("Raphaël")
7-element BitVector:
 1
 1
 1
 1
 1
 0
 1

Topic		Replies	Views
Dot notation General Usage	5	8698	October 22, 2020
Basic string iteration New to Julia	6	3826	June 17, 2021
Broadcasting on Vector{SubString{String}} does not work in Julia 1.7.0 rc1? General Usage strings , broadcasting	12	1908	October 16, 2021
Broadcasting string interpolation Internals & Design broadcast , string-interpolation	2	635	March 17, 2022
Generalizing Dot Syntax to Broadcast Operators for Arbitrary Types New to Julia	3	862	June 20, 2019

Dot calls / broadcasts for strings

Related topics