Dot calls / broadcasts for strings

I didn’t found a reasons, except that String is considered to be a scalar type.
See e.g. How to check if a variable is scalar in julia - Stack Overflow
or 2 - Data types - Julia language: a concise tutorial
But by the definition of isscalar using Broadcast it is a circular argument. I would say, defining .== for Strings would be doable, except that I can’t foresee the side effects.

split(a, "") .== split(b, "")

This is probably not what you want to do, because split returns an array of Substring.
You are looking for:

getindex.(a,1:length(a)) .== getindex.(b,1:length(b))

or

collect(a) .== collect(b)

which both compare Char and not Substring and which are 10x faster:

julia> @btime getindex.($a,1:length($a))
  28.715 ns (1 allocation: 96 bytes)

julia> @btime split($a, "")
  288.176 ns (10 allocations: 704 bytes)
3 Likes

It seems to me that this would make working with vectors of strings painful - mainly in DataFrames, where subsetting data by doing something like df[df.col1 .== "some string", :] would presumably error if the string is considered an iterator.

2 Likes

Should strings be vectors of chars?

a = "abcd"
b = "abxd"
Vector{Char}(a) .== Vector{Char}(b)

4-element BitVector:
 1
 1
 0
 1
2 Likes
julia> @code_lowered Vector{Char}(a)
CodeInfo(
1 ─ %1 = Base.collect($(Expr(:static_parameter, 1)), s)
└──      return %1
)

identical to collect(Char,a).

2 Likes

There is always this tension of what types are considered “scalar” vs “containers” for the purposes of broadcasting. For instance, Strings are “scalar” and structs are “containers” by default, even though sometimes you want structs to be scalar - in which case you use Ref - and sometimes you want strings to be containers. The weird part is that strings are iterable and structs are not (by default). That’s just life I’m afraid… maybe something to keep in mind for 2.0?

4 Likes

Thanks @oheil for the excellent performance tip! I actually thought collect would be slower because it didn’t return a view. I obviously should have measured instead of guessed.
Could you explain why the SubString view (it is a view, right?) makes it slower? I was expecting == to just perform read operations on that view and hence prevent the intermediate allocation of an array (which collect probably does, right?).

I’m not sure this was the reasoning behind the the current implementation, since DataFrames, unlike strings is not part of the standard library. But it is still a good point!

You are probably right, that this is not an easy to solve tradeoff, I still wonder why the decision was made to treat strings as “scalars” instead of containers for broadcasting.
Are you sure structs are treated as containers for broadcasts? Or did you mean arrays?
Also, Ref probably has no performance overhead, I’m not sure the same is true for collecting strings.

Every new type is assumed to be a “container” by default. If you want to opt out of this for a new type that you define, you define Base.broadcastable(x::MyType) = Ref(x). This has been done for strings and a bunch of other types, and packages should use it too.

Treating Strings as “atoms” seems to be much more common than treating them as containers of characters (e.g. consider strings as filenames, as labels in a dataframe, as keys in a dictionary…). Also, keep in mind that “characters” are often an ambiguous way to think about the contents of strings. For example, "Raphaël" and "Raphaël" are canonically equivalent strings, but they don’t contain the same Chars (or even the same number of Chars).

(And the non-consecutive indexing of String—which is crucial for efficient access with a variable-length encoding like Julia’s UTF-8—also makes them technically difficult to treat as an AbstractVector{Char} in broadcast or anywhere else.)

12 Likes

The SubString isn’t that what makes it slower.
But split fills up a Vector with chunks of SubString-views by traversing through the complete String searching for the next occurence of the split character. Thats how it looks like:

julia> @less split(a,"")

...
function _split(str::AbstractString, splitter::F, limit::Integer, keepempty::Bool, strs::Vector) where F
    # Forcing specialization on `splitter` improves performance (roughly 30% decrease in runtime)
    # and prevents a major invalidation risk (1550 MethodInstances)
    i = 1 # firstindex(str)
    n = lastindex(str)::Int
    r = findfirst(splitter,str)::Union{Nothing,Int,UnitRange{Int}}
    if !isnothing(r)
        j, k = first(r), nextind(str,last(r))::Int
        while 0 < j <= n && length(strs) != limit-1
            if i < k
                if keepempty || i < j
                    push!(strs, @inbounds SubString(str,i,prevind(str,j)::Int))
                end
                i = k
            end
            (k <= j) && (k = nextind(str,j)::Int)
            r = findnext(splitter,str,k)::Union{Nothing,Int,UnitRange{Int}}
            isnothing(r) && break
            j, k = first(r), nextind(str,last(r))::Int
        end
    end
    if keepempty || i <= ncodeunits(str)::Int
        push!(strs, @inbounds SubString(str,i))
    end
    return strs
end

Doesn’t look so fast, isn’t it?

What I find jarring is the fact that strings are iterable but scalar-like for broadcast, and structs by default are not iterable but container-like for broadcast.

I don’t use strings much so have no opinion on them, but I’ve often wished broadcast on structs was scalar by default. It seems to me that broadcasting really only makes sense for homogeneous structs (containing objects of the same type), which is the exception rather the norm.

4 Likes

Scratching my head to understand this:

Is there a typo? Thank you for shedding some more light.

Later edit:
Your other post here helps understanding this a bit further.

I had the same question marks in head, but copy&paste did it:

julia> "Raphaël"
"Raphaël"

julia> "Raphaël"
"Raphaël"

julia> collect("Raphaël")
8-element Vector{Char}:
 'R': ASCII/Unicode U+0052 (category Lu: Letter, uppercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 '̈': Unicode U+0308 (category Mn: Mark, nonspacing)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)

julia> collect("Raphaël")
7-element Vector{Char}:
 'R': ASCII/Unicode U+0052 (category Lu: Letter, uppercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'ë': Unicode U+00EB (category Ll: Letter, lowercase)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
5 Likes

OMG! :face_with_hand_over_mouth:

Character encoding is a nightmare.

3 Likes

Thanks for all the great answers!

In conclusion, do you think it’s fair to say that, while the behavior mentioned in my OP may be unexpected to some, treating Strings as containers for the purpose of broadcasting would probably lead to confusion in much more cases?

I hope that in 2.0 strings can become non-iterable. There are too many things I could plausibly mean by “elements of a string”, but most often I mean nothing at all – when I iterate over a string it’s usually just a mistake and I meant to iterate over a collection of strings.

3 Likes

4 posts were split to a new topic: Char and broadcastable

This is a good idea: writing chars(str) and getting a StringChars object wrapping a string would be better. Note, however that until Julia 1.5 this would have forced a heap allocation just to iterate the characters of a string, so while this would be a good change for 2.0, there are reasons it was this way in 1.0. That also opens the question of what one indexes a StringChars object with: code units would be the efficient answer but seems not to be what people expect.

4 Likes

Having an inefficient StringChars to complement our nice efficient CodeUnits would be really nice!

Unicode.graphemes provides the closest to what we interpret as the iterable elements of a string, and it works for the “Raphaël” example. Working with graphemes is much slower than fixed-size characters, but I don’t think this particular wheel needs reinvention!

julia> graphemes("Raphaël") .== graphemes("Raphaël")
7-element BitVector:
 1
 1
 1
 1
 1
 0
 1
1 Like

The issue is that what one wants is what our strings do: indexing by code units and getting characters. My inclination is that there should be codeunits(str), chars(str) and graphemes(str) and they should all be indexed by code unit, but maybe wrapped in a StringIndex type that you either can’t just do bare arithmetic with or that remembers the string the index is into and does the right advancing when you do arithmetic with it.

5 Likes