Why is AbstractString not a subtype of AbstractVector?

While working on https://github.com/JuliaCollections/DataStructures.jl/pull/759, which I thought would be a funny solution to a Project Euler riddle, I noticed that String is not a subtype of AbstractVector{Char}. Are there good reasons against making AbstractString <: AbstractVector? Or even <: AbstractVector{AbstractChar}?

Not an answer, but this seems related to your last question: Vector{Int} <: Vector{Real} is false??? · JuliaNotes.jl

The <: operator sometimes behaves a little differently than expected with regard to element types, because parametric types in Julia are invariant…

There are various reasons for considering strings something totally different from vectors of chars. For me, an important one is that vectors are mutable objects, whereas strings are not. Also, indexing a string the way you index a vector (i.e. x[i] with i being an integer) does not give the i-th character in general: for strings containing non-ASCII characters it can return a different character or throw an error.


Another big problem with Strings compared to Vectors is that not all indices in the range begin:end are valid:

julia> a="∈b"
"∈b"

julia> a[2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'∈', [4]=>'b'
Stacktrace:
 [1] string_index_err(s::String, i::Int64)
   @ Base ./strings/string.jl:12
 [2] getindex_continued(s::String, i::Int64, u::UInt32)
   @ Base ./strings/string.jl:233
 [3] getindex(s::String, i::Int64)
   @ Base ./strings/string.jl:226
 [4] top-level scope
   @ REPL[5]:1

(Use nextind(a, 1) to get the next valid index after 1, which is 4 here.)
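
To make the point about valid indices concrete, here is a small sketch using the same string as above. The variable names `valid` and `second` are just for illustration:

```julia
a = "∈b"

# eachindex yields only the valid character start positions (byte offsets).
valid = collect(eachindex(a))

# nextind steps from one valid index to the next valid one.
second = nextind(a, 1)

# isvalid tells you whether an index is the start of a character.
isvalid(a, 2)    # false: index 2 falls inside the encoding of '∈'

# Iterating over eachindex never hits an invalid index.
for i in eachindex(a)
    println(i, " => ", a[i])
end
```

If you only need the characters and not their positions, iterating with `for c in a` avoids index arithmetic altogether.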


Another difference: Strings are treated as scalars by broadcasting:

jl> println.("Hello, ", 1:7);
Hello, 1
Hello, 2
Hello, 3
Hello, 4
Hello, 5
Hello, 6
Hello, 7

vs

jl> println.(['H', 'e', 'l', 'l', 'o', ',', ' '], 1:7);
H1
e2
l3
l4
o5
,6
 7
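
The same contrast can be written without printing, which makes the results easier to compare. This is only a sketch; `as_scalar` and `elementwise` are names made up for the example:

```julia
# A string is treated as a scalar: it is repeated for every element of 1:2.
as_scalar = string.("ab", 1:2)

# Collecting first gives a Vector{Char}, which broadcasts element by element.
elementwise = string.(collect("ab"), 1:2)
```

So if you do want elementwise behavior over the characters, `collect` the string first.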

Not necessarily: ranges like 1:10 are immutable AbstractVectors, as are SVectors from StaticArrays.jl and many other examples.
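
A quick way to see this for ranges, as a sketch (the names `is_vector` and `threw` are only for illustration, and the exact exception type depends on the Julia version, so the example catches any error):

```julia
r = 1:10

# A range is an AbstractVector...
is_vector = r isa AbstractVector{Int}

# ...but it does not support setindex!, so assignment throws.
threw = try
    r[1] = 5
    false
catch
    true
end
```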

I think the main technical reason that AbstractString is not a subtype of AbstractVector is that the indices of a String are not necessarily consecutive, and in consequence there is no O(1) algorithm to “give me the n-th character of a string” (str[nextind(str, 0, n)] in Julia).
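
As a sketch of what "n-th character" lookup looks like in practice (the string and the names `n`, `i`, `c` are arbitrary choices for the example):

```julia
str = "∈bc"

# nextind(str, 0, n) walks n characters forward from before the start,
# so the cost grows with n rather than being constant.
n = 3
i = nextind(str, 0, n)   # byte index of the n-th character
c = str[i]

# Same character as indexing into a collected Vector{Char},
# which does have O(1) indexing but costs an allocation up front.
c == collect(str)[n]
```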

Conceptually, however, one rarely views a String as a collection of characters, in part because the concept of a “character” is itself ambiguous in Unicode. For example, "no\u00EBl" == "noël" and "noe\u0308l" == "noël" are canonically equivalent strings in Unicode, but the former has length 4 and the latter has length 5 (i.e., different numbers of codepoints, depending on whether a combining character is used to make the ë). For a similar reason, it’s not generally useful to ask for the “n-th character of a string” where n is chosen at random.
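
The codepoint counts are easy to check with the Unicode standard library. A minimal sketch, with variable names chosen only for this example:

```julia
using Unicode

precomposed = "no\u00EBl"    # ë as a single codepoint
combining   = "noe\u0308l"   # e followed by a combining diaeresis

# length counts codepoints, so the two spellings differ.
lens = (length(precomposed), length(combining))

# == compares codepoint by codepoint, so the raw strings are not equal.
raw_equal = precomposed == combining

# After NFC normalization the two strings compare equal.
same_after_nfc = Unicode.normalize(combining, :NFC) == precomposed

# Counting grapheme clusters (user-perceived characters) gives the same number for both.
ngraphemes = (length(graphemes(precomposed)), length(graphemes(combining)))
```

Which of these notions of "character" is the right one depends on what you are doing, which is part of why a single vector-of-characters view is awkward.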

(It is useful to be able to read from an index that was located previously, e.g. by a find function, but in that case the index is just an arbitrary position indicator in the string and you don’t care how many characters it corresponds to.)
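
For instance, a sketch of using an index obtained from a search (the string and the names `r` and `rest` are made up for the example):

```julia
s = "h\u00E9llo world"   # "héllo world", with a 2-byte é

# findfirst returns the index range of the match; these are byte offsets,
# not character counts.
r = findfirst("wor", s)

# The range can be used directly for indexing or slicing.
matched = s[r]
rest = s[first(r):end]
```

Here the 'w' is the 7th character but sits at index 8; the code never needs to know that.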

(Conversely, codeunits(str) for a string str does give a subtype of AbstractVector, but the elements of this array are not characters. They are the code units of the string's Unicode encoding, which are bytes for String with its UTF-8 encoding.)
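
A short sketch of what that looks like for the "∈b" string used earlier in the thread (`cu` is just a name for the example):

```julia
a = "∈b"
cu = codeunits(a)

# The code units form a genuine AbstractVector of bytes.
is_byte_vector = cu isa AbstractVector{UInt8}

# Every integer index in 1:length(cu) is valid, unlike for the string itself.
nbytes = length(cu)      # 3 bytes for '∈' plus 1 for 'b'
last_byte = cu[4]        # the byte encoding 'b'
```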


Nit: immutable

Thanks, fixed.

Thanks for the update and the quick response, it’s very helpful.


Just wondering what the best way is to convert between a string and a vector of characters:

julia> s = "toto"
"toto"

julia> vc = collect(s)
4-element Vector{Char}:
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)

julia> s = String(vc)
"toto"

Those are good ways to do it. Note that collecting a string into a vector of characters can increase the storage by up to 4x, since each Char takes 4 bytes while UTF-8 uses 1 to 4 bytes per character.
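
You can check the storage difference with sizeof. A small sketch using the "toto" example from above:

```julia
s = "toto"
vc = collect(s)

# An ASCII string uses one byte per character in UTF-8.
string_bytes = sizeof(s)

# A Vector{Char} uses 4 bytes per element regardless of the character.
vector_bytes = sizeof(vc)

# Converting back gives the original string.
roundtrip = String(vc) == s
```

For pure-ASCII text like this you hit the full 4x; for text with many multi-byte characters the ratio is smaller.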
