Why are Julia strings immutable?

The question I have (after coming here, apparently VERY late), is WHY are strings immutable?

Meaning, I’m assuming this either means that Julia doesn’t implement strings as arrays of characters, or there is a reason other than data structure that is underlying the reason for their immuatbility.

3 Likes

See the explanation here: Why are strings immutable (historical records request)

Julia doesn’t implement strings as arrays of characters

It doesn’t, it implements them internally as arrays of bytes in a UTF-8 encoding (but this internal array is hidden in Julia internals so that you cannot mutate it). This means that ASCII characters take only 1 byte per character, but other Unicode characters may take 2–4 bytes. In contrast, implementing strings as arrays of characters would take 4 bytes for every character for not much practical benefit.

(You might object that arrays of characters would allow you to query the n-th character quickly, or substitute one character for another in-place, but these operations turn out to be not so useful as you might think in practice, in part because Unicode is much more complicated than most people realize. See also utf8everywhere.org for more discussion of string encoding and why UTF-8 is a choice that many people have adopted.)

The StringViews.jl package implements an AbstractString type that is also UTF-8, but which can be backed by a mutable array of bytes. With this type, one could conceivably implement an in-place filter! function on a StringView. (Deleting characters in-place during iteration is very straightforward in the UTF-8 encoding, despite the variable character width.)

(And, of course, it is perfectly possible to implement a new AbstractString subtype backed by an array of Char. The LegacyStrings.jl package includes a related string type in the UTF-32 encoding, which is an array of Unicode characters identified by codepoint. It’s not something that seems to get much practical usage, however.)

26 Likes

That strings are immutable in Julia has always struck me as amusingly ironic, as usually I write programs to mutate strings (and mutate them a lot).

One pattern is the loop where you’re building up a long string by concatenating smaller strings and separators to the end of it. Another is the search and replace within a string.

It turns out not to impair one’s ability to mutate strings by strings being immutable–you just program it a little differently. Or you program it the slow and highly allocating way, but don’t care because it’s done in less than a millisecond anyhow.

2 Likes

You probably know this, but for the benefit of other readers I will note that this kind of task is generally accomplished in Julia with an IOBuffer, or equivalently with functions like join that internally use an IOBuffer. e.g.

io = IOBuffer()
for i = 1:5
    print(io, "something…")
    print(io, "something else…")
end
s = String(take!(io)) # get the concatenated string
17 Likes

I have an implementation of a string case converter (snake, pascal, camel, kebab, title, you name it) and I now plan on implementing IOBuffer everywhere in it instead.

It’s currently implemented as @blackeneth described - a lot of splicing and splitting of strings stored in vectors and variables. But also as they said, it’s super fast anyway.

And sidenote, StringCases.jl didn’t suit my use case, though I did consider contributing to it.

In many languages strings are immutable to facilitate string interning:

Though this isn’t the case in Julia AFAIK, at least not with String.

3 Likes

This is a general design question whenever you have a datatype: Do you want it to behave like an immutable value (e.g. Int, BigInt or julia String) or like a mutable thing like RefValue{Int} or Vector{Int}?

We can list some advantages and disadvantages of immutablity:

  1. If the data is fixed size and smallish, i.e. fits into a register, then immutability has significant performance advantages.
  2. If the data is not fixed size and smallish, then we are priced into using a reference / heap-allocation.
  3. If the type is immutable, and we want mutations to be shared visible in many places, then we need to wrap, like e.g. RefValue{Int} or RefValue{String}.
  4. If the type is mutable and we do not want mutations to be shared visible, then we need to take copies.
  5. If the type is immutable, then it can become bothersome to iteratively build instances: The naive approach will litter the heap with temporary allocations. E.g. building strings by str = prefix * str * postfix in julia, or str = prefix + str + postfix in most languages. The solution is to have a mutable builder that eventually obtains the result in one go – this is iobuffer in julia and e.g. stringbuilder / stringbuffer in java. Indeed, code like str = a + b + c automatically uses stringbuilders in java!

Now (3) and (4) are obviously in conflict. So one needs to consider how the type is typically used: Is there more focus on read-heavy value-style semantics, or is there more focus on construction? Do we want code like someDict[key] = value to have endless discussions on whether to pull a copy of string keys?

You can immediately understand why the end result comes down to “Array has mutable/identity semantics by default, String and BigInt have immutable value-type semantics by default”.

Julia string has a small difference as opposed to julia bigInt: BigInt is actually mutable, mutation is just discouraged. Part of it is that String is a built-in, while BigInt is just a standard julia type (without specific runtime/compiler/language support).

I think it boils down to: There would be no appreciable upside to making string “mutable in a pinch” like BigInt, since iobuffer is needed for iterative constructions anyways, but there probably would be some downsides.

2 Likes

String perhaps is “mutable in a pinch”. E.g., BigFloat kinda stores the significand in a String (hackery):

2 Likes

I am somewhat surprised, but this bears out. I wonder whether it is intentional or a missed optimization?

I.e. consider

julia> function foo(s, ss, func::Base.RefValue{Any})
       cmp1 = s === ss
       func[]()
       cmp2 = s === ss
       return cmp1 == cmp2
       end

Would it be admissible for the compiler to infer that this always evaluates to true (if it returns at all)? I.e. is jl_egal a pure function?

I think it should be pure, which would make string mutation hard UB as long as equal strings are egal (===)!

But my julia doesn’t optimize that, and maybe egal should not be a pure function after all.

It can do that in certain circumstances. Note the ret i8 1, i.e. return the constant 1 (which is equivalent to true here) at the end of code_llvm:

; Function Signature: foo(Int64, Int64, Base.RefValue{Any})
;  @ REPL[1]:1 within `foo`
define i8 @julia_foo_1525(i64 signext %"s::Int64", i64 signext %"ss::Int64", ptr noundef nonnull align 8 dereferenceable(8) %"func::RefValue") #0 {
top:
  %gcframe1 = alloca [3 x ptr], align 16
  call void @llvm.memset.p0.i64(ptr align 16 %gcframe1, i8 0, i64 24, i1 true)
  %thread_ptr = call ptr asm "movq %fs:0, $0", "=r"() #8
  %tls_ppgcstack = getelementptr i8, ptr %thread_ptr, i64 -8
  %tls_pgcstack = load ptr, ptr %tls_ppgcstack, align 8
  store i64 4, ptr %gcframe1, align 16
  %frame.prev = getelementptr inbounds ptr, ptr %gcframe1, i64 1
  %task.gcstack = load ptr, ptr %tls_pgcstack, align 8
  store ptr %task.gcstack, ptr %frame.prev, align 8
  store ptr %gcframe1, ptr %tls_pgcstack, align 8
;  @ REPL[1]:3 within `foo`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:49 within `getproperty`
    %"func::RefValue.x" = load atomic ptr, ptr %"func::RefValue" unordered, align 8
    %.not = icmp eq ptr %"func::RefValue.x", null
    br i1 %.not, label %fail, label %pass

fail:                                             ; preds = %top
    %jl_undefref_exception = load ptr, ptr @jl_undefref_exception, align 8
    call void @ijl_throw(ptr nonnull %jl_undefref_exception)
    unreachable

pass:                                             ; preds = %top
    %gc_slot_addr_0 = getelementptr inbounds ptr, ptr %gcframe1, i64 2
    store ptr %"func::RefValue.x", ptr %gc_slot_addr_0, align 16
; └└
  %0 = call nonnull ptr @ijl_apply_generic(ptr nonnull %"func::RefValue.x", ptr null, i32 0)
  %frame.prev6 = load ptr, ptr %frame.prev, align 8
  store ptr %frame.prev6, ptr %tls_pgcstack, align 8
;  @ REPL[1]:5 within `foo`
  ret i8 1
}

It can’t eliminate the call to func[]() because of the Any, which is all of the other stuff. After all, that call may modify some invisible state.

1 Like

That was my intention in the litmus test: The call cannot be removed because of side-effects; and if the julia compiler considers egal on strings as pure, then it could infer that the return value as true, but if it has to accomodate mutations, then it needs to re-do the string comparison.

Does your julia know that this returns true if s and ss are strings?

The compiler knows that ints are immutable, and I’m trying to figure out whether the compiler “knows” that strings are immutable (i.e. are strings immutable-by-convention or immutable-by-compiler-assumptions / UB).

It’s even better in Julia. str = a * b * c calls a multi-argument * method that results in a single call to string(a, b, c), which calls a specialized string-concatenation method that should be faster than writing to an IOBuffer because it computes the final size in advance rather than iteratively growing a buffer.

(Of course, maybe Java implementations have a similar optimization rather than literally using a StringBuilder instance?)

I love that it’s (usually) so easy in Julia to trace through the implementation of a “basic” operation like *, which in most other languages would be buried in compiler internals, e.g. simply by typing @edit "a" * "b" * "c".

14 Likes

As far as I know, it does know, but Strings are currently explicitly excluded from such const evals during compilation. See e.g. Concatenation of string constants isn't concretely evaluated, but is inferred as a `Const` · Issue #54921 · JuliaLang/julia · GitHub for another example where const eval of String would in principle be beneficial.

I couldn’t get Cthulhu.jl to work on my laptop, so I can’t check whether foo would infer as Const(true).

1 Like

You could have extra capacity immutable strings, and still allow mutating them for concatenation only, i.e. your new and old string would point to the same heap object and add to it. The pointers would only be accommodated by different lengths. [This also means strings can’t be zero terminated by default, but for Cstring, they can be had by one such mutable 0 addition.]

I think that’s enough for you, and if not, then it’s rather useless to mutate strings in general? It doesn’t work for UTF-8, so you would be assuming an ASCII subset for the full string. So I’m curious is it only important for you to concatenate, or to mutate in the string, and then only for ASCII strings?