Is Julia well-suited for string manipulation?

jules · January 13, 2023, 4:43pm

Why do I sometimes read that Julia is not particularly well suited for string manipulation? I read about Perl or Python being good at that but I’ve never seen a technical explanation what puts Julia at a disadvantage performance-wise.

stevengj · January 13, 2023, 5:25pm

I don’t think Julia is at an inherent disadvantage — you can certainly write heavily optimized text-processing code like CSV.jl and Parsers.jl in Julia. For example, a few years back I helped optimize a Japanese tokenizer called TinySegmenter.jl, and the result was considerably faster than optimized implementations of a similar algorithm in Ruby, Python, and Perl.

However, if you write naive string-processing code in Julia, using the same style as Python, allocating zillions of temporary strings, then Julia performs worse — the language isn’t designed for code that does lots of small allocations in critical inner loops. It’s similar to people who port numerical code from Matlab (or Numpy) to Julia line-by-line, and often find that their initial Julia port is slower, as in this recent thread for example: MATLAB outperforms Julia (20 times faster) running this nested loop

Of course, for particular tasks a particular language may benefit from some heavily optimized library, for which Julia does not yet have an equivalent. It’s also true that the majority of people writing high-performance libraries for Julia have thus far been focused more on numerical computations than text processing.

jules · January 13, 2023, 5:29pm

This part sounds interesting, I wouldn’t have expected a language like Python to perform better in anything related to critical inner loops. So is the underlying reason that our garbage collector doesn’t perform well when you allocate lots of small strings that are discarded quickly? Is reference counting better in that situation?

Oscar_Smith · January 13, 2023, 5:34pm

Reference counting is pretty hard to beat in this type of scenario (although better escape analysis certainly could help).

jling · January 13, 2023, 5:59pm

the architecture of our GC may be categorically worse compared to Reference counting when it comes to this allocation pattern, but our specific implementation certainly had improved (maybe still more room?), for example The GC often doesn't act generational · Issue #40644 · JuliaLang/julia · GitHub used to be really bad

jar1 · January 13, 2023, 7:58pm

@jakobnissen has pointed out a number of missing string features relative to Rust. Nonallocating functions like Add method `split(str, dlm, ::Val{N})` for allocation-free splitting by jakobnissen · Pull Request #43557 · JuliaLang/julia · GitHub and String search operations needlessly allocate · Issue #45393 · JuliaLang/julia · GitHub. Jakob may remember other problems.

stevengj · January 13, 2023, 8:18pm

Saying “I wish Base had function foo built-in” is totally different from saying “there’s something wrong with strings in Julia”. For example, the split_n function you linked can be implemented by the user with a single line of Julia code if needed.

jakobnissen · January 13, 2023, 8:18pm

I’ve found it difficult to write efficient string code in Julia. It’s mostly not due to bad design of Julia’s strings, but instead just a lack of non-allocating or ASCII-only operations. A few things off the top of my head:

It’s difficult to read a line from a file without allocating
There are no mutable strings
There is no non-allocating split_once function
Non-allocating split was added in 1.8 only. It still allocates for substring search, I believe
There is no way to split a string in its lines, without allocating

I believe performant libraries that process strings in Julia, like CSV.jl and JSON.jl in fact work on raw byte array directly for most applications in lieu of actually performant Julia operations.

In my work, I use strings about as frequently as arrays, and to me it’s painfully obvious how much more energy has gone into the huge amount of arrays in Julia compared to strings. Imagine trying to write efficient numerical Julia code if there were no mutable arrays, and many array-related functions needlessly allocated.

stevengj · January 13, 2023, 8:19pm

GitHub - JuliaStrings/StringViews.jl: String-like views of arbitrary Julia byte arrays are mutable (by mutating the underlying byte array). Mutating a UTF-8 encoded string is something that mostly makes sense for ASCII characters, of course, i.e. working with bytes.

Iterating over eachsplit(s, '\n') doesn’t allocate.

It would be nice to add a readuntil! and/or readline! method that reads into a pre-allocated byte array (which you could then use with e.g. a StringView).

jakobnissen · January 13, 2023, 8:34pm

Iterating over eachsplit(s, '\n') doesn’t allocate.

Right, but you’d also need to be able to split on "\r\n". However, I just checked, and it seems if the regex is known at compile time, it doesn’t allocate to split on that, so I was mistaken.

stevengj · January 14, 2023, 1:55am

Update: laid some groundwork in copyuntil(out::IO, in::IO, delim) by stevengj · Pull Request #48273 · JuliaLang/julia · GitHub

Update: I forgot that there is already at least one package for in-place line reading based on StringViews and buffers: GitHub - rickbeeloo/ViewReader

jlapeyre · January 17, 2023, 6:46pm

I find that I have to use code units if I want to work efficiently with ASCII strings.
Say I want to count the number of '1's in a string of '1's and '0's.

import Random
s = Random.randstring(('0', '1'), 100)
count_ones_a(s) = sum(x -> x === '1', s)
count_ones_b(s) = sum(x -> x === UInt8('1'), codeunits(s))

julia> @btime count_ones_a($s)
  70.474 ns (0 allocations: 0 bytes)
44

julia> @btime count_ones_b($s)
  16.348 ns (0 allocations: 0 bytes)
44

This is part of the Julia API. But, it’s not really advertised or organized as a way to work with ASCII strings (AFAIK).

I’m working on a library that treats different representations of binary sequences with a common API. For strings, I always do things like in example above. Another example:

github.com

jlapeyre/BitsX.jl/blob/720dccacee358a5579d4e8889b124a7a78dede94/src/bits.jl#L496-L516


      
          """
              normalize_bitstring(str::AbstractString)
          
          Remove all characters (more precisely, code points) from `str` that are not one
          of `'0'` and `'1'`, if such characters exist. Otherwise, return `str`
          unchanged. Non-ASCII characters in `str` may cause incorrect results.
          """
          function normalize_bitstring(str::AbstractString)
              nbits = count_bits(str)
              nbits == ncodeunits(str) && return str
              v = Base.StringVector(nbits) # Not exported, but fast
              j = 1
              for i in eachindex(codeunits(str))
                  c = @inbounds codeunit(str, i)
                  if is_binary_char(c)
                      @inbounds v[j] = c
                      j += 1
                  end
              end
              return String(v)
          end

EDIT: But the OP was comparing to perl and python. I have seen python perform well compared to Julia with ASCII strings. But I don’t recall the context or details. In the example above, my best attempt with Python is still slower than with julia

In [35]: %timeit s.count('1')
131 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [36]: sb = bytes(s, 'ascii')

In [37]: %timeit sb.count(b'1')
124 ns ± 0.729 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [38]: %timeit bytes(s, 'ascii').count(b'1')
211 ns ± 1.35 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

The second one is cheating because the input is not a plain string str. The last one is slower because it allocates. I don’t know how to get a non-allocating view of the code units in a string in Python.

A bigger lesson here is that in order to try to do this with Python, my only hope is to use builtin functions (or library with C extensions). In Julia I can use something built in, or write any kind of low-level code I like (or in this case, use a the nice high-level interface offered by sum)

jar1 · January 17, 2023, 7:35pm

For ascii strings, is there any reason not to use ASCIIStr from GitHub - JuliaString/Strs.jl: String support package for Julia?

stevengj · January 17, 2023, 7:44pm

Several functions have special implementations if you search for ASCII characters. e.g. findnext(==('1'), s) will use a byte-search method since 1 is ASCII. It would be easy to add a specialized method for `count(==(‘1’), s) and similar, but of course it’s impossible to cover all possible cases.

jlapeyre · January 17, 2023, 8:32pm

For the record (refer to @stevengj 's comment above):

x -> x === '1' is not slower than ==('1'). In this case (we are comparing characters). If I were searching for matches in a vector of strings, I would expect === to be faster. This is what I’ve seen in similiar situations in the past. But, just now in v1.9, counting ocurrences of "cat" in a Vector of "dog" and "cat" is slower with === than ==. shrug_emoji.
count(==('1'), s) is 25% faster than sum(==('1'), s) according to a couple of tests.
The doc string in the function normalize_bitstring above has an error. The function will work for strings containing non-ascii characters. All bytes but the first in a code point have the high bit set to 1, so they can’t be confused with ascii characters. This is why findnext can make the optimization for an ascii character.

DNF · January 17, 2023, 9:49pm

And count(x -> x===(0x31), codeunits(s)) is similarly faster than the sum version.

jlapeyre · January 17, 2023, 11:07pm

Thanks. I tried all four, and inadvertently omitted mentioning those:

julia> count_ones_b(s) = sum(==(UInt8('1')), codeunits(s));


julia> count_ones_d(s) = count(==(UInt8('1')), codeunits(s));

julia> @btime count_ones_b($s);
  17.567 ns (0 allocations: 0 bytes)

julia> @btime count_ones_d($s);
  14.220 ns (0 allocations: 0 bytes)

aplavin · January 17, 2023, 11:28pm

Using codeunits, performance doesn’t suffer when a specialized implementation is not available or not possible at all (eg when the predicate is x -> x == 'a' and not ==('a')). So it’s often better to explicitly use codeunits when only ASCII is relevant, instead of looking for specific optimized methods.

stevengj · January 18, 2023, 3:55am

I agree. Note also that Julia has a b"foo" string macro for creating codeunit literals with string syntax, so this can make working with codeunit arrays more “string-like”.

So, for example, you can do bytes == b"foo" to compare a byte array to the bytes of "foo", or do append!(bytes, b"foo") to do “byte-string” concatenation in-place.

(I wish there were a similar compact syntax for getting the UInt8 value of 1-byte characters rather than UInt8('x'). I guess we could add a flag to b"..." syntax, like b"x"1 to get 1 byte, but that’s not particularly pretty either.)

jakobnissen · January 19, 2023, 10:01am

Rust has b'A', we could have that as well. It wouldn’t even need to be generic, i.e. implemented with @foo_char macro.

Topic		Replies	Views
Is it possible to see performant string processing without sacrificing ease of use in Julia one day? Performance question , strings	7	787	May 28, 2022
Julia can process text similar to Perl New to Julia question , strings	12	1908	March 20, 2020
Julia subsumes ICON: well done Internals & Design strings	3	746	June 25, 2021
Community string benchmark suite General Usage	3	152	October 1, 2024
String optimisation in Julia General Usage performance , strings , io	21	555	September 21, 2024

Is Julia well-suited for string manipulation?

Related topics