Is Julia well-suited for string manipulation?

Why do I sometimes read that Julia is not particularly well suited for string manipulation? I read about Perl or Python being good at that but I’ve never seen a technical explanation what puts Julia at a disadvantage performance-wise.

1 Like

I don’t think Julia is at an inherent disadvantage — you can certainly write heavily optimized text-processing code like CSV.jl and Parsers.jl in Julia. For example, a few years back I helped optimize a Japanese tokenizer called TinySegmenter.jl, and the result was considerably faster than optimized implementations of a similar algorithm in Ruby, Python, and Perl.

However, if you write naive string-processing code in Julia, using the same style as Python, allocating zillions of temporary strings, then Julia performs worse — the language isn’t designed for code that does lots of small allocations in critical inner loops. It’s similar to people who port numerical code from Matlab (or Numpy) to Julia line-by-line, and often find that their initial Julia port is slower, as in this recent thread for example: MATLAB outperforms Julia (20 times faster) running this nested loop

Of course, for particular tasks a particular language may benefit from some heavily optimized library, for which Julia does not yet have an equivalent. It’s also true that the majority of people writing high-performance libraries for Julia have thus far been focused more on numerical computations than text processing.

7 Likes

This part sounds interesting, I wouldn’t have expected a language like Python to perform better in anything related to critical inner loops. So is the underlying reason that our garbage collector doesn’t perform well when you allocate lots of small strings that are discarded quickly? Is reference counting better in that situation?

1 Like

Reference counting is pretty hard to beat in this type of scenario (although better escape analysis certainly could help).

5 Likes

the architecture of our GC may be categorically worse compared to Reference counting when it comes to this allocation pattern, but our specific implementation certainly had improved (maybe still more room?), for example The GC often doesn't act generational · Issue #40644 · JuliaLang/julia · GitHub used to be really bad

@jakobnissen has pointed out a number of missing string features relative to Rust. Nonallocating functions like Add split_n function by jakobnissen · Pull Request #43557 · JuliaLang/julia · GitHub and String search operations needlessly allocate · Issue #45393 · JuliaLang/julia · GitHub. Jakob may remember other problems.

Saying “I wish Base had function foo built-in” is totally different from saying “there’s something wrong with strings in Julia”. For example, the split_n function you linked can be implemented by the user with a single line of Julia code if needed.

4 Likes

I’ve found it difficult to write efficient string code in Julia. It’s mostly not due to bad design of Julia’s strings, but instead just a lack of non-allocating or ASCII-only operations. A few things off the top of my head:

  • It’s difficult to read a line from a file without allocating
  • There are no mutable strings
  • There is no non-allocating split_once function
  • Non-allocating split was added in 1.8 only. It still allocates for substring search, I believe
  • There is no way to split a string in its lines, without allocating

I believe performant libraries that process strings in Julia, like CSV.jl and JSON.jl in fact work on raw byte array directly for most applications in lieu of actually performant Julia operations.

In my work, I use strings about as frequently as arrays, and to me it’s painfully obvious how much more energy has gone into the huge amount of arrays in Julia compared to strings. Imagine trying to write efficient numerical Julia code if there were no mutable arrays, and many array-related functions needlessly allocated.

4 Likes

GitHub - JuliaStrings/StringViews.jl: String-like views of arbitrary Julia byte arrays are mutable (by mutating the underlying byte array). Mutating a UTF-8 encoded string is something that mostly makes sense for ASCII characters, of course, i.e. working with bytes.

Iterating over eachsplit(s, '\n') doesn’t allocate.

It would be nice to add a readuntil! and/or readline! method that reads into a pre-allocated byte array (which you could then use with e.g. a StringView).

6 Likes

Iterating over eachsplit(s, '\n') doesn’t allocate.

Right, but you’d also need to be able to split on "\r\n". However, I just checked, and it seems if the regex is known at compile time, it doesn’t allocate to split on that, so I was mistaken.

3 Likes

Update: laid some groundwork in in-place `readuntil!` by stevengj · Pull Request #48273 · JuliaLang/julia · GitHub

Update: I forgot that there is already at least one package for in-place line reading based on StringViews and buffers: GitHub - rickbeeloo/ViewReader

19 Likes

I find that I have to use code units if I want to work efficiently with ASCII strings.
Say I want to count the number of '1's in a string of '1's and '0's.

import Random
s = Random.randstring(('0', '1'), 100)
count_ones_a(s) = sum(x -> x === '1', s)
count_ones_b(s) = sum(x -> x === UInt8('1'), codeunits(s))
julia> @btime count_ones_a($s)
  70.474 ns (0 allocations: 0 bytes)
44

julia> @btime count_ones_b($s)
  16.348 ns (0 allocations: 0 bytes)
44

This is part of the Julia API. But, it’s not really advertised or organized as a way to work with ASCII strings (AFAIK).

I’m working on a library that treats different representations of binary sequences with a common API. For strings, I always do things like in example above. Another example:

EDIT: But the OP was comparing to perl and python. I have seen python perform well compared to Julia with ASCII strings. But I don’t recall the context or details. In the example above, my best attempt with Python is still slower than with julia

In [35]: %timeit s.count('1')
131 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [36]: sb = bytes(s, 'ascii')

In [37]: %timeit sb.count(b'1')
124 ns ± 0.729 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [38]: %timeit bytes(s, 'ascii').count(b'1')
211 ns ± 1.35 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

The second one is cheating because the input is not a plain string str. The last one is slower because it allocates. I don’t know how to get a non-allocating view of the code units in a string in Python.

A bigger lesson here is that in order to try to do this with Python, my only hope is to use builtin functions (or library with C extensions). In Julia I can use something built in, or write any kind of low-level code I like (or in this case, use a the nice high-level interface offered by sum)

3 Likes

For ascii strings, is there any reason not to use ASCIIStr from GitHub - JuliaString/Strs.jl: String support package for Julia?

Several functions have special implementations if you search for ASCII characters. e.g. findnext(==('1'), s) will use a byte-search method since 1 is ASCII. It would be easy to add a specialized method for `count(==(‘1’), s) and similar, but of course it’s impossible to cover all possible cases.

1 Like

For the record (refer to @stevengj 's comment above):

  • x -> x === '1' is not slower than ==('1'). In this case (we are comparing characters). If I were searching for matches in a vector of strings, I would expect === to be faster. This is what I’ve seen in similiar situations in the past. But, just now in v1.9, counting ocurrences of "cat" in a Vector of "dog" and "cat" is slower with === than ==. shrug_emoji.
  • count(==('1'), s) is 25% faster than sum(==('1'), s) according to a couple of tests.
  • The doc string in the function normalize_bitstring above has an error. The function will work for strings containing non-ascii characters. All bytes but the first in a code point have the high bit set to 1, so they can’t be confused with ascii characters. This is why findnext can make the optimization for an ascii character.

And count(x -> x===(0x31), codeunits(s)) is similarly faster than the sum version.

Thanks. I tried all four, and inadvertently omitted mentioning those:

julia> count_ones_b(s) = sum(==(UInt8('1')), codeunits(s));


julia> count_ones_d(s) = count(==(UInt8('1')), codeunits(s));

julia> @btime count_ones_b($s);
  17.567 ns (0 allocations: 0 bytes)

julia> @btime count_ones_d($s);
  14.220 ns (0 allocations: 0 bytes)

Using codeunits, performance doesn’t suffer when a specialized implementation is not available or not possible at all (eg when the predicate is x -> x == 'a' and not ==('a')). So it’s often better to explicitly use codeunits when only ASCII is relevant, instead of looking for specific optimized methods.

1 Like

I agree. Note also that Julia has a b"foo" string macro for creating codeunit literals with string syntax, so this can make working with codeunit arrays more “string-like”.

So, for example, you can do bytes == b"foo" to compare a byte array to the bytes of "foo", or do append!(bytes, b"foo") to do “byte-string” concatenation in-place.

(I wish there were a similar compact syntax for getting the UInt8 value of 1-byte characters rather than UInt8('x'). I guess we could add a flag to b"..." syntax, like b"x"1 to get 1 byte, but that’s not particularly pretty either.)

Rust has b'A', we could have that as well. It wouldn’t even need to be generic, i.e. implemented with @foo_char macro.

1 Like