Efficiency of parsing ASCII vs. Unicode

What about performance implications for parsing ASCII text files? Consider the following function which checks if a file has an equal number of left brackets and right brackets:

function file_has_balanced_brackets(file)
    s = read(file, String)
    counter = 0
    for c in s
        if c == '('
            counter += 1
        elseif c == ')'
            counter -= 1
        end
    end
    return (counter == 0)
end

Since characters in Julia have 4-byte sizes, rather than 1 byte for ASCII, is there a performance penalty for the code above?

ASCII characters are only 1 byte in UTF-8.

1 Like

See this:

julia> sizeof('a')
4

julia> sizeof(UInt8('a'))
1

That’s why I’m wondering if I’m losing performance by using Char literals like 'a', rather than e.g. the more verbose UInt8('a') when parsing ASCII files.

Ah sorry I misunderstood. I don’t think working with 32-bit values is typically slower than 8-bit on common architectures but certainly in some contexts it must be less efficient… But the penalty will probably be with the string type (UTF-8) rather than the character type, for example length(str) is inefficient. In Julia if that’s a problem and you know you don’t need UTF-8, you can easily use another string type from LegacyStrings.jl.

High-performance parsers like CSV.jl often work with bytes (“code units” of strings) rather than decoding into Unicode characters, and similarly high-performance string processing often works with codeunits(str). Functions like findnext also have special optimizations when searching for a Char in a String that convert it to a raw byte first. The UTF-8 representation allows you to work with bytes efficiently for searching/matching operations.

4 Likes

Just to expand on @stevengj’s excellent answer @greatpet, there’s a difference between characters on their own, and characters in a string. So even though 'a' takes up 8 bytes, the string "abcdefgh" also only stores 8 bytes because of the encoding strategy that’s standard for working with unicode text.

That is, an ascii character only takes up one byte when it is a part of a string.

Here’s a demo:

julia> sizeof(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
32

vs

julia> sizeof("abcdefgh")
8

Now it is however true that working with strings statically known to be pure ASCII can still have performance benefits because there’s less branching and indirection in how it’s dealt with, but it’s memory footprint is actually not one of the differences.

For many operations, it doesn’t matter whether they are pure ASCII. Suppose that you have the string s = "🐻 + (🐨 + 🐼)" and you want to find the character '(' = U+0028. Then you can search the bytes, given by codeunits(s), for the byte 0x28:

julia> codeunits(s)
20-element Base.CodeUnits{UInt8, String}:
 0xf0
 0x9f
 0x90
 0xbb
 0x20
 0x2b
 0x20
 0x28
 0xf0
 0x9f
 0x90
 0xa8
 0x20
 0x2b
 0x20
 0xf0
 0x9f
 0x90
 0xbc
 0x29

julia> findfirst(==(0x28), codeunits(s))
8

In fact, this is what findfirst(==('('), s) does internally.

The point is that when you are parsing files, you are mostly looking for either ASCII characters like ',' or '\n', in which case you can search bytes, or for substrings, in which case you can also look for substrings of bytes. You rarely need to decode things into Unicode characters for many common file formats.

Yes, of course, but there’s still circumstances where parsing pure ascii strings can be faster than parsing full on UTF-8 strings from what I understand.

But yes, I agree that typically there is not a significant different, especially not one we should care about when talking about the parsing performance of julia.

From what I understand, the difference between pure ASCII and general unicode strings ends up being more of a problem for quasi-numerical things like when bioinformatics people use them to encode DNA sequences (but of course in that circumstance you’re typically better off using a more specialized encoding than ASCII)

As I don’t remember the ASCII coding by heart, I wish “Char macros” existed in a way similar to string macros, so I could write this instead:

findfirst(==(uint8'('), codeunits(s))

Currently I would write

findfirst(==(UInt8('(')), codeunits(s))

knowing that the compiler will carry out the trivial constant folding. This unfortunately make the code a little more verbose, as the UInt8 conversion appears everywhere in my source file.

Another minor inconvenience is that the Base functions for classifying characters, such as isspace and isletter, only accept Char but not UInt8 arguments, so I have to cook up my own versions like this one:

function isspace_uint8(c::UInt8)
    c in (UInt8('\t'), UInt8('\n'), UInt8('\v'), UInt8('\f'), UInt8('\r'), UInt8(' '))
end

Alternatively, I could define

isspace_uint8(c::UInt8) = isspace(Char(c))

but this doesn’t seem ideal for performance.

Yes, that has been raised before. Unfortunately it’s not possible with ' syntax because of the way x'y' bar is parsed as x' * y'. This was discussed in custom character literal macros? · Issue #26305 · JuliaLang/julia · GitHub

However, it’s perfectly possible to define a string macro that returns a UInt8. For example JuliaSyntax.jl defines a macro u8_str do precisely this: u8"x" returns UInt8('x').

One possibility that has come up from time to time is having a new ASCIIChar subtype of AbstractChar that allows you to introduce new methods that dispatch on this.

3 Likes

This sounds like an argument for short variable names😉

In this context, at least, α beats alpha.

1 Like