What about performance implications for parsing ASCII text files? Consider the following function which checks if a file has an equal number of left brackets and right brackets:
function file_has_balanced_brackets(file)
s = read(file, String)
counter = 0
for c in s
if c == '('
counter += 1
elseif c == ')'
counter -= 1
end
end
return (counter == 0)
end
Since characters in Julia have 4-byte sizes, rather than 1 byte for ASCII, is there a performance penalty for the code above?
Thatās why Iām wondering if Iām losing performance by using Char literals like 'a', rather than e.g. the more verbose UInt8('a') when parsing ASCII files.
Ah sorry I misunderstood. I donāt think working with 32-bit values is typically slower than 8-bit on common architectures but certainly in some contexts it must be less efficientā¦ But the penalty will probably be with the string type (UTF-8) rather than the character type, for example length(str) is inefficient. In Julia if thatās a problem and you know you donāt need UTF-8, you can easily use another string type from LegacyStrings.jl.
High-performance parsers like CSV.jl often work with bytes (ācode unitsā of strings) rather than decoding into Unicode characters, and similarly high-performance string processing often works with codeunits(str). Functions like findnext also have special optimizations when searching for a Char in a String that convert it to a raw byte first. The UTF-8 representation allows you to work with bytes efficiently for searching/matching operations.
Just to expand on @stevengjās excellent answer @greatpet, thereās a difference between characters on their own, and characters in a string. So even though 'a' takes up 8 bytes, the string "abcdefgh" also only stores 8 bytes because of the encoding strategy thatās standard for working with unicode text.
That is, an ascii character only takes up one byte when it is a part of a string.
Now it is however true that working with strings statically known to be pure ASCII can still have performance benefits because thereās less branching and indirection in how itās dealt with, but itās memory footprint is actually not one of the differences.
For many operations, it doesnāt matter whether they are pure ASCII. Suppose that you have the string s = "š» + (šØ + š¼)" and you want to find the character '(' = U+0028. Then you can search the bytes, given by codeunits(s), for the byte 0x28:
In fact, this is what findfirst(==('('), s) does internally.
The point is that when you are parsing files, you are mostly looking for either ASCII characters like ',' or '\n', in which case you can search bytes, or for substrings, in which case you can also look for substrings of bytes. You rarely need to decode things into Unicode characters for many common file formats.
Yes, of course, but thereās still circumstances where parsing pure ascii strings can be faster than parsing full on UTF-8 strings from what I understand.
But yes, I agree that typically there is not a significant different, especially not one we should care about when talking about the parsing performance of julia.
From what I understand, the difference between pure ASCII and general unicode strings ends up being more of a problem for quasi-numerical things like when bioinformatics people use them to encode DNA sequences (but of course in that circumstance youāre typically better off using a more specialized encoding than ASCII)
As I donāt remember the ASCII coding by heart, I wish āChar macrosā existed in a way similar to string macros, so I could write this instead:
findfirst(==(uint8'('), codeunits(s))
Currently I would write
findfirst(==(UInt8('(')), codeunits(s))
knowing that the compiler will carry out the trivial constant folding. This unfortunately make the code a little more verbose, as the UInt8 conversion appears everywhere in my source file.
Another minor inconvenience is that the Base functions for classifying characters, such as isspace and isletter, only accept Char but not UInt8 arguments, so I have to cook up my own versions like this one:
function isspace_uint8(c::UInt8)
c in (UInt8('\t'), UInt8('\n'), UInt8('\v'), UInt8('\f'), UInt8('\r'), UInt8(' '))
end
However, itās perfectly possible to define a string macro that returns a UInt8. For example JuliaSyntax.jl defines a macro u8_str do precisely this: u8"x" returns UInt8('x').
One possibility that has come up from time to time is having a new ASCIIChar subtype of AbstractChar that allows you to introduce new methods that dispatch on this.