Dispatch on Unicode categories

Hi, I’ve been using Python for a long time and am starting to learn Julia a bit by going through the MIT course as well as Exercism exercises.

I recently came across a simple ROT13 exercise in Exercism. As part of the solution, I need to do slightly different things depending on whether a character is uppercase, lowercase, or some other non-alphabetic character.

I don’t want to actually show the solution code here since it’s an exercise and that’s not the point of this post, but my basic structure is simply:

function rotate(k, c::AbstractChar)
    if isuppercase(c)
        # do uppercase ROT-k calculation
    elseif islowercase(c)
        # do lowercase ROT-k calculation
    else
        # do nothing
    end
end

function rotate(k, str::AbstractString)
    # do entire string ROT-k calculation
end

Now, if I do typeof('a') I just get Char, but if I do Char('a') I get:

'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

And I’m wondering if there’s any way to dispatch on any of the information given there, for example to do something like this (guessing at syntax here):

rotate(k, str::AbstractString) = # do entire string ROT-k calculation

rotate(k, c::AbstractChar) = # do nothing

rotate(k, c::Unicode{Ll}) = # do lowercase ROT-k calculation

rotate(k, c::Unicode{Lu}) = # do uppercase ROT-k calculation

And if so, does this have any performance (or other) implications, positive or negative?

Thanks!

It’s not possible as you describe it here, since you can only dispatch on types, and things like Unicode categories are only known at runtime. It is possible to move that information into the type domain though, for example with Val. See the section on value types in the docs.

You could do the following, for example:

julia> f(c) = _f(c, Val(Symbol(Base.Unicode.category_abbrev(c))))
f (generic function with 1 method)

julia> _f(c, ::Val{:Ll}) = print("$c is in Ll")
_f (generic function with 1 method)

julia> _f(c, _) = print("$c in other")
_f (generic function with 2 methods)

julia> f('a')
a is in Ll
julia> f('.')
. in other

Runtime dispatch like this will typically be slower than if-else statements, so use this with care.
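
To make the cost a bit more concrete, here is a minimal sketch of why the Val approach ends up as runtime dispatch. The helper name category_val is made up for illustration, and Base.return_types is an introspection utility rather than a stable public API, so treat this as an assumption-laden check and not a guarantee:

```julia
# Hypothetical helper: build the Val the same way as in the f/_f example above.
category_val(c::AbstractChar) = Val(Symbol(Base.Unicode.category_abbrev(c)))

# The concrete type depends on the character's value, not on its type.
typeof(category_val('a'))  # Val{:Ll}
typeof(category_val('A'))  # Val{:Lu}

# Ask the compiler what it can infer when it only knows the argument is a Char.
inferred = only(Base.return_types(category_val, (Char,)))

# If this is not a concrete type, the method that consumes the Val
# has to be looked up while the program runs.
isconcretetype(inferred)
```

Because the type parameter is computed from a runtime value, the compiler cannot pick the _f method ahead of time, which is where the extra time and allocations come from.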


Yes, I wouldn’t use dispatch for this. Just use if-else statements, or possibly a dictionary lookup.

You can get the unicode category, though, which might be helpful here.

If you call the (undocumented, internal) Base.Unicode.category_code(char) function, it returns one of the category codes from utf8proc. For example, Base.Unicode.category_code('x') returns 2, which is the same as Base.Unicode.UTF8PROC_CATEGORY_LL.
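
As a rough sketch of what branching on the category code could look like, both with if-else and with a dictionary lookup: the names letter_case, letter_case_dict, CAT_LU, CAT_LL and CASE_BY_CODE are made up here, and since category_code and the UTF8PROC_CATEGORY_* constants are internal, they may change between Julia versions.

```julia
# Internal, undocumented constants: names and values may change between Julia versions.
const CAT_LU = Base.Unicode.UTF8PROC_CATEGORY_LU  # Lu: Letter, uppercase
const CAT_LL = Base.Unicode.UTF8PROC_CATEGORY_LL  # Ll: Letter, lowercase

# Hypothetical helper: classify a character with plain branches on its category code.
function letter_case(c::AbstractChar)
    code = Base.Unicode.category_code(c)
    if code == CAT_LU
        return :upper
    elseif code == CAT_LL
        return :lower
    else
        return :other
    end
end

# The same decision as a dictionary lookup with a default for everything else.
const CASE_BY_CODE = Dict(CAT_LU => :upper, CAT_LL => :lower)
letter_case_dict(c::AbstractChar) =
    get(CASE_BY_CODE, Base.Unicode.category_code(c), :other)
```

Both versions return a plain Symbol, so the caller can branch on it without moving anything into the type domain.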

That being said, my understanding is that ROT13 is really only for ASCII Latin characters, so you can just check 'a' ≤ char ≤ 'z' and 'A' ≤ char ≤ 'Z'.
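
A minimal sketch of that ASCII-only check, leaving the actual rotation out since it is an exercise (ascii_case is a made-up helper name):

```julia
# Hypothetical helper: ASCII-only classification using character comparisons.
function ascii_case(c::AbstractChar)
    if 'a' <= c <= 'z'
        return :lower
    elseif 'A' <= c <= 'Z'
        return :upper
    else
        return :other  # digits, punctuation, whitespace, and all non-ASCII characters
    end
end
```

Note that this treats a non-ASCII letter such as 'é' as :other, whereas islowercase('é') is true, so the two approaches differ outside the ASCII range.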


Thanks for the replies!

Definitely agreed that this kind of dispatch would be wild overkill for something like ROT13; this was more of an ‘I wonder if…’ scenario. Since I’m coming from another language, I’m kinda trying to figure out both what’s possible and what’s actually a good idea.

Just because I was curious about the actual performance implications you mentioned, I tried benchmarking both solutions:

_rotate(k, c, ::Val{:Ll}) = # do lowercase calc

_rotate(k, c, ::Val{:Lu}) = # do uppercase calc

_rotate(k, c, _) = c

rotate(k, str::AbstractString) = # do entire string calc

rotate(k, c::AbstractChar) = _rotate(k, c, Val(Symbol(Base.Unicode.category_abbrev(c))))

using BenchmarkTools, Random
include("rotational-cipher.jl")
characters = ('a':'z', 'A':'Z', ".,!?;: ") |> Iterators.flatten |> collect
@benchmark rotate(13, x) setup=(x=randstring(characters, 20))

And got the following for the dispatch version:

BenchmarkTools.Trial: 
  memory estimate:  816 bytes
  allocs estimate:  23
  --------------
  minimum time:     64.841 μs (0.00% GC)
  median time:      66.142 μs (0.00% GC)
  mean time:        67.810 μs (0.00% GC)
  maximum time:     141.109 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

And for the conditional version:

BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  3
  --------------
  minimum time:     272.926 ns (0.00% GC)
  median time:      304.735 ns (0.00% GC)
  mean time:        316.575 ns (0.64% GC)
  maximum time:     2.380 μs (85.36% GC)
  --------------
  samples:          10000
  evals/sample:     298

So as you both mentioned, there’s a significant performance hit for doing things this way. Thanks again for pointing me to how to do this, and why it’s probably not the best idea!
