Accessing the category of a Char

I was playing around and just noticed that there is a category for every Char. For example:

julia> Char('a')
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
julia> Char('?')
'?': ASCII/Unicode U+003F (category Po: Punctuation, other)
julia> Char(' ')
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

Is there a way to access the category for a Char variable? Poking around in the show function, I saw a reference to Unicode.category_abbrev(c), but couldn’t figure out how to call it. As someone who parses natural language, I think this could be useful as a direct hook into managing text input in Julia programs (and an alternative to regex in some situations).

These are currently undocumented internal functions, but are available:

julia> Base.Unicode.category_abbrev('x')
"Ll"

julia> Base.Unicode.category_code('x')
2

julia> Base.Unicode.category_code('x') == Base.Unicode.UTF8PROC_CATEGORY_LL
true
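To tie this back to the text-processing use case in the question, here is a minimal sketch that tallies the characters of a string by category abbreviation. `category_counts` is a hypothetical helper name, not something Julia provides, and because `Base.Unicode.category_abbrev` is internal it could change between Julia versions.

```julia
# Hypothetical helper: count characters per Unicode category abbreviation.
# Relies on the internal (undocumented) Base.Unicode.category_abbrev.
function category_counts(text::AbstractString)
    counts = Dict{String,Int}()
    for c in text
        abbrev = Base.Unicode.category_abbrev(c)   # e.g. "Ll", "Po", "Zs"
        counts[abbrev] = get(counts, abbrev, 0) + 1
    end
    return counts
end

counts = category_counts("Hello, world!")
```

A `Dict` keyed by the abbreviation keeps the sketch independent of the numeric category codes.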

There are also documented predicates like isletter.
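If a coarse classification is enough, the documented predicates may be all that is needed. The following is a rough sketch using only functions exported from Base; `coarse_class` and its symbols are made-up names for illustration.

```julia
# Hypothetical helper: coarse character class from documented predicates only.
function coarse_class(c::AbstractChar)
    isletter(c)  && return :letter
    isnumeric(c) && return :number
    ispunct(c)   && return :punctuation
    isspace(c)   && return :space
    return :other   # symbols, control characters, and so on
end

classes = [(c, coarse_class(c)) for c in "Hi, 2 u?"]
```

This avoids internals entirely, at the cost of not distinguishing the finer categories (for example uppercase versus lowercase letters, which `isuppercase` and `islowercase` would cover).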


That’s great! If I wanted to help add documentation for this (I think it is useful!), how would I go about doing that?

From the REPL:

julia> Base.Unicode.<TAB><TAB>

GraphemeIterator      UTF8PROC_CASEFOLD     UTF8PROC_CATEGORY_CC  UTF8PROC_CATEGORY_CF  UTF8PROC_CATEGORY_CN
UTF8PROC_CATEGORY_CO  UTF8PROC_CATEGORY_CS  UTF8PROC_CATEGORY_LL  UTF8PROC_CATEGORY_LM  UTF8PROC_CATEGORY_LO
UTF8PROC_CATEGORY_LT  UTF8PROC_CATEGORY_LU  UTF8PROC_CATEGORY_MC  UTF8PROC_CATEGORY_ME  UTF8PROC_CATEGORY_MN
UTF8PROC_CATEGORY_ND  UTF8PROC_CATEGORY_NL  UTF8PROC_CATEGORY_NO  UTF8PROC_CATEGORY_PC  UTF8PROC_CATEGORY_PD
UTF8PROC_CATEGORY_PE  UTF8PROC_CATEGORY_PF  UTF8PROC_CATEGORY_PI  UTF8PROC_CATEGORY_PO  UTF8PROC_CATEGORY_PS
UTF8PROC_CATEGORY_SC  UTF8PROC_CATEGORY_SK  UTF8PROC_CATEGORY_SM  UTF8PROC_CATEGORY_SO  UTF8PROC_CATEGORY_ZL
UTF8PROC_CATEGORY_ZP  UTF8PROC_CATEGORY_ZS  UTF8PROC_CHARBOUND    UTF8PROC_COMPAT       UTF8PROC_COMPOSE
UTF8PROC_DECOMPOSE    UTF8PROC_IGNORE       UTF8PROC_LUMP         UTF8PROC_NLF2LF       UTF8PROC_NLF2LS
UTF8PROC_NLF2PS       UTF8PROC_REJECTNA     UTF8PROC_STABLE       UTF8PROC_STRIPCC      UTF8PROC_STRIPMARK
_julia_charmap        category_abbrev       category_code         category_string       category_strings
eval                  graphemes             include               isassigned            iscased
iscntrl               isdigit               isgraphemebreak       isgraphemebreak!      isletter
islowercase           isnumeric             isprint               ispunct               isspace
isuppercase           isxdigit              lowercase             lowercasefirst        normalize
textwidth             titlecase             uppercase             uppercasefirst        utf8proc_custom_func
utf8proc_decompose    utf8proc_error        utf8proc_map
julia> Base.Unicode.category_strings
32-element Vector{String}:
 "Other, not assigned"
 "Letter, uppercase"
 "Letter, lowercase"
 "Letter, titlecase"
 "Letter, modifier"
 "Letter, other"
 "Mark, nonspacing"
 "Mark, spacing combining"
 "Mark, enclosing"
 "Number, decimal digit"
 ⋮
 "Symbol, other"
 "Separator, space"
 "Separator, line"
 "Separator, paragraph"
 "Other, control"
 "Other, format"
 "Other, surrogate"
 "Other, private use"
 "Invalid, too high"
 "Malformed, bad data"

julia> Base.Unicode.category_string
category_string (generic function with 1 method)

julia> Base.Unicode.category_string('a')
"Letter, lowercase"

julia> Base.Unicode.category_string('1')
"Number, decimal digit"

julia> Base.Unicode.category_string('^')
"Symbol, modifier"
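As one possible alternative to a regex for simple tokenizing, the category functions can be used to split text into runs of the same major category (the first letter of the abbreviation: L, N, P, S, Z, and so on). This is only a sketch: `category_runs` is a hypothetical name, and it depends on the internal `Base.Unicode.category_abbrev`.

```julia
# Hypothetical helper: split text into runs sharing the same major category.
# Uses the internal Base.Unicode.category_abbrev; the major category is taken
# to be the first letter of the abbreviation ("Ll" -> 'L', "Po" -> 'P').
function category_runs(text::AbstractString)
    runs = Tuple{Char,String}[]
    buf = IOBuffer()
    current = nothing                      # major category of the open run
    for c in text
        major = first(Base.Unicode.category_abbrev(c))
        if current !== nothing && major != current
            push!(runs, (current, String(take!(buf))))   # close the run
        end
        current = major
        print(buf, c)
    end
    current !== nothing && push!(runs, (current, String(take!(buf))))
    return runs
end

runs = category_runs("Hi, you!")
```

Each element pairs the major category with the substring it covers, so letters, punctuation, and separators come out as separate tokens.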

Adding docstrings to these functions is easy: just file a pull request. If we actually want to export something (probably from the Unicode stdlib), we’d need to decide what the most useful API is. That may require some discussion, so maybe file an issue first.
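For anyone who has not written one before, a docstring in Julia is just a string literal placed directly above the definition. The sketch below attaches one to a hypothetical user-side wrapper, `char_category`; the name and the wording are assumptions for illustration and not the text that would go into Base.

```julia
"""
    char_category(c::AbstractChar) -> String

Return the two-letter Unicode general category abbreviation of `c`,
for example `"Ll"` for a lowercase letter.
"""
char_category(c::AbstractChar) = Base.Unicode.category_abbrev(c)  # internal call

abbrev = char_category('a')
```

After this is defined, typing `?char_category` at the REPL displays the docstring.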