Accessing the category of a Char

ultradian · August 12, 2023, 1:53pm

I was playing around and just noticed that there is a category for every Char. For example:

julia> Char('a')
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
julia> Char('?')
'?': ASCII/Unicode U+003F (category Po: Punctuation, other)
julia> Char(' ')
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

Is there a way to access the category for a Char variable? Poking around in the show function, I saw a reference to Unicode.category_abbrev(c), but couldn’t figure out how to call it. As someone who parses natural language, I think this could be useful as a direct hook into managing text input in Julia programs (and an alternative to regex in some situations).

stevengj · August 12, 2023, 2:00pm

These are currently undocumented internal functions, but are available:

julia> Base.Unicode.category_abbrev('x')
"Ll"

julia> Base.Unicode.category_code('x')
2

julia> Base.Unicode.category_code('x') == Base.Unicode.UTF8PROC_CATEGORY_LL
true

There are also documented predicates like isletter.
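
As a rough sketch of how the calls above might be used on a whole string: the snippet below maps each character to its category abbreviation and defines a small predicate from the category code. The helper name `is_lowercase_letter` and the sample text are made up for illustration, and since `category_abbrev` and `category_code` are undocumented internals, their behavior could change between Julia versions.

```julia
# Assumption: Base.Unicode.category_abbrev and category_code are internal,
# undocumented functions and may change in future Julia releases.

sample = "Hi, 42!"  # illustrative input

# One two-letter Unicode category abbreviation per character.
abbrevs = [Base.Unicode.category_abbrev(c) for c in sample]

# Hypothetical helper: true when the character's category code is Ll.
is_lowercase_letter(c::AbstractChar) =
    Base.Unicode.category_code(c) == Base.Unicode.UTF8PROC_CATEGORY_LL

# The documented predicate isletter covers the common case without internals.
letters_only = filter(isletter, sample)

println(abbrevs)
println(is_lowercase_letter('x'), " ", is_lowercase_letter('X'))
println(letters_only)
```

Where a documented predicate such as `isletter` is enough, it is probably the safer choice, since it is part of the public API.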

ultradian · August 12, 2023, 2:24pm

That’s great! If I wanted to help add documentation for this (I think it is useful!), how would I go about doing that?

rocco_sprmnt21 · August 12, 2023, 2:49pm

From the REPL, tab completion lists what is available in the module:

julia> Base.Unicode.<TAB><TAB>

GraphemeIterator      UTF8PROC_CASEFOLD     UTF8PROC_CATEGORY_CC  UTF8PROC_CATEGORY_CF  UTF8PROC_CATEGORY_CN
UTF8PROC_CATEGORY_CO  UTF8PROC_CATEGORY_CS  UTF8PROC_CATEGORY_LL  UTF8PROC_CATEGORY_LM  UTF8PROC_CATEGORY_LO
UTF8PROC_CATEGORY_LT  UTF8PROC_CATEGORY_LU  UTF8PROC_CATEGORY_MC  UTF8PROC_CATEGORY_ME  UTF8PROC_CATEGORY_MN
UTF8PROC_CATEGORY_ND  UTF8PROC_CATEGORY_NL  UTF8PROC_CATEGORY_NO  UTF8PROC_CATEGORY_PC  UTF8PROC_CATEGORY_PD
UTF8PROC_CATEGORY_PE  UTF8PROC_CATEGORY_PF  UTF8PROC_CATEGORY_PI  UTF8PROC_CATEGORY_PO  UTF8PROC_CATEGORY_PS
UTF8PROC_CATEGORY_SC  UTF8PROC_CATEGORY_SK  UTF8PROC_CATEGORY_SM  UTF8PROC_CATEGORY_SO  UTF8PROC_CATEGORY_ZL
UTF8PROC_CATEGORY_ZP  UTF8PROC_CATEGORY_ZS  UTF8PROC_CHARBOUND    UTF8PROC_COMPAT       UTF8PROC_COMPOSE
UTF8PROC_DECOMPOSE    UTF8PROC_IGNORE       UTF8PROC_LUMP         UTF8PROC_NLF2LF       UTF8PROC_NLF2LS
UTF8PROC_NLF2PS       UTF8PROC_REJECTNA     UTF8PROC_STABLE       UTF8PROC_STRIPCC      UTF8PROC_STRIPMARK
_julia_charmap        category_abbrev       category_code         category_string       category_strings
eval                  graphemes             include               isassigned            iscased
iscntrl               isdigit               isgraphemebreak       isgraphemebreak!      isletter
islowercase           isnumeric             isprint               ispunct               isspace
isuppercase           isxdigit              lowercase             lowercasefirst        normalize
textwidth             titlecase             uppercase             uppercasefirst        utf8proc_custom_func
utf8proc_decompose    utf8proc_error        utf8proc_map

julia> Base.Unicode.category_strings
32-element Vector{String}:
 "Other, not assigned"
 "Letter, uppercase"
 "Letter, lowercase"
 "Letter, titlecase"
 "Letter, modifier"
 "Letter, other"
 "Mark, nonspacing"
 "Mark, spacing combining"
 "Mark, enclosing"
 "Number, decimal digit"
 ⋮
 "Symbol, other"
 "Separator, space"
 "Separator, line"
 "Separator, paragraph"
 "Other, control"
 "Other, format"
 "Other, surrogate"
 "Other, private use"
 "Invalid, too high"
 "Malformed, bad data"

julia> Base.Unicode.category_string
category_string (generic function with 1 method)

julia> Base.Unicode.category_string('a')
"Letter, lowercase"

julia> Base.Unicode.category_string('1')
"Number, decimal digit"

julia> Base.Unicode.category_string('^')
"Symbol, modifier"
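
For the text-processing use case raised in the original question, one possible sketch is to tally the characters of a string by their category description. The function name `category_counts` and the sample input are hypothetical, and `category_string` is an internal function, so this is an illustration under those assumptions rather than a supported API.

```julia
# Assumption: Base.Unicode.category_string is internal and undocumented.

# Hypothetical helper: count how many characters fall into each category.
function category_counts(s::AbstractString)
    counts = Dict{String,Int}()
    for c in s
        label = Base.Unicode.category_string(c)
        counts[label] = get(counts, label, 0) + 1
    end
    return counts
end

counts = category_counts("Hello, World 42!")

# Print the tallies in a stable (alphabetical) order.
for label in sort(collect(keys(counts)))
    println(rpad(label, 24), counts[label])
end
```

A dictionary keyed by the description string keeps the sketch short; keying by `category_code` instead would avoid string comparisons if that matters.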

stevengj · August 13, 2023, 11:05am

Adding docstrings to these functions is easy: just file a pull request. If we actually want to export something (probably from the Unicode stdlib), we’d want to decide on what is the most useful API, and this may require some discussion, so maybe file an issue first.
