I was playing around and just noticed that there is a category for every Char. For example:
julia> Char('a')
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
julia> Char('?')
'?': ASCII/Unicode U+003F (category Po: Punctuation, other)
julia> Char(' ')
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
Is there a way to access the category for a Char variable? Poking around in the show function, I saw a reference to Unicode.category_abbrev(c), but couldn’t figure out how to access it. As someone who parses natural language, I think this could be useful as a direct hook into managing text input in Julia programs (and an alternative for some regex situations).
These are currently undocumented internal functions, but are available:
julia> Base.Unicode.category_abbrev('x')
"Ll"
julia> Base.Unicode.category_code('x')
2
julia> Base.Unicode.category_code('x') == Base.Unicode.UTF8PROC_CATEGORY_LL
true
There are also documented predicates like isletter.
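For the text-parsing use case, here is a minimal sketch of how the internal lookup and a documented predicate could be used side by side. The helper names `major_category`, `strip_punct_internal`, and `strip_punct_documented` are made up for illustration and are not part of Base; the sketch also assumes the internal `Base.Unicode.category_abbrev` keeps its current behavior, which is not guaranteed for undocumented functions.

```julia
# Hypothetical helper (not part of Base): the major Unicode category of a
# character, taken from the first letter of the internal two-letter
# abbreviation (e.g. "Ll" -> 'L', "Po" -> 'P').
major_category(c::AbstractChar) = first(Base.Unicode.category_abbrev(c))

# Drop punctuation using the undocumented category lookup.
strip_punct_internal(s::AbstractString) = filter(c -> major_category(c) != 'P', s)

# Drop punctuation using only the documented predicate `ispunct`.
strip_punct_documented(s::AbstractString) = filter(!ispunct, s)

println(strip_punct_internal("Hello, world!"))
println(strip_punct_documented("Hello, world!"))
```

Where a documented predicate already covers the category you need, it is probably the safer choice, since the internal functions may change without notice.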
That’s great! If I wanted to help add documentation for this (I think it is useful!), how would I go about doing that?
From the REPL:
julia> Base.Unicode.<TAB><TAB>
GraphemeIterator UTF8PROC_CASEFOLD UTF8PROC_CATEGORY_CC UTF8PROC_CATEGORY_CF UTF8PROC_CATEGORY_CN
UTF8PROC_CATEGORY_CO UTF8PROC_CATEGORY_CS UTF8PROC_CATEGORY_LL UTF8PROC_CATEGORY_LM UTF8PROC_CATEGORY_LO
UTF8PROC_CATEGORY_LT UTF8PROC_CATEGORY_LU UTF8PROC_CATEGORY_MC UTF8PROC_CATEGORY_ME UTF8PROC_CATEGORY_MN
UTF8PROC_CATEGORY_ND UTF8PROC_CATEGORY_NL UTF8PROC_CATEGORY_NO UTF8PROC_CATEGORY_PC UTF8PROC_CATEGORY_PD
UTF8PROC_CATEGORY_PE UTF8PROC_CATEGORY_PF UTF8PROC_CATEGORY_PI UTF8PROC_CATEGORY_PO UTF8PROC_CATEGORY_PS
UTF8PROC_CATEGORY_SC UTF8PROC_CATEGORY_SK UTF8PROC_CATEGORY_SM UTF8PROC_CATEGORY_SO UTF8PROC_CATEGORY_ZL
UTF8PROC_CATEGORY_ZP UTF8PROC_CATEGORY_ZS UTF8PROC_CHARBOUND UTF8PROC_COMPAT UTF8PROC_COMPOSE
UTF8PROC_DECOMPOSE UTF8PROC_IGNORE UTF8PROC_LUMP UTF8PROC_NLF2LF UTF8PROC_NLF2LS
UTF8PROC_NLF2PS UTF8PROC_REJECTNA UTF8PROC_STABLE UTF8PROC_STRIPCC UTF8PROC_STRIPMARK
_julia_charmap category_abbrev category_code category_string category_strings
eval graphemes include isassigned iscased
iscntrl isdigit isgraphemebreak isgraphemebreak! isletter
islowercase isnumeric isprint ispunct isspace
isuppercase isxdigit lowercase lowercasefirst normalize
textwidth titlecase uppercase uppercasefirst utf8proc_custom_func
utf8proc_decompose utf8proc_error utf8proc_map
julia> Base.Unicode.category_strings
32-element Vector{String}:
"Other, not assigned"
"Letter, uppercase"
"Letter, lowercase"
"Letter, titlecase"
"Letter, modifier"
"Letter, other"
"Mark, nonspacing"
"Mark, spacing combining"
"Mark, enclosing"
"Number, decimal digit"
⋮
"Symbol, other"
"Separator, space"
"Separator, line"
"Separator, paragraph"
"Other, control"
"Other, format"
"Other, surrogate"
"Other, private use"
"Invalid, too high"
"Malformed, bad data"
julia> Base.Unicode.category_string
category_string (generic function with 1 method)
julia> Base.Unicode.category_string('a')
"Letter, lowercase"
julia> Base.Unicode.category_string('1')
"Number, decimal digit"
julia> Base.Unicode.category_string('^')
"Symbol, modifier"
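As a rough sketch of how `category_string` might be put to work on a whole string, the following tallies how many characters fall into each category. The function name `category_counts` is hypothetical, and the code relies on the undocumented `Base.Unicode.category_string`, so treat it as an illustration rather than a stable API.

```julia
# Hypothetical helper (not part of Base): count the characters of `s`
# per Unicode category, keyed by the long category name.
function category_counts(s::AbstractString)
    counts = Dict{String,Int}()
    for c in s
        # Internal, undocumented lookup; may change between Julia versions.
        name = Base.Unicode.category_string(c)
        counts[name] = get(counts, name, 0) + 1
    end
    return counts
end

# Print the tally for a short sample string.
for (name, n) in category_counts("Hi 42!")
    println(rpad(name, 24), n)
end
```

A dictionary keyed by the descriptive string is convenient for quick inspection; for anything performance-sensitive, keying on `category_code` instead would avoid building strings for every character.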
Adding docstrings to these functions is easy: just file a pull request. If we actually want to export something (probably from the Unicode stdlib), we’d want to decide on what is the most useful API, and this may require some discussion, so maybe file an issue first.