Package Quartet: Structured and flexible Identifier types, quick to parse and print
A generic approach to time and space efficient structured representations, applied to academic and biological identifiers
or, alternatively:
How frustration with identifiers led to a descent down the rabbit hole of optimised parsing: a personal saga
Installing
Estimated time of registration: 2026-04-08T02:07:00Z. In the meantime, you’ll want to:
pkg> add http://code.tecosaur.net/tec/PackedParselets.jl.git
pkg> add http://code.tecosaur.net/tec/FastIdentifiers.jl.git
pkg> add http://code.tecosaur.net/tec/{Academic,Bio}Identifiers.jl.git
The .git suffix is needed because LibGit2 doesn't use the headers that Git does when cloning, meaning I can't create Cloudflare rules that easily distinguish HTTP git traffic from AI scrapers trying to give my forge the hug of death.
Backstory
This set of packages has been a while in the making. About two years ago I started getting fed up with the finicky details of working with many biological identifiers across datasets. This involved:
- Realising that string-based comparisons were both slow and fragile to stylistic differences (like upper/lower case)
- Writing quick ad-hoc parsers to turn identifier forms like `HP:0100584` from strings into numbers, to make comparison quicker in large table joins
- Needing to adjust my ad-hoc parsers to handle minor format variations, case differences, etc.
- Forgetting whether an identifier needs to be printed with a certain number of digits or not, so that I can output the right form
- The annoyance of trying to compare identifiers like `ENSG00000139618.11` and `ENSG00000139618.13`, where I want to ignore the optional `.version` suffix
- Accidentally discovering that my ad-hoc parser only works for one of multiple valid forms
- Failing to detect actually malformed identifiers, and either experiencing strange errors or (worse) silent misbehaviour
So, I wrote a small package for parsing a bunch of bibliometric identifiers (the annoyance I was experiencing at the time my level of frustration crossed the “make a package” threshold) called AcademicIdentifiers.jl. I then wanted something similar for biological identifiers, which led to the creation of an interface + common utilities package, née AbstractIdentifiers.jl.
I then noticed that a few things like simply parsing integer values within identifiers took longer than expected. This led to a discussion on Julia’s Zulip around how Base’s parsing is distinctly sub-optimal with @jakobnissen (starting off with DMs on Slack), and then a PR to Base improving one aspect of the situation:
The descent into this rabbit hole started off with a simple fastparse function that was ~5x faster than Base’s tryparse (3.7ns vs 18ns to parse "000789012" on my machine):
fastparse implementation
function fastparse(::Type{Union{X, I}}, str::AbstractString, base::Integer = 10) where {X <: Union{Symbol, Nothing}, I <: Integer}
num = zero(I)
bytes = codeunits(str)
isempty(bytes) && return if X == Symbol; :empty end
i, negative = if I <: Signed && (@inbounds first(bytes)) == UInt8('-')
2, true
else
1, false
end
digit = zero(if I <: Signed Int8 else UInt8 end)
# NOTE: Don't ask me why, but it turns out that `while` is
# considerably faster than `for` here (~7ns vs ~4ns).
@inbounds while true
b = bytes[i]
digit = if b ∈ UInt8('0'):(UInt8('0') - 0x1 + min(base, 10) % UInt8)
b - UInt8('0')
elseif 10 < base <= 36 && (b | 0x20) ∈ UInt8('a'):(UInt8('a') - 0x1 + (base - 10) % UInt8)
(b | 0x20) - (UInt8('a') - UInt8(10))
elseif base > 36 && b ∈ UInt8('A'):(UInt8('z') - UInt8(62) + base % UInt)
b - (UInt8('A') - UInt8(10)) - ifelse(b >= UInt8('a'), 0x06, 0x00)
else
return if X == Symbol; :invalid end
end % if I <: Signed Int8 else UInt8 end
numnext = muladd(widen(num), base % I, digit)
iszero(numnext >> (8 * sizeof(I) - (I <: Signed && !negative))) || return if X == Symbol; :overflow end
I <: Signed && negative && i == length(bytes) && break
num = numnext % I
i == length(bytes) && break
i += 1
end
if I <: Signed && negative
muladd(num, -(base % I), -digit)
else
num
end
end
I also started looking into various writings on SWAR (SIMD within a register) parsing, reading the blogs of people like Daniel Lemire (creator of simdjson).
While reading about various methods of quickly checking and parsing integers, I implemented ~45 fast parser implementations for academic and biological identifiers, using a few helper functions like fastparse and chopprefixes (a version of chopprefix that’s optimised for multiple chops with ASCII case-folding).
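For a sense of what a multi-chop helper does, here’s a semantic sketch (the real chopprefixes is optimised, and its exact signature may differ from this illustrative version):

```julia
# Naive model of a multi-prefix, ASCII-case-folded chop: returns whether any
# prefix matched, plus the remainder of the string (original case preserved).
function chopprefixes_sketch(s::AbstractString, prefixes::AbstractString...)
    for p in prefixes  # assumes ASCII prefixes, tried in order
        if startswith(lowercase(s), lowercase(p))
            return true, SubString(s, ncodeunits(p) + 1)
        end
    end
    false, SubString(s)
end
```

For instance, `chopprefixes_sketch("HTTPS://arxiv.org", "http://", "https://")` returns `(true, "arxiv.org")`.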
Implementing all these parsers, I found myself using a few patterns again and again, which activated my long-standing distaste for boilerplate, and love of macros. Next thing I knew, I’d started work on a DSL for parsing identifier-like strings.
I had far more ideas for how to do this well than time to spend on it, so it sat about a third implemented for around a year, until Opus 4.6 released. With a clear idea of what I wanted to build, and my existing code as a model for the approach/architecture, I was able to drive Opus 4.6 over the past few months to implement and test the rest of the approaches I had in mind. The end result was a library worth spinning off from identifier parsing that I think I can be ambitious enough to say solves the problem of how to efficiently and compactly parse short identifiers: PackedParselets.jl.
Oh, and then I renamed AbstractIdentifiers.jl to FastIdentifiers.jl
which now serves as a thin convenience over PackedParselets + an abstract/interface type, and reimplemented AcademicIdentifiers.jl and BioIdentifiers.jl to use it instead of hand-rolling parsers.
Capabilities
PackedParselets.jl is intended for package authors, not end-users. It defines a sexpr-like DSL that simultaneously defines the parsing and printing of a bitpacked type, e.g.
("https://github.com/JuliaLang/julia/",
:kind(choice("issue", "pull")),
"/",
:num(digits(1:6)),
optional("#issuecomment-",
    :comment(digits(10))))
The available segments are:
"<string>"for a string you want to matchoptional(...)for content that may be presentchoice(...)for exclusive values, which may be simple strings or entire sub-sequencesskip("<string>", choice("<strings>", ...), ...)for content that should be skipped:<name>for declaring propertiesdigits(n | lo:hi)for a sequence of digitsletters(n | lo:hi)for a sequence of characters (charsetconvenience)alphnum(n | lo:hi)for a sequence of digits or letters (chasetconvenience)hex(n | lo:hi)for a sequence of hexadecimal characters (charsetconvenience)charset(n | lo:hi, <characters...>)for any sequence of charactersembedanother PackedParselets-defined type- any custom segments your package wants to add
Let’s use this as a case study for how parsing and printing operates.
You might expect that we check for https://github.com/JuliaLang/julia/ by using startswith, but we can do much better than that. Using the input string’s codeunits, and checking the length to make sure there’s enough content, we check for that prefix with five masked comparisons:
Base.unsafe_load(Ptr{UInt64}(pointer(data, pos))) & 0xffffffdfdfdfdfdf == 0x2f2f3a5350545448 &&
Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 8))) & 0xdfffdfdfdfdfdfdf == 0x432e425548544947 &&
Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 16))) & 0xdfdfdfdfdfffdfdf == 0x41494c554a2f4d4f &&
Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 24))) & 0xdfdfdfffdfdfdfdf == 0x4c554a2f474e414c &&
Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 32))) & 0x0000000000ffdfdf == 0x00000000002f4149
We can then use the fact that Int('i') % 2 and Int('p') % 2 differ by one to jump to a similar comparison to check for "issue" / "pull". We do a bit of light brute-force perfect hashing to look for a way to jump straight from the possible valid inputs to the verification that the entire value matches.
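To make the masked comparison concrete, here’s a small self-contained sketch (not the generated code itself) that checks for a case-folded `https://` prefix with a single 8-byte compare, using the same mask and constant as the first line above:

```julia
# Case-insensitive prefix check in one masked 8-byte comparison.
# ASCII letters differ from their uppercase forms only in bit 0x20, so a
# 0xdf byte-mask folds case on letter bytes, while 0xff keeps ':' and '/' exact.
function has_https_prefix(s::AbstractString)
    cu = codeunits(s)
    length(cu) >= 8 || return false
    word = zero(UInt64)
    for i in 8:-1:1            # assemble a little-endian 8-byte load
        word = (word << 8) | cu[i]
    end
    word & 0xffffffdfdfdfdfdf == 0x2f2f3a5350545448  # "HTTPS://"
end
```

With this, `has_https_prefix("HtTpS://example.com")` is `true` while `has_https_prefix("http://example.com")` is `false`, since the masked fifth byte must be an `S`.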
After "/", we need to parse the issue/pr number, which may be 1 to 6 digits. Because of the requirement that we have already seen https://...{issue,pull}/ that there are at least seven preceding bytes, and so we can jump forwards six bytes, stopping early if we hit the end of the input, and then load the preceding eight bytes into a UInt64. We can then count the number of 0-9 digits at the end, bit-mask + bit-shift the UInt64 accordingly, and parse the entire number at once with SWAR.
attr_num_avail = min(6, (nbytes - pos) + 1)           # bytes available, capped at 6
(nbytes - pos) + 1 >= 1 || return (4, pos)            # need at least one byte
attr_num_swar = htol(Base.unsafe_load(Ptr{UInt64}(pointer(data, (pos + attr_num_avail) - 8))))
attr_num_swar >>>= (8 - attr_num_avail) << 3          # shift out the preceding bytes
# Each byte of nondig_158 is zero iff the corresponding input byte is ASCII '0'-'9'
nondig_158 = (attr_num_swar & (attr_num_swar + 0x0606060606060606)) & 0xf0f0f0f0f0f0f0f0 ⊻ 0x3030303030303030
nondig_158 |= 0xffff000000000000                      # at most 6 digits: mark the top two bytes non-digit
# Count the run of digit bytes from the low (first-seen) end
attr_num_count = trailing_zeros(((nondig_158 - 0x0101010101010101) & ~nondig_158) & 0x8080808080808080 ⊻ 0x8080808080808080) >> 3
!(iszero(attr_num_count)) || return (4, pos)          # at least one digit required
attr_num_swar <<= (8 - attr_num_count) << 3           # align the digits to the top
attr_num_swar &= 0x0f0f0f0f0f0f0000                   # ASCII digits → nibble values
# Binary reduction: bytes → pairs → quads → whole number
attr_num_swar = (attr_num_swar * 0x0000000000000a01) >> 8 & 0x00ff00ff00ff00ff
attr_num_swar = (attr_num_swar * 0x0000000000640001) >> 16 & 0x0000ffff0000ffff
attr_num_swar = (attr_num_swar * 0x0000271000000001) >> 32
attr_num_num = attr_num_swar % UInt32
One way of looking at this is as a binary reduction:
1 2 3 4 5 6 7 8
╰─┬─╯ ╰─┬─╯ ╰─┬─╯ ╰─┬─╯ ×2561, >>8
12 34 56 78
╰───┬───╯ ╰───┬───╯ ×6553601, >>16
1234 5678
╰───────┬────────╯ ×..., >>32
12345678
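As a self-contained illustration of this reduction (using the same three multiply-shift constants as the generated code above, but for a fixed eight digits rather than a variable count):

```julia
# Parse exactly eight ASCII digits at once via SWAR binary reduction.
function parse8digits(s::AbstractString)
    cu = codeunits(s)
    w = zero(UInt64)
    for i in 8:-1:1                # little-endian load: first char in low byte
        w = (w << 8) | cu[i]
    end
    w &= 0x0f0f0f0f0f0f0f0f        # ASCII '0'-'9' → nibble values 0-9
    w = (w * 0x0000000000000a01) >> 8 & 0x00ff00ff00ff00ff   # pairs: 10a + b
    w = (w * 0x0000000000640001) >> 16 & 0x0000ffff0000ffff  # quads: 100(ab) + cd
    ((w * 0x0000271000000001) >> 32) % UInt32                # 10000(abcd) + efgh
end
```

Here `parse8digits("12345678")` gives `12345678`, with each multiply merging adjacent lanes exactly as in the diagram.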
After checking for "#issuecomment-", we take a similar approach with the ten digits, loading the input into a UInt64 and a UInt16. The loading strategy, among other details, is described in the docs. For instance, here’s how we’d grab five bytes at once depending on what we know about the larger string it must be situated within:
forward u64
backward u64 ╭───────────┴────╮
╭───────┴──┼─────────╮ │
░ ░ ░ ░ ░ ░ a b c d e · · · ·
╰──parsed──╯╰─target─╯
╰──┬───╯╰╯
u32 u8
exact
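The “backward u64” case from the diagram, as a self-contained sketch (assuming at least three already-parsed bytes precede the five-byte target, so an eight-byte window ending at the target’s end stays in bounds):

```julia
# Load 5 target bytes by reading the 8 bytes that end where the target ends,
# then shifting away the 3 preceding (already-parsed) bytes.
function load5_backward(data::Vector{UInt8}, pos::Int)
    w = zero(UInt64)
    for i in 0:7                    # little-endian assemble of data[pos-3 : pos+4]
        w |= UInt64(data[pos - 3 + i]) << (8i)
    end
    (w >> 24) & 0x000000ffffffffff  # drop 3 low bytes, keep the 5 target bytes
end
```

For `data = codeunits("xyzabcde")` and `pos = 4`, this yields the bytes of `"abcde"` packed little-endian.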
The issue/pr kind, number, and optional comment number are all packed into a 56-bit type.
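To give a feel for how little space this needs, here’s a hypothetical packing (illustrative only, not PackedParselets’ actual layout): the kind needs one bit, a 6-digit number fits in 20 bits, and a 10-digit comment ID fits in 34 bits, 55 bits in total:

```julia
# Illustrative bit-packing of (kind, num, comment) into a UInt64:
# bit 0 = kind, bits 1-20 = num, bits 21-54 = comment.
pack(ispull::Bool, num::Integer, comment::Integer) =
    UInt64(ispull) | UInt64(num) << 1 | UInt64(comment) << 21

unpack_kind(p::UInt64)    = isodd(p) ? :pull : :issue
unpack_num(p::UInt64)     = (p >> 1) & 0x000fffff
unpack_comment(p::UInt64) = p >> 21

p = pack(true, 60526, 3756359334)  # round-trips all three fields
```

Accessing a field is then just a shift and a mask, which is what makes property access on these types essentially free.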
Printing/stringification essentially mirrors parsing, just in reverse. At this point, I don’t think there’s much scope to further optimise parsing or printing without AVX intrinsics.
So, what sort of performance can you expect when the codegen goes to these great lengths? Well, on my machine:
julia> using FastIdentifiers, Chairmarks
julia> @defid JuliaRepoItem ("https://github.com/JuliaLang/julia/",
:kind(choice("issue", "pull")),
"/",
:num(digits(1:6)),
optional("#issuecomment-",
:comment(digits(10))))
julia> r = parse(JuliaRepoItem, "https://github.com/JuliaLang/julia/pull/60526#issuecomment-3756359334")
JuliaRepoItem:https://github.com/JuliaLang/julia/pull/60526#issuecomment-3756359334
julia> r.kind
:pull
julia> r.num
60526
julia> r.comment
3756359334
julia> show(r)
JuliaRepoItem(:pull, 60526, 3756359334)
julia> randitem() = string("https://github.com/JuliaLang/julia/", ifelse(rand() < 0.5, "issue", "pull"), "/", rand(1:999999), if rand() < 0.5 "" else string("#issuecomment-", rand(1000000000:9999999999)) end)
randitem (generic function with 1 method)
julia> @b randitem() s->parse(JuliaRepoItem, s)
5.806 ns
julia> @b print(devnull, $r)
9.464 ns (1 allocs: 96 bytes)
julia> @b string($r)
36.551 ns (4 allocs: 480 bytes)
It is important to restate that PackedParselets.jl only provides the machinery for producing optimised types. FastIdentifiers.jl provides @defid, an interface for identifiers and a checkdigit segment, and out-of-the-box support for JSON.jl and JSON3.jl via StructTypes.jl and StructUtils.jl package extensions.
With this new foundation, I was able to replace the hand-rolled parsers in AcademicIdentifiers.jl and BioIdentifiers.jl almost entirely with @defid statements, with a couple of notable exceptions such as DOIs, which can’t be expressed with PackedParselets’ DSL due to their (near) unbounded size. This led to a bunch of nice savings, the most notable of which, to my mind, is transforming the ArXiv parser from ~200 loc of gnarly hand-written code that did custom bit-packing to squeeze old and new forms into a single UInt64, into a 32 loc @defid that fits into 49 bytes (rounded up to 56).
The original hand-rolled implementation
struct ArXiv <: AcademicIdentifier
meta::UInt32 # It's this or we go over 8 bytes
number::UInt32
end
function parseid(::Type{ArXiv}, id::SubString)
_, id = lchopfolded(id, "https://", "http://")
isweb, id = lchopfolded(id, "arxiv.org/")
if isweb
prefixend = findfirst('/', id)
isnothing(prefixend) && return MalformedIdentifier{ArXiv}(id, "incomplete ArXiv URL")
id = unsafe_substr(id, prefixend)
else
_, id = lchopfolded(id, "arxiv:")
end
if occursin('/', id)
arxiv_old(id)
else
arxiv_new(id)
end
end
arxiv_meta(archive::UInt8, class::UInt8, year::UInt8, month::UInt8, version::UInt16) =
UInt32(archive) << (32 - 5) +
UInt32(class) << (32 - 11) +
UInt32(year) << (32 - 18) +
UInt32(month) << (32 - 22) +
version
arxiv_archive(arxiv::ArXiv) = (arxiv.meta >> (32 - 5)) % UInt8
arxiv_class(arxiv::ArXiv) = 0x3f & (arxiv.meta >> (32 - 11)) % UInt8
arxiv_year(arxiv::ArXiv) = 0x7f & (arxiv.meta >> (32 - 18)) % UInt8
arxiv_month(arxiv::ArXiv) = 0x0f & (arxiv.meta >> (32 - 22)) % UInt8
arxiv_version(arxiv::ArXiv) = arxiv.meta % UInt16 & 0x03ff
function arxiv_new(id::AbstractString)
ncodeunits(id) >= 6 || return MalformedIdentifier{ArXiv}(id, "is too short to be a valid ArXiv identifier")
bytes = codeunits(id)
bdigit(b::UInt8) = b ∈ 0x30:0x39
local year, month
y1, y2, m1, m2 = @inbounds bytes[1], bytes[2], bytes[3], bytes[4]
all(bdigit, (y1, y2)) || return MalformedIdentifier{ArXiv}(id, "year component (YYmm.nnnnn) must be an integer")
all(bdigit, (m1, m2)) || return MalformedIdentifier{ArXiv}(id, "month component (yyMM.nnnnn) must be an integer")
year = 0xa * (y1 - 0x30) + (y2 - 0x30)
month = 0xa * (m1 - 0x30) + (m2 - 0x30)
month ∈ 1:12 || return MalformedIdentifier{ArXiv}(id, "month component (yyMM.nnnnn) must be between 01 and 12")
(@inbounds bytes[5]) == UInt8('.') || return MalformedIdentifier{ArXiv}(id, "must contain a period separating the date and number component (yymm.nnnnn)")
i, number, version = 6, zero(UInt32), zero(UInt16)
@inbounds while i <= ncodeunits(id)
b = bytes[i]
i += 1
if (b | 0x20) == UInt8('v')
i > ncodeunits(id) && return MalformedIdentifier{ArXiv}(id, "version component must be non-empty")
break
elseif bdigit(b)
number = muladd(number, UInt32(10), b - 0x30)
else
return MalformedIdentifier{ArXiv}(id, "number component (yymm.NNNNN) must be an integer")
end
end
number <= UInt32(99999) || return MalformedIdentifier{ArXiv}(id, "number component (yymm.NNNNN) must no more than 5 digits")
@inbounds while i <= ncodeunits(id)
b = bytes[i]
if bdigit(b)
version = muladd(version, UInt8(10), b - 0x30)
iszero(version & ~0x03ff) || return MalformedIdentifier{ArXiv}(id, "version is larger than the maximum supported value (1023)")
else
return MalformedIdentifier{ArXiv}(id, "version component must be an integer")
end
i += 1
end
ArXiv(arxiv_meta(0x00, 0x00, year, month, version), number)
end
const ARXIV_OLD_ARCHIVES, ARXIV_OLD_CLASSES = let
arxiv_catsubs = (
"astro-ph" => ["CO", "EP", "GA", "HE", "IM", "SR"],
"cond-mat" => ["dis-nn", "mes-hall", "mtrl-sci", "other", "quant-gas", "soft", "stat-mech", "str-el", "supr-con"],
"cs" => ["AI", "AR", "CC", "CE", "CG", "CL", "CR", "CV", "CY", "DB", "DC", "DL", "DM", "DS", "ET",
"FL", "GL", "GR", "GT", "HC", "IR", "IT", "LG", "LO", "MA", "MM", "MS", "NA", "NI", "OH",
"OS", "PF", "PL", "RO", "SC", "SD", "SE", "SI", "SY"],
"econ" => ["EM", "GN", "TH"],
"eess" => ["AS", "IV", "SP", "SY"],
"gr-qc" => String[],
"hep-ex" => String[],
"hep-lat" => String[],
"hep-ph" => String[],
"hep-th" => String[],
"math-ph" => String[],
"math" => ["AC", "AG", "AP", "AT", "CA", "CO", "CT", "CV", "DG", "DS", "FA", "GM", "GN", "GR", "GT",
"HO", "IT", "KT", "LO", "MG", "MP", "NA", "NT", "OA", "OC", "PR", "QA", "RA", "RT", "SG",
"SP", "ST",],
"nlin" => ["AO", "CD", "CG", "PS", "SI"],
"nucl-ex" => String[],
"nucl-th" => String[],
"physics" => ["acc-ph", "ao-ph", "app-ph", "atm-clus", "atom-ph", "bio-ph", "chem-ph", "class-ph",
"comp-ph", "data-an", "ed-pn", "flu-dyn", "gen-ph", "geo-ph", "hist-ph", "ins-det",
"med-ph", "optics", "plasm-ph", "pop-ph", "soc-ph", "space-ph"],
"q-bio" => ["BM", "CB", "GN", "MN", "NC", "OT", "PE", "QM", "SC", "TO"],
"q-fin" => ["CP", "EC", "GN", "MF", "PM", "PR", "RM", "ST", "SR"],
"quant-ph" => String[],
"stat" => ["AP", "CO", "ME", "ML", "OT", "TH"])
map(first, arxiv_catsubs), map(last, arxiv_catsubs)
end
function arxiv_old(id::AbstractString)
bytes = codeunits(id)
slashpos = @something(findfirst(==(UInt8('/')), bytes),
return MalformedIdentifier{ArXiv}(id, "must contain a slash separating the components (archive.class/YYMMNNN)"))
archclass = unsafe_substr(id, 0, slashpos - 1)
numverstr = unsafe_substr(id, slashpos)
dotpos = something(findfirst(==(UInt8('.')), view(bytes, 1:slashpos)), slashpos)
archive = unsafe_substr(archclass, 0, dotpos - 1)
class = unsafe_substr(archclass, dotpos, max(0, slashpos - dotpos - 1))
archiveidx = findfirst(==(archive), ARXIV_OLD_ARCHIVES)
isnothing(archiveidx) && return MalformedIdentifier{ArXiv}(id, "does not use a recognised ArXiv archive name")
classidx = if isempty(class)
0
else
findfirst(==(class), ARXIV_OLD_CLASSES[archiveidx])
end
isnothing(classidx) && return MalformedIdentifier{ArXiv}(id, "does not use a recognised ArXiv archive class")
length(class) ∈ (0, 2) || return MalformedIdentifier{ArXiv}(id, "class component must be 2 characters")
#--
ncodeunits(numverstr) >= 5 || return MalformedIdentifier{ArXiv}(id, "is too short to be a valid ArXiv identifier")
bytes = codeunits(numverstr)
bdigit(b::UInt8) = b ∈ 0x30:0x39
local year, month
y1, y2, m1, m2 = @inbounds bytes[1], bytes[2], bytes[3], bytes[4]
all(bdigit, (y1, y2)) || return MalformedIdentifier{ArXiv}(id, "year component (YYmmnnnn) must be an integer")
all(bdigit, (m1, m2)) || return MalformedIdentifier{ArXiv}(id, "month component (yyMMnnnn) must be an integer")
year = 0xa * (y1 - 0x30) + (y2 - 0x30)
month = 0xa * (m1 - 0x30) + (m2 - 0x30)
(year >= 91 || year <= 7) || return MalformedIdentifier{ArXiv}(id, "year component (YYmmnnn) must be between 91 and 07")
month ∈ 1:12 || return MalformedIdentifier{ArXiv}(id, "month component (yyMMnnnn) must be between 01 and 12")
i, number, version = 5, zero(UInt32), zero(UInt16)
@inbounds while i <= ncodeunits(numverstr)
b = bytes[i]
i += 1
if (b | 0x20) == UInt8('v')
i > ncodeunits(id) && return MalformedIdentifier{ArXiv}(id, "version component must be non-empty")
break
elseif bdigit(b)
number = muladd(number, UInt32(10), b - 0x30)
else
return MalformedIdentifier{ArXiv}(id, "number component (yymmNNNN) must be an integer")
end
end
number <= UInt32(9999) || return MalformedIdentifier{ArXiv}(id, "number component (yymmNNNN) must no more than 4 digits")
@inbounds while i <= ncodeunits(numverstr)
b = bytes[i]
if bdigit(b)
version = muladd(version, UInt8(10), b - 0x30)
iszero(version & ~0x03ff) || return MalformedIdentifier{ArXiv}(id, "version is larger than the maximum supported value (1023)")
else
return MalformedIdentifier{ArXiv}(id, "version component must be an integer")
end
i += 1
end
#--
ArXiv(arxiv_meta(archiveidx % UInt8, classidx % UInt8, year, month, version), number)
end
function shortcode(io::IO, arxiv::ArXiv)
archid, classid = arxiv_archive(arxiv), arxiv_class(arxiv)
if !iszero(archid) # Old form
print(io, ARXIV_OLD_ARCHIVES[archid])
if !iszero(classid)
print(io, '.', ARXIV_OLD_CLASSES[archid][classid])
end
print(io, '/')
end
year, month, ver = arxiv_year(arxiv), arxiv_month(arxiv), arxiv_version(arxiv)
print(io, lpad(year, 2, '0'), lpad(month, 2, '0'))
if iszero(archid) # New form
print(io, '.', lpad(arxiv.number, ifelse(year >= 15, 5, 4), '0'))
else # Old form
print(io, lpad(arxiv.number, 3, '0'))
end
if ver > 0
print(io, 'v', ver)
end
end
idcode(arxiv::ArXiv) =
UInt64(arxiv.meta & 0xffffff00) << 32 +
UInt64(arxiv.number) << 8 +
arxiv_version(arxiv)
purlprefix(::Type{ArXiv}) = "https://arxiv.org/abs/"
function Base.print(io::IO, arxiv::ArXiv)
get(io, :limit, false) === true && get(io, :compact, false) === true ||
print(io, "arXiv:")
shortcode(io, arxiv)
end
The new @defid implementation
@defid(ArXiv <: AcademicIdentifier,
(choice(:format,
:new => seq(:year(digits(2, pad=2)),
:month(digits(2, min=1, max=12, pad=2)),
".", :num(digits(4:5, pad=4)),
optional("v", :ver(digits(max=1023)))),
:old => seq(choice(:archive,
:astro_ph => seq("astro-ph.", :class(choice("CO", "EP", "GA", "HE", "IM", "SR"))),
:cond_mat => seq("cond-mat.", :class(choice("dis-nn", "mes-hall", "mtrl-sci", "other", "quant-gas", "soft", "stat-mech", "str-el", "supr-con"))),
:cs => seq("cs.", :class(choice("AI", "AR", "CC", "CE", "CG", "CL", "CR", "CV", "CY", "DB", "DC", "DL", "DM", "DS", "ET", "FL", "GL", "GR", "GT", "HC", "IR", "IT", "LG", "LO", "MA", "MM", "MS", "NA", "NI", "OH", "OS", "PF", "PL", "RO", "SC", "SD", "SE", "SI", "SY"))),
:econ => seq("econ.", :class(choice("EM", "GN", "TH"))),
:eess => seq("eess.", :class(choice("AS", "IV", "SP", "SY"))),
:gr_qc => "gr-qc",
:hep_ex => "hep-ex",
:hep_lat => "hep-lat",
:hep_ph => "hep-ph",
:hep_th => "hep-th",
:math_ph => "math-ph",
:math => seq("math.", :class(choice("AC", "AG", "AP", "AT", "CA", "CO", "CT", "CV", "DG", "DS", "FA", "GM", "GN", "GR", "GT", "HO", "IT", "KT", "LO", "MG", "MP", "NA", "NT", "OA", "OC", "PR", "QA", "RA", "RT", "SG", "SP", "ST"))),
:nlin => seq("nlin.", :class(choice("AO", "CD", "CG", "PS", "SI"))),
:nucl_ex => "nucl-ex",
:nucl_th => "nucl-th",
:physics => seq("physics.", :class(choice("acc-ph", "ao-ph", "app-ph", "atm-clus", "atom-ph", "bio-ph", "chem-ph", "class-ph", "comp-ph", "data-an", "ed-pn", "flu-dyn", "gen-ph", "geo-ph", "hist-ph", "ins-det", "med-ph", "optics", "plasm-ph", "pop-ph", "soc-ph", "space-ph"))),
:q_bio => seq("q-bio.", :class(choice("BM", "CB", "GN", "MN", "NC", "OT", "PE", "QM", "SC", "TO"))),
:q_fin => seq("q-fin.", :class(choice("CP", "EC", "GN", "MF", "PM", "PR", "RM", "ST", "SR"))),
:quant_ph => "quant-ph",
:stat => seq("stat.", :class(choice("AP", "CO", "ME", "ML", "OT", "TH")))),
"/", :year(digits(2, pad=2, exclude=8:90)),
:month(digits(2, min=1, max=12, pad=2)),
:num(digits(3:4, pad=3)),
optional("v", :ver(digits(max=1023)))))),
prefix="arXiv:", purlprefix="https://arxiv.org/abs/")
Along with the reduction in loc, the @defid implementation is ~9x faster to boot, parsing ArXiv IDs in 10-15ns
While enjoying @defid, I also implemented an About.jl package extension that shows you how values are represented/packed, e.g.
along with a StyledStrings extension that’s used for 3-arg show, and error display:
This is used to provide the following academic identifiers:
- `ArXiv`: arXiv preprint identifiers
- `DOI`: Digital Object Identifiers
- `EAN13`: European Article Numbers (13-digit barcodes)
- `ISBN`: International Standard Book Numbers
- `ISNI`: International Standard Name Identifiers
- `ISSN`: International Standard Serial Numbers
- `OCN`: OCLC Control Numbers
- `OpenAlexID`: OpenAlex entity identifiers
- `ORCID`: Open Researcher and Contributor Identifiers
- `PMCID`: PubMed Central Identifiers
- `PMID`: PubMed Identifiers
- `RAiD`: Research Activity Identifiers
- `ROR`: Research Organization Registry identifiers
- `VIAF`: Virtual International Authority File identifiers
- `Wikidata`: Wikidata entity identifiers
Notably, ISBN also includes hyphenation rules:
julia> parse(ISBN, "9781718502765")
ISBN:978-1-7185-0276-5
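The trailing 5 here is a check digit. As a sketch of the standard EAN-13/ISBN-13 computation (shown for illustration, this is not FastIdentifiers’ checkdigit code), the first twelve digits are weighted alternately by 1 and 3:

```julia
# EAN-13/ISBN-13 check digit: alternate 1,3 weights over the first 12 digits,
# then take the amount needed to round the sum up to a multiple of 10.
function ean13_checkdigit(digits12)
    s = sum(d * (isodd(i) ? 1 : 3) for (i, d) in enumerate(digits12))
    (10 - s % 10) % 10
end
```

For example, `ean13_checkdigit([9,7,8,1,7,1,8,5,0,2,7,6])` gives `5`, matching the parsed ISBN above.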
This uses one of the tricks I’m rather enjoying as of late, data .jl files that can be executed as self-updating shell scripts, see: isbn-hyphenation.jl.
Over in biology land, we provide a litany of identifiers you might encounter:
Protein & Structure
- `AFDB`: AlphaFold Database protein structure predictions
- `PDB`: Protein Data Bank macromolecular structures
- `UniProt`: Universal Protein Resource sequences
- `UniRef`: UniProt Reference Clusters
- `IntAct`: Molecular interaction database
- `InterPro`: Protein family/domain classification
- `Pfam`: Protein family HMM profiles
- `PXD`: ProteomeXchange dataset identifiers
Genomics & Genetics
- `ENSG`, `ENST`, `ENSP`, `ENSE`, `ENSR`, `ENSF`, `ENSFM`: Ensembl genome annotations
- `NCBIGene`: NCBI Entrez Gene database
- `RefSeq`: NCBI Reference Sequence database
- `HGNC`: HUGO Gene Nomenclature Committee
- `INSDC`: International Nucleotide Sequence Database Collaboration
- `OMIM`: Online Mendelian Inheritance in Man
Variation & Clinical
- `CA`: ClinGen Allele Registry
- `ClinVar`: Clinical genomic variant database
- `dbSNP`: Single Nucleotide Polymorphism database
- `dbVar`: Genomic structural variation database
- `GWAS`: NHGRI-EBI GWAS Catalog studies
Expression & Functional Genomics
- `ArrayExpress`: Functional genomics data archive
- `GEO`: Gene Expression Omnibus
Studies & Samples
- `BioProject`: Biological research projects
- `BioSample`: Biological sample metadata
- `ClinicalTrials`: ClinicalTrials.gov registry
- `dbGaP`: Database of Genotypes and Phenotypes
- `EGA`: European Genome-phenome Archive
- `RRID`: Research Resource Identifiers
- `SRA`: Sequence Read Archive
Ontologies & Controlled Vocabularies
- `CL`: Cell Ontology
- `DOID`: Disease Ontology
- `ECO`: Evidence & Conclusion Ontology
- `EFO`: Experimental Factor Ontology
- `GO`: Gene Ontology
- `HPO`: Human Phenotype Ontology
- `MeSH`: Medical Subject Headings
- `MONDO`: Monarch Disease Ontology
- `MP`: Mammalian Phenotype Ontology
- `PATO`: Phenotype And Trait Ontology
- `SO`: Sequence Ontology
- `UBERON`: Uber-anatomy Ontology
Chemical & Metabolic
- `ChEBI`: Chemical Entities of Biological Interest
- `ChEMBL`: Bioactive compound database
- `DrugBank`: Drug and pharmaceutical database
- `HMDB`: Human Metabolome Database
- `KEGG`: Kyoto Encyclopedia of Genes and Genomes
- `MetaboLights`: Metabolomics study archive
- `PubChem`: Chemical compound/substance/assay database
Networks & Interactions
- `BioGRID`: Biological interaction datasets
- `Reactome`: Curated biological pathways
- `WikiPathways`: Community pathway database
Cell Lines & Model Organisms
- `Cellosaurus`: Cell line registry
- `FlyBase`: Drosophila gene database
- `MGI`: Mouse Genome Informatics
- `NCBITaxon`: NCBI Taxonomy database
- `SGD`: Saccharomyces Genome Database
- `WormBase`: Caenorhabditis gene database
Showing off
For fun: in terms of current performance with fixed bases, PackedParselets offers faster number parsing and printing than Base.
julia> @b string(rand(UInt64)) s->parse(UInt64, s)
32.227 ns
julia> @b rand(UInt64) s->print(devnull, s)
20.071 ns (2 allocs: 80 bytes)
julia> @defid MyU64 digits(UInt64)
julia> @b string(rand(UInt64)) s->parse(MyU64, s)
8.401 ns
julia> @b reinterpret(MyU64, rand(UInt64)) s->print(devnull, s)
7.526 ns (1 allocs: 48 bytes)

