[ANN] A quartet of identifier-related parsing packages: PackedParselets.jl, FastIdentifiers.jl, AcademicIdentifiers.jl, BioIdentifiers.jl

Package Quartet: Structured and flexible Identifier types, quick to parse and print

A generic approach to time and space efficient structured representations, applied to academic and biological identifiers

or, alternatively:

How frustration with identifiers led to a descent down the rabbit hole of optimised parsing: a personal saga

Installing

Estimated time of registration: 2026-04-08T02:07:00Z. In the meantime, you’ll want to:

pkg> add http://code.tecosaur.net/tec/PackedParselets.jl.git
pkg> add http://code.tecosaur.net/tec/FastIdentifiers.jl.git
pkg> add http://code.tecosaur.net/tec/{Academic,Bio}Identifiers.jl.git
The .git suffix is needed because LibGit2 doesn't use the headers that Git does when cloning, meaning I can't create Cloudflare rules that easily distinguish HTTP git traffic from AI scrapers trying to give my forge the hug of death.

Backstory

This set of packages has been a while in the making. About two years ago I started getting fed up with the finicky details of working with many biological identifiers across datasets. This involved:

  • Realising that string-based comparisons were both slow and fragile to stylistic differences (like upper/lower case)
  • Writing quick ad-hoc parsers to turn identifier forms like HP:0100584 from strings into numbers, making comparisons quicker in large table joins
  • Needing to adjust my ad-hoc parsers to handle minor format variations, case differences, etc.
  • Forgetting whether an identifier needs to be printed with a certain number of digits or not, so that I can output the right form
  • The annoyance of trying to compare identifiers like ENSG00000139618.11 and ENSG00000139618.13, where I want to ignore the optional .version suffix
  • Accidentally discovering that my ad-hoc parser only works for one of multiple valid forms
  • Failing to detect actually malformed identifiers, and either experiencing strange errors or (worse) silent misbehaviour

So, I wrote a small package for parsing a bunch of bibliometric identifiers (the annoyance I was experiencing at the time my level of frustration crossed the “make a package” threshold) called AcademicIdentifiers.jl. I then wanted something similar for biological identifiers, which led to the creation of an interface + common utilities package, née AbstractIdentifiers.jl.

I then noticed that a few things like simply parsing integer values within identifiers took longer than expected. This led to a discussion on Julia’s Zulip around how Base’s parsing is distinctly sub-optimal with @jakobnissen (starting off with DMs on Slack), and then a PR to Base improving one aspect of the situation:

The descent into this rabbit hole started off with a simple fastparse function that was ~5x faster than Base’s tryparse (3.7ns vs 18ns to parse "000789012" on my machine):

fastparse implementation
function fastparse(::Type{Union{X, I}}, str::AbstractString, base::Integer = 10) where {X <: Union{Symbol, Nothing}, I <: Integer}
    num = zero(I)
    bytes = codeunits(str)
    isempty(bytes) && return if X == Symbol; :empty end
    i, negative = if I <: Signed && (@inbounds first(bytes)) == UInt8('-')
        2, true
    else
        1, false
    end
    digit = zero(if I <: Signed Int8 else UInt8 end)
    # NOTE: Don't ask me why, but it turns out that `while` is
    # considerably faster than `for` here (~7ns vs ~4ns).
    @inbounds while true
        b = bytes[i]
        digit = if b ∈ UInt8('0'):(UInt8('0') - 0x1 + min(base, 10) % UInt8)
            b - UInt8('0')
        elseif 10 < base <= 36 && (b | 0x20) ∈ UInt8('a'):(UInt8('a') - 0x1 + (base - 10) % UInt8)
            (b | 0x20) - (UInt8('a') - UInt8(10))
        elseif base > 36 && b ∈ UInt8('A'):(UInt8('z') - UInt8(62) + base % UInt)
            b - (UInt8('A') - UInt8(10)) - ifelse(b >= UInt8('a'), 0x06, 0x00)
        else
            return if X == Symbol; :invalid end
        end % if I <: Signed Int8 else UInt8 end
        numnext = muladd(widen(num), base % I, digit)
        iszero(numnext >> (8 * sizeof(I) - (I <: Signed && !negative))) || return if X == Symbol; :overflow end
        I <: Signed && negative && i == length(bytes) && break
        num = numnext % I
        i == length(bytes) && break
        i += 1
    end
    if I <: Signed && negative
        muladd(num, -(base % I), -digit)
    else
        num
    end
end

I also started looking into various writings on SWAR (SIMD within a register) parsing, reading the blogs of people like Daniel Lemire (creator of simdjson).

While reading about various methods of quickly checking and parsing integers, I wrote ~45 fast parsers for academic and biological identifiers, using a few helper functions like fastparse and chopprefixes (a version of chopprefix that’s optimised for multiple chops with ASCII case-folding).

Implementing all these parsers, I found myself using a few patterns again and again, which activated my long-standing distaste for boilerplate and love of macros. Next thing I knew, I’d started work on a DSL for parsing identifier-like strings.

I had far more ideas for how to do this well than time to spend on it, so the project sat about a third implemented for around a year, until Opus 4.6 was released. With a clear idea of what I wanted to build, and my existing code as a model for the approach/architecture, I was able to drive Opus 4.6 over the past few months to implement and test the rest of the approaches I had in mind. The end result was a library worth spinning off from identifier parsing, one I think I can be ambitious enough to say solves the problem of how to efficiently and compactly parse short identifiers: PackedParselets.jl.

Oh, and then I renamed AbstractIdentifiers.jl to FastIdentifiers.jl :rocket: which now serves as a thin convenience over PackedParselets + an abstract/interface type, and reimplemented AcademicIdentifiers.jl and BioIdentifiers.jl to use it instead of hand-rolling parsers.

Capabilities

PackedParselets.jl is intended for package authors, not end-users. It defines a sexpr-like DSL that simultaneously defines the parsing and printing of a bitpacked type, e.g.

("https://github.com/JuliaLang/julia/",
 :kind(choice("issue", "pull")),
 "/",
 :num(digits(1:6)),
 optional("#issuecomment-",
          :comment(digits(10))))

The available segments are:

  • "<string>" for a string you want to match
  • optional(...) for content that may be present
  • choice(...) for exclusive values, which may be simple strings or entire sub-sequences
  • skip("<string>", choice("<strings>", ...), ...) for content that should be skipped
  • :<name> for declaring properties
  • digits(n | lo:hi) for a sequence of digits
  • letters(n | lo:hi) for a sequence of letters (charset convenience)
  • alphnum(n | lo:hi) for a sequence of digits or letters (charset convenience)
  • hex(n | lo:hi) for a sequence of hexadecimal characters (charset convenience)
  • charset(n | lo:hi, <characters...>) for a sequence of the given characters
  • embed another PackedParselets-defined type
  • any custom segments your package wants to add

Let’s use this as a case study for how parsing and printing operates.

You might expect that we check for https://github.com/JuliaLang/julia/ using startswith, but we can do much better than that. By working on the input string’s codeunits, and checking the length up front to make sure there’s enough content, we can check for that prefix with five masked comparisons:

Base.unsafe_load(Ptr{UInt64}(pointer(data, pos))) & 0xffffffdfdfdfdfdf == 0x2f2f3a5350545448 &&
    Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 8))) & 0xdfffdfdfdfdfdfdf == 0x432e425548544947 &&
    Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 16))) & 0xdfdfdfdfdfffdfdf == 0x41494c554a2f4d4f &&
    Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 24))) & 0xdfdfdfffdfdfdfdf == 0x4c554a2f474e414c &&
    Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 32))) & 0x0000000000ffdfdf == 0x00000000002f4149
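As a standalone illustration of the same trick (my own sketch, not PackedParselets’ generated code; masked_prefix_https is a made-up name), here’s the first of those comparisons as a runnable function. The 0xdf bytes in the mask clear the ASCII case bit on the letter positions, so lower- and upper-case inputs both match.

```julia
# Sketch: case-insensitively match the 8-byte prefix "https://" with a
# single masked UInt64 comparison.
function masked_prefix_https(s::AbstractString)
    bytes = codeunits(s)
    length(bytes) >= 8 || return false
    word = zero(UInt64)
    for i in 8:-1:1                 # assemble the first 8 bytes, little-endian
        word = (word << 8) | bytes[i]
    end
    # 0xdf clears the ASCII case bit (0x20) on the five letter bytes "https";
    # 0xff keeps ":" and the two "/" exact. 0x2f2f3a5350545448 is "HTTPS://"
    # read little-endian.
    word & 0xffffffdfdfdfdfdf == 0x2f2f3a5350545448
end
```

The generated code shown above avoids the byte-assembly loop entirely by `unsafe_load`ing directly from the string’s memory.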

We can then use the fact that Int('i') % 2 and Int('p') % 2 differ by one to jump to a similar comparison to check for "issue" / "pull". We do a bit of light brute-force perfect hashing to look for a way to jump straight from the possible valid inputs to the verification that the entire value matches.
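In miniature, that dispatch looks something like this (a toy sketch with a hypothetical kind_of name, not the generated code):

```julia
# The low bit of the first byte separates "pull" ('p' = 0x70, even) from
# "issue" ('i' = 0x69, odd): jump straight to the candidate via that bit,
# then verify the whole word actually matches.
function kind_of(s::AbstractString)
    isempty(s) && return nothing
    idx = first(codeunits(s)) & 0x01
    expected = ("pull", "issue")[idx + 1]
    startswith(s, expected) ? Symbol(expected) : nothing
end
```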

After "/", we need to parse the issue/pr number, which may be 1 to 6 digits. Because we have already matched https://...{issue,pull}/, we know there are at least seven preceding bytes, so we can jump forward six bytes (stopping early if we hit the end of the input) and then load the preceding eight bytes into a UInt64. We can then count the number of 0-9 digits at the end, bit-mask + bit-shift the UInt64 accordingly, and parse the entire number at once with SWAR.

attr_num_avail = min(6, (nbytes - pos) + 1)
(nbytes - pos) + 1 >= 1 || return (4, pos)
attr_num_swar = htol(Base.unsafe_load(Ptr{UInt64}(pointer(data, (pos + attr_num_avail) - 8))))
attr_num_swar >>>= (8 - attr_num_avail) << 3
nondig_158 = (attr_num_swar & (attr_num_swar + 0x0606060606060606)) & 0xf0f0f0f0f0f0f0f0 ⊻ 0x3030303030303030
nondig_158 |= 0xffff000000000000
attr_num_count = trailing_zeros(((nondig_158 - 0x0101010101010101) & ~nondig_158) & 0x8080808080808080 ⊻ 0x8080808080808080) >> 3
!(iszero(attr_num_count)) || return (4, pos)
attr_num_swar <<= (8 - attr_num_count) << 3
attr_num_swar &= 0x0f0f0f0f0f0f0000
attr_num_swar = (attr_num_swar * 0x0000000000000a01) >> 8 & 0x00ff00ff00ff00ff
attr_num_swar = (attr_num_swar * 0x0000000000640001) >> 16 & 0x0000ffff0000ffff
attr_num_swar = (attr_num_swar * 0x0000271000000001) >> 32
attr_num_num = attr_num_swar % UInt32

One way of looking at this is as binary reduction:

  1   2   3   4   5   6   7   8
  ╰─┬─╯   ╰─┬─╯   ╰─┬─╯   ╰─┬─╯    ×2561, >>8
   12      34      56      78
   ╰───┬───╯        ╰───┬───╯      ×6553601, >>16
      1234            5678
       ╰───────┬────────╯          ×..., >>32
           12345678
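The same reduction in runnable form (my own standalone sketch, assuming exactly eight ASCII digits; the generated code above additionally handles the masking and shifting needed for fewer digits):

```julia
# SWAR parse of eight ASCII digits: mask to digit values, then combine
# pairs, quads, and halves with three multiply-shift steps.
function parse8digits(s::AbstractString)
    bytes = codeunits(s)
    @assert length(bytes) == 8
    word = zero(UInt64)
    for i in 8:-1:1                    # little-endian load: first digit in lowest byte
        word = (word << 8) | bytes[i]
    end
    word &= 0x0f0f0f0f0f0f0f0f                                      # '0'-'9' -> 0-9
    word = (word * 0x0000000000000a01) >> 8  & 0x00ff00ff00ff00ff   # 1,2 -> 12
    word = (word * 0x0000000000640001) >> 16 & 0x0000ffff0000ffff   # 12,34 -> 1234
    word = (word * 0x0000271000000001) >> 32                        # 1234,5678 -> 12345678
    word % UInt32
end
```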

After checking for "#issuecomment-", we take a similar approach with the ten digits, loading the input into a UInt64 and a UInt16. The loading strategy, among other details, is described in the docs. For instance, here’s how we’d grab five bytes at once depending on what we know about the larger string they must be situated within:

                forward u64
backward u64 ╭───────────┴────╮
  ╭───────┴──┼─────────╮      │
  ░ ░ ░ ░ ░ ░ a b c d e · · · ·
  ╰──parsed──╯╰─target─╯
              ╰──┬───╯╰╯
                u32   u8
                 exact

The issue/pr kind, number, and optional comment number are all packed into a 56-bit type.

Printing/stringification essentially mirrors parsing, just in reverse. At this point, I don’t think there’s much scope to further optimise parsing or printing without AVX intrinsics.
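To sketch what “parsing in reverse” means here (my own illustration using plain divrem rather than SWAR; digits8 is a made-up name), the reduction tree from earlier is simply walked top-down, splitting into halves, quads, then pairs before emitting ASCII digits:

```julia
# Turn a number of up to eight digits back into its zero-padded ASCII form
# by splitting at each level of the reduction tree.
function digits8(n::Integer)
    hi4, lo4 = divrem(n, 10_000)          # 12345678 -> 1234, 5678
    bytes = UInt8[]
    for quad in (hi4, lo4)
        hi2, lo2 = divrem(quad, 100)      # 1234 -> 12, 34
        for pair in (hi2, lo2)
            d1, d2 = divrem(pair, 10)     # 12 -> 1, 2
            push!(bytes, UInt8('0') + d1, UInt8('0') + d2)
        end
    end
    String(bytes)
end
```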

So, what sort of performance can you expect when the codegen goes to these great lengths? Well, on my machine:

julia> using FastIdentifiers, Chairmarks

julia> @defid JuliaRepoItem ("https://github.com/JuliaLang/julia/",
        :kind(choice("issue", "pull")),
        "/",
        :num(digits(1:6)),
        optional("#issuecomment-",
                 :comment(digits(10))))

julia> r = parse(JuliaRepoItem, "https://github.com/JuliaLang/julia/pull/60526#issuecomment-3756359334")
JuliaRepoItem:https://github.com/JuliaLang/julia/pull/60526#issuecomment-3756359334

julia> r.kind
:pull

julia> r.num
60526

julia> r.comment
3756359334

julia> show(r)
JuliaRepoItem(:pull, 60526, 3756359334)

julia> randitem() = string("https://github.com/JuliaLang/julia/", ifelse(rand() < 0.5, "issue", "pull"), "/", rand(1:999999), if rand() < 0.5 "" else string("#issuecomment-", rand(1000000000:9999999999)) end)
randitem (generic function with 1 method)

julia> @b randitem() s->parse(JuliaRepoItem, s)
5.806 ns

julia> @b print(devnull, $r)
9.464 ns (1 allocs: 96 bytes)

julia> @b string($r)
36.551 ns (4 allocs: 480 bytes)

It is important to restate that PackedParselets.jl only provides the machinery for producing optimised types. FastIdentifiers.jl provides @defid, an interface for identifiers, a checkdigit segment, and out-of-the-box support for JSON.jl and JSON3.jl via StructTypes.jl and StructUtils.jl package extensions.

With this new foundation, I was able to replace the hand-rolled parsers in AcademicIdentifiers.jl and BioIdentifiers.jl almost entirely with @defid statements, with a couple of notable exceptions such as DOIs, which can’t be expressed in PackedParselets’ DSL due to their (near) unbounded size. This led to a bunch of nice savings, the most notable of which in my mind would be transforming the ArXiv parser from ~200 loc of gnarly hand-written code that did custom bit-packing to squeeze old and new forms into a single UInt64, into a 32 loc @defid that fits into 49 bytes (rounded up to 56).

The original hand-rolled implementation
struct ArXiv <: AcademicIdentifier
    meta::UInt32 # It's this or we go over 8 bytes
    number::UInt32
end

function parseid(::Type{ArXiv}, id::SubString)
    _, id = lchopfolded(id, "https://", "http://")
    isweb, id = lchopfolded(id, "arxiv.org/")
    if isweb
        prefixend = findfirst('/', id)
        isnothing(prefixend) && return MalformedIdentifier{ArXiv}(id, "incomplete ArXiv URL")
        id = unsafe_substr(id, prefixend)
    else
        _, id = lchopfolded(id, "arxiv:")
    end
    if occursin('/', id)
        arxiv_old(id)
    else
        arxiv_new(id)
    end
end

arxiv_meta(archive::UInt8, class::UInt8, year::UInt8, month::UInt8, version::UInt16) =
    UInt32(archive) << (32 - 5) +
    UInt32(class) << (32 - 11) +
    UInt32(year) << (32 - 18) +
    UInt32(month) << (32 - 22) +
    version

arxiv_archive(arxiv::ArXiv) = (arxiv.meta >> (32 - 5)) % UInt8
arxiv_class(arxiv::ArXiv) = 0x3f & (arxiv.meta >> (32 - 11)) % UInt8
arxiv_year(arxiv::ArXiv) = 0x7f & (arxiv.meta >> (32 - 18)) % UInt8
arxiv_month(arxiv::ArXiv) = 0x0f & (arxiv.meta >> (32 - 22)) % UInt8
arxiv_version(arxiv::ArXiv) = arxiv.meta % UInt16 & 0x03ff

function arxiv_new(id::AbstractString)
    ncodeunits(id) >= 6 || return MalformedIdentifier{ArXiv}(id, "is too short to be a valid ArXiv identifier")
    bytes = codeunits(id)
    bdigit(b::UInt8) = b ∈ 0x30:0x39
    local year, month
    y1, y2, m1, m2 = @inbounds bytes[1], bytes[2], bytes[3], bytes[4]
    all(bdigit, (y1, y2)) || return MalformedIdentifier{ArXiv}(id, "year component (YYmm.nnnnn) must be an integer")
    all(bdigit, (m1, m2)) || return MalformedIdentifier{ArXiv}(id, "month component (yyMM.nnnnn) must be an integer")
    year = 0xa * (y1 - 0x30) + (y2 - 0x30)
    month = 0xa * (m1 - 0x30) + (m2 - 0x30)
    month ∈ 1:12 || return MalformedIdentifier{ArXiv}(id, "month component (yyMM.nnnnn) must be between 01 and 12")
    (@inbounds bytes[5]) == UInt8('.') || return MalformedIdentifier{ArXiv}(id, "must contain a period separating the date and number component (yymm.nnnnn)")
    i, number, version = 6, zero(UInt32), zero(UInt16)
    @inbounds while i <= ncodeunits(id)
        b = bytes[i]
        i += 1
        if (b | 0x20) == UInt8('v')
            i > ncodeunits(id) && return MalformedIdentifier{ArXiv}(id, "version component must be non-empty")
            break
        elseif bdigit(b)
            number = muladd(number, UInt32(10), b - 0x30)
        else
            return MalformedIdentifier{ArXiv}(id, "number component (yymm.NNNNN) must be an integer")
        end
    end
    number <= UInt32(99999) || return MalformedIdentifier{ArXiv}(id, "number component (yymm.NNNNN) must be no more than 5 digits")
    @inbounds while i <= ncodeunits(id)
        b = bytes[i]
        if bdigit(b)
            version = muladd(version, UInt8(10), b - 0x30)
            iszero(version & ~0x03ff) || return MalformedIdentifier{ArXiv}(id, "version is larger than the maximum supported value (1023)")
        else
            return MalformedIdentifier{ArXiv}(id, "version component must be an integer")
        end
        i += 1
    end
    ArXiv(arxiv_meta(0x00, 0x00, year, month, version), number)
end

const ARXIV_OLD_ARCHIVES, ARXIV_OLD_CLASSES = let
    arxiv_catsubs = (
        "astro-ph" => ["CO", "EP", "GA", "HE", "IM", "SR"],
        "cond-mat" => ["dis-nn", "mes-hall", "mtrl-sci", "other", "quant-gas", "soft", "stat-mech", "str-el", "supr-con"],
        "cs" => ["AI", "AR", "CC", "CE", "CG", "CL", "CR", "CV", "CY", "DB", "DC", "DL", "DM", "DS", "ET",
                 "FL", "GL", "GR", "GT", "HC", "IR", "IT", "LG", "LO", "MA", "MM", "MS", "NA", "NI", "OH",
                 "OS", "PF", "PL", "RO", "SC", "SD", "SE", "SI", "SY"],
        "econ" => ["EM", "GN", "TH"],
        "eess" => ["AS", "IV", "SP", "SY"],
        "gr-qc" => String[],
        "hep-ex" => String[],
        "hep-lat" => String[],
        "hep-ph" => String[],
        "hep-th" => String[],
        "math-ph" => String[],
        "math" => ["AC", "AG", "AP", "AT", "CA", "CO", "CT", "CV", "DG", "DS", "FA", "GM", "GN", "GR", "GT",
                   "HO", "IT", "KT", "LO", "MG", "MP", "NA", "NT", "OA", "OC", "PR", "QA", "RA", "RT", "SG",
                   "SP", "ST",],
        "nlin" => ["AO", "CD", "CG", "PS", "SI"],
        "nucl-ex" => String[],
        "nucl-th" => String[],
        "physics" => ["acc-ph", "ao-ph", "app-ph", "atm-clus", "atom-ph", "bio-ph", "chem-ph", "class-ph",
                      "comp-ph", "data-an", "ed-pn", "flu-dyn", "gen-ph", "geo-ph", "hist-ph", "ins-det",
                      "med-ph", "optics", "plasm-ph", "pop-ph", "soc-ph", "space-ph"],
        "q-bio" => ["BM", "CB", "GN", "MN", "NC", "OT", "PE", "QM", "SC", "TO"],
        "q-fin" => ["CP", "EC", "GN", "MF", "PM", "PR", "RM", "ST", "SR"],
        "quant-ph" => String[],
        "stat" => ["AP", "CO", "ME", "ML", "OT", "TH"])
    map(first, arxiv_catsubs), map(last, arxiv_catsubs)
end

function arxiv_old(id::AbstractString)
    bytes = codeunits(id)
    slashpos = @something(findfirst(==(UInt8('/')), bytes),
                          return MalformedIdentifier{ArXiv}(id, "must contain a slash separating the components (archive.class/YYMMNNN)"))
    archclass = unsafe_substr(id, 0, slashpos - 1)
    numverstr = unsafe_substr(id, slashpos)
    dotpos = something(findfirst(==(UInt8('.')), view(bytes, 1:slashpos)), slashpos)
    archive = unsafe_substr(archclass, 0, dotpos - 1)
    class = unsafe_substr(archclass, dotpos, max(0, slashpos - dotpos - 1))
    archiveidx = findfirst(==(archive), ARXIV_OLD_ARCHIVES)
    isnothing(archiveidx) && return MalformedIdentifier{ArXiv}(id, "does not use a recognised ArXiv archive name")
    classidx = if isempty(class)
        0
    else
        findfirst(==(class), ARXIV_OLD_CLASSES[archiveidx])
    end
    isnothing(classidx) && return MalformedIdentifier{ArXiv}(id, "does not use a recognised ArXiv archive class")
    length(class) ∈ (0, 2) || return MalformedIdentifier{ArXiv}(id, "class component must be 2 characters")
    #--
    ncodeunits(numverstr) >= 5 || return MalformedIdentifier{ArXiv}(id, "is too short to be a valid ArXiv identifier")
    bytes = codeunits(numverstr)
    bdigit(b::UInt8) = b ∈ 0x30:0x39
    local year, month
    y1, y2, m1, m2 = @inbounds bytes[1], bytes[2], bytes[3], bytes[4]
    all(bdigit, (y1, y2)) || return MalformedIdentifier{ArXiv}(id, "year component (YYmmnnnn) must be an integer")
    all(bdigit, (m1, m2)) || return MalformedIdentifier{ArXiv}(id, "month component (yyMMnnnn) must be an integer")
    year = 0xa * (y1 - 0x30) + (y2 - 0x30)
    month = 0xa * (m1 - 0x30) + (m2 - 0x30)
    (year >= 91 || year <= 7) || return MalformedIdentifier{ArXiv}(id, "year component (YYmmnnn) must be between 91 and 07")
    month ∈ 1:12 || return MalformedIdentifier{ArXiv}(id, "month component (yyMMnnnn) must be between 01 and 12")
    i, number, version = 5, zero(UInt32), zero(UInt16)
    @inbounds while i <= ncodeunits(numverstr)
        b = bytes[i]
        i += 1
        if (b | 0x20) == UInt8('v')
            i > ncodeunits(id) && return MalformedIdentifier{ArXiv}(id, "version component must be non-empty")
            break
        elseif bdigit(b)
            number = muladd(number, UInt32(10), b - 0x30)
        else
            return MalformedIdentifier{ArXiv}(id, "number component (yymmNNNN) must be an integer")
        end
    end
    number <= UInt32(9999) || return MalformedIdentifier{ArXiv}(id, "number component (yymmNNNN) must be no more than 4 digits")
    @inbounds while i <= ncodeunits(numverstr)
        b = bytes[i]
        if bdigit(b)
            version = muladd(version, UInt8(10), b - 0x30)
            iszero(version & ~0x03ff) || return MalformedIdentifier{ArXiv}(id, "version is larger than the maximum supported value (1023)")
        else
            return MalformedIdentifier{ArXiv}(id, "version component must be an integer")
        end
        i += 1
    end
    #--
    ArXiv(arxiv_meta(archiveidx % UInt8, classidx % UInt8, year, month, version), number)
end

function shortcode(io::IO, arxiv::ArXiv)
    archid, classid = arxiv_archive(arxiv), arxiv_class(arxiv)
    if !iszero(archid) # Old form
        print(io, ARXIV_OLD_ARCHIVES[archid])
        if !iszero(classid)
            print(io, '.', ARXIV_OLD_CLASSES[archid][classid])
        end
        print(io, '/')
    end
    year, month, ver = arxiv_year(arxiv), arxiv_month(arxiv), arxiv_version(arxiv)
    print(io, lpad(year, 2, '0'), lpad(month, 2, '0'))
    if iszero(archid) # New form
        print(io, '.', lpad(arxiv.number, ifelse(year >= 15, 5, 4), '0'))
    else # Old form
        print(io, lpad(arxiv.number, 3, '0'))
    end
    if ver > 0
        print(io, 'v', ver)
    end
end

idcode(arxiv::ArXiv) =
    UInt64(arxiv.meta & 0xffffff00) << 32 +
    UInt64(arxiv.number) << 8 +
    arxiv_version(arxiv)

purlprefix(::Type{ArXiv}) = "https://arxiv.org/abs/"

function Base.print(io::IO, arxiv::ArXiv)
    get(io, :limit, false) === true && get(io, :compact, false) === true ||
        print(io, "arXiv:")
    shortcode(io, arxiv)
end
The new @defid implementation
@defid(ArXiv <: AcademicIdentifier,
       (choice(:format,
            :new => seq(:year(digits(2, pad=2)),
                        :month(digits(2, min=1, max=12, pad=2)),
                        ".", :num(digits(4:5, pad=4)),
                        optional("v", :ver(digits(max=1023)))),
            :old => seq(choice(:archive,
                :astro_ph => seq("astro-ph.", :class(choice("CO", "EP", "GA", "HE", "IM", "SR"))),
                :cond_mat => seq("cond-mat.", :class(choice("dis-nn", "mes-hall", "mtrl-sci", "other", "quant-gas", "soft", "stat-mech", "str-el", "supr-con"))),
                :cs       => seq("cs.", :class(choice("AI", "AR", "CC", "CE", "CG", "CL", "CR", "CV", "CY", "DB", "DC", "DL", "DM", "DS", "ET", "FL", "GL", "GR", "GT", "HC", "IR", "IT", "LG", "LO", "MA", "MM", "MS", "NA", "NI", "OH", "OS", "PF", "PL", "RO", "SC", "SD", "SE", "SI", "SY"))),
                :econ     => seq("econ.", :class(choice("EM", "GN", "TH"))),
                :eess     => seq("eess.", :class(choice("AS", "IV", "SP", "SY"))),
                :gr_qc    => "gr-qc",
                :hep_ex   => "hep-ex",
                :hep_lat  => "hep-lat",
                :hep_ph   => "hep-ph",
                :hep_th   => "hep-th",
                :math_ph  => "math-ph",
                :math     => seq("math.", :class(choice("AC", "AG", "AP", "AT", "CA", "CO", "CT", "CV", "DG", "DS", "FA", "GM", "GN", "GR", "GT", "HO", "IT", "KT", "LO", "MG", "MP", "NA", "NT", "OA", "OC", "PR", "QA", "RA", "RT", "SG", "SP", "ST"))),
                :nlin     => seq("nlin.", :class(choice("AO", "CD", "CG", "PS", "SI"))),
                :nucl_ex  => "nucl-ex",
                :nucl_th  => "nucl-th",
                :physics  => seq("physics.", :class(choice("acc-ph", "ao-ph", "app-ph", "atm-clus", "atom-ph", "bio-ph", "chem-ph", "class-ph", "comp-ph", "data-an", "ed-pn", "flu-dyn", "gen-ph", "geo-ph", "hist-ph", "ins-det", "med-ph", "optics", "plasm-ph", "pop-ph", "soc-ph", "space-ph"))),
                :q_bio    => seq("q-bio.", :class(choice("BM", "CB", "GN", "MN", "NC", "OT", "PE", "QM", "SC", "TO"))),
                :q_fin    => seq("q-fin.", :class(choice("CP", "EC", "GN", "MF", "PM", "PR", "RM", "ST", "SR"))),
                :quant_ph => "quant-ph",
                :stat     => seq("stat.", :class(choice("AP", "CO", "ME", "ML", "OT", "TH")))),
              "/", :year(digits(2, pad=2, exclude=8:90)),
              :month(digits(2, min=1, max=12, pad=2)),
              :num(digits(3:4, pad=3)),
              optional("v", :ver(digits(max=1023)))))),
       prefix="arXiv:", purlprefix="https://arxiv.org/abs/")

Along with the reduction in loc, the @defid implementation is ~9x faster to boot, parsing ArXiv IDs in 10-15ns :grinning:

While enjoying @defid, I also implemented an About.jl package extension that shows you how values are represented/packed, e.g.

along with a StyledStrings extension that’s used for 3-arg show, and error display:

AcademicIdentifiers.jl uses all of this to provide the following identifiers:

  • ArXiv: arXiv preprint identifiers
  • DOI: Digital Object Identifiers
  • EAN13: European Article Numbers (13-digit barcodes)
  • ISBN: International Standard Book Numbers
  • ISNI: International Standard Name Identifiers
  • ISSN: International Standard Serial Numbers
  • OCN: OCLC Control Numbers
  • OpenAlexID: OpenAlex entity identifiers
  • ORCID: Open Researcher and Contributor Identifiers
  • PMCID: PubMed Central Identifiers
  • PMID: PubMed Identifiers
  • RAiD: Research Activity Identifiers
  • ROR: Research Organization Registry identifiers
  • VIAF: Virtual International Authority File identifiers
  • Wikidata: Wikidata entity identifiers

Notably, ISBN also includes hyphenation rules:

julia> parse(ISBN, "9781718502765")
ISBN:978-1-7185-0276-5

This uses one of the tricks I’m rather enjoying as of late, data .jl files that can be executed as self-updating shell scripts, see: isbn-hyphenation.jl.
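The executable-.jl trick rests on a well-known sh/Julia polyglot header; here’s a minimal sketch (the EXAMPLE_RANGES payload is hypothetical, not the real ISBN data):

```julia
#!/bin/sh
#=
exec julia --startup-file=no "$0" "$@"
=#
# As a shell script, the `#=` line is a comment and `exec` re-runs the file
# under julia (where it can fetch fresh data and rewrite itself); as a Julia
# file, `#= ... =#` is a block comment and `include` just loads the data.
const EXAMPLE_RANGES = Dict("978-0" => 0:19, "978-1" => 0:9)  # hypothetical payload
```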

Over in biology land, we provide a litany of identifiers you might encounter:

Protein & Structure

  • AFDB: AlphaFold Database protein structure predictions
  • PDB: Protein Data Bank macromolecular structures
  • UniProt: Universal Protein Resource sequences
  • UniRef: UniProt Reference Clusters
  • IntAct: Molecular interaction database
  • InterPro: Protein family/domain classification
  • Pfam: Protein family HMM profiles
  • PXD: ProteomeXchange dataset identifiers

Genomics & Genetics

  • ENSG, ENST, ENSP, ENSE, ENSR, ENSF, ENSFM: Ensembl genome annotations
  • NCBIGene: NCBI Entrez Gene database
  • RefSeq: NCBI Reference Sequence database
  • HGNC: HUGO Gene Nomenclature Committee
  • INSDC: International Nucleotide Sequence Database Collaboration
  • OMIM: Online Mendelian Inheritance in Man

Variation & Clinical

  • CA: ClinGen Allele Registry
  • ClinVar: Clinical genomic variant database
  • dbSNP: Single Nucleotide Polymorphism database
  • dbVar: Genomic structural variation database
  • GWAS: NHGRI-EBI GWAS Catalog studies

Expression & Functional Genomics

  • ArrayExpress: Functional genomics data archive
  • GEO: Gene Expression Omnibus

Studies & Samples

  • BioProject: Biological research projects
  • BioSample: Biological sample metadata
  • ClinicalTrials: ClinicalTrials.gov registry
  • dbGaP: Database of Genotypes and Phenotypes
  • EGA: European Genome-phenome Archive
  • RRID: Research Resource Identifiers
  • SRA: Sequence Read Archive

Ontologies & Controlled Vocabularies

  • CL: Cell Ontology
  • DOID: Disease Ontology
  • ECO: Evidence & Conclusion Ontology
  • EFO: Experimental Factor Ontology
  • GO: Gene Ontology
  • HPO: Human Phenotype Ontology
  • MeSH: Medical Subject Headings
  • MONDO: Monarch Disease Ontology
  • MP: Mammalian Phenotype Ontology
  • PATO: Phenotype And Trait Ontology
  • SO: Sequence Ontology
  • UBERON: Uber-anatomy Ontology

Chemical & Metabolic

  • ChEBI: Chemical Entities of Biological Interest
  • ChEMBL: Bioactive compound database
  • DrugBank: Drug and pharmaceutical database
  • HMDB: Human Metabolome Database
  • KEGG: Kyoto Encyclopedia of Genes and Genomes
  • MetaboLights: Metabolomics study archive
  • PubChem: Chemical compound/substance/assay database

Networks & Interactions

  • BioGRID: Biological interaction datasets
  • Reactome: Curated biological pathways
  • WikiPathways: Community pathway database

Cell Lines & Model Organisms

  • Cellosaurus: Cell line registry
  • FlyBase: Drosophila gene database
  • MGI: Mouse Genome Informatics
  • NCBITaxon: NCBI Taxonomy database
  • SGD: Saccharomyces Genome Database
  • WormBase: Caenorhabditis gene database

Showing off

For fun, a look at current parsing/printing performance: for fixed bases, PackedParselets parses and prints numbers faster than Base.

julia> @b string(rand(UInt64)) s->parse(UInt64, s)
32.227 ns

julia> @b rand(UInt64) s->print(devnull, s)
20.071 ns (2 allocs: 80 bytes)

julia> @defid MyU64 digits(UInt64)

julia> @b string(rand(UInt64)) s->parse(MyU64, s)
8.401 ns

julia> @b reinterpret(MyU64, rand(UInt64)) s->print(devnull, s)
7.526 ns (1 allocs: 48 bytes)