[ANN] A quartet of identifier-related parsing packages: PackedParselets.jl, FastIdentifiers.jl, AcademicIdentifiers.jl, BioIdentifiers.jl

Package Quartet: Structured and flexible Identifier types, quick to parse and print

A generic approach to time and space efficient structured representations, applied to academic and biological identifiers

or, alternatively:

How frustration with identifiers led to a descent down the rabbit hole of optimised parsing: a personal saga

Installing

Estimated time of registration: 2026-04-08T02:07:00Z. In the meantime, you’ll want to:

pkg> add http://code.tecosaur.net/tec/PackedParselets.jl.git
pkg> add http://code.tecosaur.net/tec/FastIdentifiers.jl.git
pkg> add http://code.tecosaur.net/tec/{Academic,Bio}Identifiers.jl.git
The .git suffix is needed because LibGit2 doesn't use the headers that Git does when cloning, meaning I can't create Cloudflare rules that easily distinguish HTTP git traffic from AI scrapers trying to give my forge the hug of death.

Backstory

This set of packages has been a while in the making. About two years ago I started getting fed up with the finicky details of working with many biological identifiers across datasets. This involved:

  • Realising that string-based comparisons were both slow and fragile to stylistic differences (like upper/lower case)
  • Writing quick ad-hoc parsers to turn identifier forms like HP:0100584 from strings into numbers, making comparisons quicker in large table joins
  • Needing to adjust my ad-hoc parsers to handle minor format variations, case differences, etc.
  • Forgetting whether an identifier needs to be printed with a certain number of digits or not, so that I can output the right form
  • The annoyance of trying to compare identifiers like ENSG00000139618.11 and ENSG00000139618.13, where I want to ignore the optional .version suffix
  • Accidentally discovering that my ad-hoc parser only works for one of multiple valid forms
  • Failing to detect actually malformed identifiers, and either experiencing strange errors or (worse) silent misbehaviour

So, I wrote a small package for parsing a bunch of bibliometric identifiers (the annoyance I was experiencing at the time my level of frustration crossed the “make a package” threshold) called AcademicIdentifiers.jl. I then wanted something similar for biological identifiers, which led to the creation of an interface + common utilities package, née AbstractIdentifiers.jl.

I then noticed that a few things like simply parsing integer values within identifiers took longer than expected. This led to a discussion on Julia’s Zulip around how Base’s parsing is distinctly sub-optimal with @jakobnissen (starting off with DMs on Slack), and then a PR to Base improving one aspect of the situation:

The descent into this rabbit hole started off with a simple fastparse function that was ~5x faster than Base’s tryparse (3.7ns vs 18ns to parse "000789012" on my machine):

fastparse implementation
function fastparse(::Type{Union{X, I}}, str::AbstractString, base::Integer = 10) where {X <: Union{Symbol, Nothing}, I <: Integer}
    num = zero(I)
    bytes = codeunits(str)
    isempty(bytes) && return if X == Symbol; :empty end
    i, negative = if I <: Signed && (@inbounds first(bytes)) == UInt8('-')
        2, true
    else
        1, false
    end
    digit = zero(if I <: Signed Int8 else UInt8 end)
    # NOTE: Don't ask me why, but it turns out that `while` is
    # considerably faster than `for` here (~7ns vs ~4ns).
    @inbounds while true
        b = bytes[i]
        digit = if b ∈ UInt8('0'):(UInt8('0') - 0x1 + min(base, 10) % UInt8)
            b - UInt8('0')
        elseif 10 < base <= 36 && (b | 0x20) ∈ UInt8('a'):(UInt8('a') - 0x1 + (base - 10) % UInt8)
            (b | 0x20) - (UInt8('a') - UInt8(10))
        elseif base > 36 && b ∈ UInt8('A'):(UInt8('z') - UInt8(62) + base % UInt)
            b - (UInt8('A') - UInt8(10)) - ifelse(b >= UInt8('a'), 0x06, 0x00)
        else
            return if X == Symbol; :invalid end
        end % if I <: Signed Int8 else UInt8 end
        numnext = muladd(widen(num), base % I, digit)
        iszero(numnext >> (8 * sizeof(I) - (I <: Signed && !negative))) || return if X == Symbol; :overflow end
        I <: Signed && negative && i == length(bytes) && break
        num = numnext % I
        i == length(bytes) && break
        i += 1
    end
    if I <: Signed && negative
        muladd(num, -(base % I), -digit)
    else
        num
    end
end

I also started looking into various writings on SWAR (SIMD within a register) parsing, reading the blogs of people like Daniel Lemire (creator of simdjson).

While reading about various methods of quickly checking and parsing integers, I wrote ~45 fast parsers for academic and biological identifiers, using a few helper functions like fastparse and chopprefixes (a version of chopprefix that’s optimised for multiple chops with ASCII case-folding).

Implementing all these parsers, I found myself using a few patterns again and again, which activated my long-standing distaste for boilerplate and love of macros. Next thing I knew, I’d started work on a DSL for parsing identifier-like strings.

I had far more ideas for how to do this well than time to spend on it, so the project sat about a third implemented for around a year, until Opus 4.6 was released. With a clear idea of what I wanted to build, and my existing code as a model for the approach/architecture, I was able to drive Opus 4.6 over the past few months to implement and test the rest of the approaches I had in mind. The end result was a library worth spinning off from identifier parsing, one I think I can be ambitious enough to say solves the problem of how to efficiently and compactly parse short identifiers: PackedParselets.jl.

Oh, and then I renamed AbstractIdentifiers.jl to FastIdentifiers.jl :rocket: which now serves as a thin convenience over PackedParselets + an abstract/interface type, and reimplemented AcademicIdentifiers.jl and BioIdentifiers.jl to use it instead of hand-rolling parsers.

Capabilities

PackedParselets.jl is intended for package authors, not end-users. It defines a sexpr-like DSL that simultaneously defines the parsing and printing of a bitpacked type, e.g.

("https://github.com/JuliaLang/julia/",
 :kind(choice("issue", "pull")),
 "/",
 :num(digits(1:6)),
 optional("#issuecomment-",
          :comment(digits(10))))

The available segments are:

  • "<string>" for a string you want to match
  • optional(...) for content that may be present
  • choice(...) for exclusive values, which may be simple strings or entire sub-sequences
  • skip("<string>", choice("<strings>", ...), ...) for content that should be skipped
  • :<name> for declaring properties
  • digits(n | lo:hi) for a sequence of digits
  • letters(n | lo:hi) for a sequence of letters (charset convenience)
  • alphnum(n | lo:hi) for a sequence of digits or letters (charset convenience)
  • hex(n | lo:hi) for a sequence of hexadecimal characters (charset convenience)
  • charset(n | lo:hi, <characters...>) for a sequence of the given characters
  • embed another PackedParselets-defined type
  • any custom segments your package wants to add

Let’s use this as a case study for how parsing and printing operates.

You might expect that we check for https://github.com/JuliaLang/julia/ using startswith, but we can do much better than that. By working on the input string’s codeunits, and checking the length up front to make sure there’s enough content, we can check for that prefix with five masked comparisons:

Base.unsafe_load(Ptr{UInt64}(pointer(data, pos))) & 0xffffffdfdfdfdfdf == 0x2f2f3a5350545448 &&
    Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 8))) & 0xdfffdfdfdfdfdfdf == 0x432e425548544947 &&
    Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 16))) & 0xdfdfdfdfdfffdfdf == 0x41494c554a2f4d4f &&
    Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 24))) & 0xdfdfdfffdfdfdfdf == 0x4c554a2f474e414c &&
    Base.unsafe_load(Ptr{UInt64}(pointer(data, pos + 32))) & 0x0000000000ffdfdf == 0x00000000002f4149
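As a standalone illustration of the same trick (my own sketch, not PackedParselets’ generated code; masked_prefix_https is a made-up name), here’s the first of those comparisons as a runnable function. The 0xdf bytes in the mask clear the ASCII case bit on the letter positions, so lower- and upper-case inputs both match.

```julia
# Sketch: case-insensitively match the 8-byte prefix "https://" with a
# single masked UInt64 comparison.
function masked_prefix_https(s::AbstractString)
    bytes = codeunits(s)
    length(bytes) >= 8 || return false
    word = zero(UInt64)
    for i in 8:-1:1                 # assemble the first 8 bytes, little-endian
        word = (word << 8) | bytes[i]
    end
    # 0xdf clears the ASCII case bit (0x20) on the five letter bytes "https";
    # 0xff keeps ":" and the two "/" exact. 0x2f2f3a5350545448 is "HTTPS://"
    # read little-endian.
    word & 0xffffffdfdfdfdfdf == 0x2f2f3a5350545448
end
```

The generated code shown above avoids the byte-assembly loop entirely by `unsafe_load`ing directly from the string’s memory.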

We can then use the fact that Int('i') % 2 and Int('p') % 2 differ by one to jump to a similar comparison to check for "issue" / "pull". We do a bit of light brute-force perfect hashing to look for a way to jump straight from the possible valid inputs to the verification that the entire value matches.
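In miniature, that dispatch looks something like this (a toy sketch with a hypothetical kind_of name, not the generated code):

```julia
# The low bit of the first byte separates "pull" ('p' = 0x70, even) from
# "issue" ('i' = 0x69, odd): jump straight to the candidate via that bit,
# then verify the whole word actually matches.
function kind_of(s::AbstractString)
    isempty(s) && return nothing
    idx = first(codeunits(s)) & 0x01
    expected = ("pull", "issue")[idx + 1]
    startswith(s, expected) ? Symbol(expected) : nothing
end
```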

After "/", we need to parse the issue/pr number, which may be 1 to 6 digits. Because we have already matched https://...{issue,pull}/, we know there are at least seven preceding bytes, so we can jump forward six bytes (stopping early if we hit the end of the input) and then load the preceding eight bytes into a UInt64. We can then count the number of 0-9 digits at the end, bit-mask + bit-shift the UInt64 accordingly, and parse the entire number at once with SWAR.

attr_num_avail = min(6, (nbytes - pos) + 1)
(nbytes - pos) + 1 >= 1 || return (4, pos)
attr_num_swar = htol(Base.unsafe_load(Ptr{UInt64}(pointer(data, (pos + attr_num_avail) - 8))))
attr_num_swar >>>= (8 - attr_num_avail) << 3
nondig_158 = (attr_num_swar & (attr_num_swar + 0x0606060606060606)) & 0xf0f0f0f0f0f0f0f0 ⊻ 0x3030303030303030
nondig_158 |= 0xffff000000000000
attr_num_count = trailing_zeros(((nondig_158 - 0x0101010101010101) & ~nondig_158) & 0x8080808080808080 ⊻ 0x8080808080808080) >> 3
!(iszero(attr_num_count)) || return (4, pos)
attr_num_swar <<= (8 - attr_num_count) << 3
attr_num_swar &= 0x0f0f0f0f0f0f0000
attr_num_swar = (attr_num_swar * 0x0000000000000a01) >> 8 & 0x00ff00ff00ff00ff
attr_num_swar = (attr_num_swar * 0x0000000000640001) >> 16 & 0x0000ffff0000ffff
attr_num_swar = (attr_num_swar * 0x0000271000000001) >> 32
attr_num_num = attr_num_swar % UInt32

One way of looking at this is as binary reduction:

  1   2   3   4   5   6   7   8
  ╰─┬─╯   ╰─┬─╯   ╰─┬─╯   ╰─┬─╯    ×2561, >>8
   12      34      56      78
   ╰───┬───╯        ╰───┬───╯      ×6553601, >>16
      1234            5678
       ╰───────┬────────╯          ×..., >>32
           12345678
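The same reduction in runnable form (my own standalone sketch, assuming exactly eight ASCII digits; the generated code above additionally handles the masking and shifting needed for fewer digits):

```julia
# SWAR parse of eight ASCII digits: mask to digit values, then combine
# pairs, quads, and halves with three multiply-shift steps.
function parse8digits(s::AbstractString)
    bytes = codeunits(s)
    @assert length(bytes) == 8
    word = zero(UInt64)
    for i in 8:-1:1                    # little-endian load: first digit in lowest byte
        word = (word << 8) | bytes[i]
    end
    word &= 0x0f0f0f0f0f0f0f0f                                      # '0'-'9' -> 0-9
    word = (word * 0x0000000000000a01) >> 8  & 0x00ff00ff00ff00ff   # 1,2 -> 12
    word = (word * 0x0000000000640001) >> 16 & 0x0000ffff0000ffff   # 12,34 -> 1234
    word = (word * 0x0000271000000001) >> 32                        # 1234,5678 -> 12345678
    word % UInt32
end
```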

After checking for "#issuecomment-", we take a similar approach with the ten digits, loading the input into a UInt64 and a UInt16. The loading strategy, among other details, is described in the docs. For instance, here’s how we’d grab five bytes at once depending on what we know about the larger string they must be situated within:

                forward u64
backward u64 ╭───────────┴────╮
  ╭───────┴──┼─────────╮      │
  ░ ░ ░ ░ ░ ░ a b c d e · · · ·
  ╰──parsed──╯╰─target─╯
              ╰──┬───╯╰╯
                u32   u8
                 exact

The issue/pr kind, number, and optional comment number are all packed into a 56-bit type.

Printing/stringification essentially mirrors parsing, just in reverse. At this point, I don’t think there’s much scope to further optimise parsing or printing without AVX intrinsics.
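To sketch what “parsing in reverse” means here (my own illustration using plain divrem rather than SWAR; digits8 is a made-up name), the reduction tree from earlier is simply walked top-down, splitting into halves, quads, then pairs before emitting ASCII digits:

```julia
# Turn a number of up to eight digits back into its zero-padded ASCII form
# by splitting at each level of the reduction tree.
function digits8(n::Integer)
    hi4, lo4 = divrem(n, 10_000)          # 12345678 -> 1234, 5678
    bytes = UInt8[]
    for quad in (hi4, lo4)
        hi2, lo2 = divrem(quad, 100)      # 1234 -> 12, 34
        for pair in (hi2, lo2)
            d1, d2 = divrem(pair, 10)     # 12 -> 1, 2
            push!(bytes, UInt8('0') + d1, UInt8('0') + d2)
        end
    end
    String(bytes)
end
```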

So, what sort of performance can you expect when the codegen goes to these great lengths? Well, on my machine:

julia> using FastIdentifiers, Chairmarks

julia> @defid JuliaRepoItem ("https://github.com/JuliaLang/julia/",
        :kind(choice("issue", "pull")),
        "/",
        :num(digits(1:6)),
        optional("#issuecomment-",
                 :comment(digits(10))))

julia> r = parse(JuliaRepoItem, "https://github.com/JuliaLang/julia/pull/60526#issuecomment-3756359334")
JuliaRepoItem:https://github.com/JuliaLang/julia/pull/60526#issuecomment-3756359334

julia> r.kind
:pull

julia> r.num
60526

julia> r.comment
3756359334

julia> show(r)
JuliaRepoItem(:pull, 60526, 3756359334)

julia> randitem() = string("https://github.com/JuliaLang/julia/", ifelse(rand() < 0.5, "issue", "pull"), "/", rand(1:999999), if rand() < 0.5 "" else string("#issuecomment-", rand(1000000000:9999999999)) end)
randitem (generic function with 1 method)

julia> @b randitem() s->parse(JuliaRepoItem, s)
5.806 ns

julia> @b print(devnull, $r)
9.464 ns (1 allocs: 96 bytes)

julia> @b string($r)
36.551 ns (4 allocs: 480 bytes)

It is important to restate that PackedParselets.jl only provides the machinery for producing optimised types. FastIdentifiers.jl provides @defid, an interface for identifiers, a checkdigit segment, and out-of-the-box support for JSON.jl and JSON3.jl via StructTypes.jl and StructUtils.jl package extensions.

With this new foundation, I was able to replace the hand-rolled parsers in AcademicIdentifiers.jl and BioIdentifiers.jl almost entirely with @defid statements, with a couple of notable exceptions such as DOIs, which can’t be expressed in PackedParselets’ DSL due to their (near) unbounded size. This led to a bunch of nice savings, the most notable of which in my mind would be transforming the ArXiv parser from ~200 loc of gnarly hand-written code that did custom bit-packing to squeeze old and new forms into a single UInt64, into a 32 loc @defid that fits into 49 bytes (rounded up to 56).

The original hand-rolled implementation
struct ArXiv <: AcademicIdentifier
    meta::UInt32 # It's this or we go over 8 bytes
    number::UInt32
end

function parseid(::Type{ArXiv}, id::SubString)
    _, id = lchopfolded(id, "https://", "http://")
    isweb, id = lchopfolded(id, "arxiv.org/")
    if isweb
        prefixend = findfirst('/', id)
        isnothing(prefixend) && return MalformedIdentifier{ArXiv}(id, "incomplete ArXiv URL")
        id = unsafe_substr(id, prefixend)
    else
        _, id = lchopfolded(id, "arxiv:")
    end
    if occursin('/', id)
        arxiv_old(id)
    else
        arxiv_new(id)
    end
end

arxiv_meta(archive::UInt8, class::UInt8, year::UInt8, month::UInt8, version::UInt16) =
    UInt32(archive) << (32 - 5) +
    UInt32(class) << (32 - 11) +
    UInt32(year) << (32 - 18) +
    UInt32(month) << (32 - 22) +
    version

arxiv_archive(arxiv::ArXiv) = (arxiv.meta >> (32 - 5)) % UInt8
arxiv_class(arxiv::ArXiv) = 0x3f & (arxiv.meta >> (32 - 11)) % UInt8
arxiv_year(arxiv::ArXiv) = 0x7f & (arxiv.meta >> (32 - 18)) % UInt8
arxiv_month(arxiv::ArXiv) = 0x0f & (arxiv.meta >> (32 - 22)) % UInt8
arxiv_version(arxiv::ArXiv) = arxiv.meta % UInt16 & 0x03ff

function arxiv_new(id::AbstractString)
    ncodeunits(id) >= 6 || return MalformedIdentifier{ArXiv}(id, "is too short to be a valid ArXiv identifier")
    bytes = codeunits(id)
    bdigit(b::UInt8) = b ∈ 0x30:0x39
    local year, month
    y1, y2, m1, m2 = @inbounds bytes[1], bytes[2], bytes[3], bytes[4]
    all(bdigit, (y1, y2)) || return MalformedIdentifier{ArXiv}(id, "year component (YYmm.nnnnn) must be an integer")
    all(bdigit, (m1, m2)) || return MalformedIdentifier{ArXiv}(id, "month component (yyMM.nnnnn) must be an integer")
    year = 0xa * (y1 - 0x30) + (y2 - 0x30)
    month = 0xa * (m1 - 0x30) + (m2 - 0x30)
    month ∈ 1:12 || return MalformedIdentifier{ArXiv}(id, "month component (yyMM.nnnnn) must be between 01 and 12")
    (@inbounds bytes[5]) == UInt8('.') || return MalformedIdentifier{ArXiv}(id, "must contain a period separating the date and number component (yymm.nnnnn)")
    i, number, version = 6, zero(UInt32), zero(UInt16)
    @inbounds while i <= ncodeunits(id)
        b = bytes[i]
        i += 1
        if (b | 0x20) == UInt8('v')
            i > ncodeunits(id) && return MalformedIdentifier{ArXiv}(id, "version component must be non-empty")
            break
        elseif bdigit(b)
            number = muladd(number, UInt32(10), b - 0x30)
        else
            return MalformedIdentifier{ArXiv}(id, "number component (yymm.NNNNN) must be an integer")
        end
    end
    number <= UInt32(99999) || return MalformedIdentifier{ArXiv}(id, "number component (yymm.NNNNN) must be no more than 5 digits")
    @inbounds while i <= ncodeunits(id)
        b = bytes[i]
        if bdigit(b)
            version = muladd(version, UInt8(10), b - 0x30)
            iszero(version & ~0x03ff) || return MalformedIdentifier{ArXiv}(id, "version is larger than the maximum supported value (1023)")
        else
            return MalformedIdentifier{ArXiv}(id, "version component must be an integer")
        end
        i += 1
    end
    ArXiv(arxiv_meta(0x00, 0x00, year, month, version), number)
end

const ARXIV_OLD_ARCHIVES, ARXIV_OLD_CLASSES = let
    arxiv_catsubs = (
        "astro-ph" => ["CO", "EP", "GA", "HE", "IM", "SR"],
        "cond-mat" => ["dis-nn", "mes-hall", "mtrl-sci", "other", "quant-gas", "soft", "stat-mech", "str-el", "supr-con"],
        "cs" => ["AI", "AR", "CC", "CE", "CG", "CL", "CR", "CV", "CY", "DB", "DC", "DL", "DM", "DS", "ET",
                 "FL", "GL", "GR", "GT", "HC", "IR", "IT", "LG", "LO", "MA", "MM", "MS", "NA", "NI", "OH",
                 "OS", "PF", "PL", "RO", "SC", "SD", "SE", "SI", "SY"],
        "econ" => ["EM", "GN", "TH"],
        "eess" => ["AS", "IV", "SP", "SY"],
        "gr-qc" => String[],
        "hep-ex" => String[],
        "hep-lat" => String[],
        "hep-ph" => String[],
        "hep-th" => String[],
        "math-ph" => String[],
        "math" => ["AC", "AG", "AP", "AT", "CA", "CO", "CT", "CV", "DG", "DS", "FA", "GM", "GN", "GR", "GT",
                   "HO", "IT", "KT", "LO", "MG", "MP", "NA", "NT", "OA", "OC", "PR", "QA", "RA", "RT", "SG",
                   "SP", "ST",],
        "nlin" => ["AO", "CD", "CG", "PS", "SI"],
        "nucl-ex" => String[],
        "nucl-th" => String[],
        "physics" => ["acc-ph", "ao-ph", "app-ph", "atm-clus", "atom-ph", "bio-ph", "chem-ph", "class-ph",
                      "comp-ph", "data-an", "ed-pn", "flu-dyn", "gen-ph", "geo-ph", "hist-ph", "ins-det",
                      "med-ph", "optics", "plasm-ph", "pop-ph", "soc-ph", "space-ph"],
        "q-bio" => ["BM", "CB", "GN", "MN", "NC", "OT", "PE", "QM", "SC", "TO"],
        "q-fin" => ["CP", "EC", "GN", "MF", "PM", "PR", "RM", "ST", "SR"],
        "quant-ph" => String[],
        "stat" => ["AP", "CO", "ME", "ML", "OT", "TH"])
    map(first, arxiv_catsubs), map(last, arxiv_catsubs)
end

function arxiv_old(id::AbstractString)
    bytes = codeunits(id)
    slashpos = @something(findfirst(==(UInt8('/')), bytes),
                          return MalformedIdentifier{ArXiv}(id, "must contain a slash separating the components (archive.class/YYMMNNN)"))
    archclass = unsafe_substr(id, 0, slashpos - 1)
    numverstr = unsafe_substr(id, slashpos)
    dotpos = something(findfirst(==(UInt8('.')), view(bytes, 1:slashpos)), slashpos)
    archive = unsafe_substr(archclass, 0, dotpos - 1)
    class = unsafe_substr(archclass, dotpos, max(0, slashpos - dotpos - 1))
    archiveidx = findfirst(==(archive), ARXIV_OLD_ARCHIVES)
    isnothing(archiveidx) && return MalformedIdentifier{ArXiv}(id, "does not use a recognised ArXiv archive name")
    classidx = if isempty(class)
        0
    else
        findfirst(==(class), ARXIV_OLD_CLASSES[archiveidx])
    end
    isnothing(classidx) && return MalformedIdentifier{ArXiv}(id, "does not use a recognised ArXiv archive class")
    length(class) ∈ (0, 2) || return MalformedIdentifier{ArXiv}(id, "class component must be 2 characters")
    #--
    ncodeunits(numverstr) >= 5 || return MalformedIdentifier{ArXiv}(id, "is too short to be a valid ArXiv identifier")
    bytes = codeunits(numverstr)
    bdigit(b::UInt8) = b ∈ 0x30:0x39
    local year, month
    y1, y2, m1, m2 = @inbounds bytes[1], bytes[2], bytes[3], bytes[4]
    all(bdigit, (y1, y2)) || return MalformedIdentifier{ArXiv}(id, "year component (YYmmnnnn) must be an integer")
    all(bdigit, (m1, m2)) || return MalformedIdentifier{ArXiv}(id, "month component (yyMMnnnn) must be an integer")
    year = 0xa * (y1 - 0x30) + (y2 - 0x30)
    month = 0xa * (m1 - 0x30) + (m2 - 0x30)
    (year >= 91 || year <= 7) || return MalformedIdentifier{ArXiv}(id, "year component (YYmmnnn) must be between 91 and 07")
    month ∈ 1:12 || return MalformedIdentifier{ArXiv}(id, "month component (yyMMnnnn) must be between 01 and 12")
    i, number, version = 5, zero(UInt32), zero(UInt16)
    @inbounds while i <= ncodeunits(numverstr)
        b = bytes[i]
        i += 1
        if (b | 0x20) == UInt8('v')
            i > ncodeunits(id) && return MalformedIdentifier{ArXiv}(id, "version component must be non-empty")
            break
        elseif bdigit(b)
            number = muladd(number, UInt32(10), b - 0x30)
        else
            return MalformedIdentifier{ArXiv}(id, "number component (yymmNNNN) must be an integer")
        end
    end
    number <= UInt32(9999) || return MalformedIdentifier{ArXiv}(id, "number component (yymmNNNN) must be no more than 4 digits")
    @inbounds while i <= ncodeunits(numverstr)
        b = bytes[i]
        if bdigit(b)
            version = muladd(version, UInt8(10), b - 0x30)
            iszero(version & ~0x03ff) || return MalformedIdentifier{ArXiv}(id, "version is larger than the maximum supported value (1023)")
        else
            return MalformedIdentifier{ArXiv}(id, "version component must be an integer")
        end
        i += 1
    end
    #--
    ArXiv(arxiv_meta(archiveidx % UInt8, classidx % UInt8, year, month, version), number)
end

function shortcode(io::IO, arxiv::ArXiv)
    archid, classid = arxiv_archive(arxiv), arxiv_class(arxiv)
    if !iszero(archid) # Old form
        print(io, ARXIV_OLD_ARCHIVES[archid])
        if !iszero(classid)
            print(io, '.', ARXIV_OLD_CLASSES[archid][classid])
        end
        print(io, '/')
    end
    year, month, ver = arxiv_year(arxiv), arxiv_month(arxiv), arxiv_version(arxiv)
    print(io, lpad(year, 2, '0'), lpad(month, 2, '0'))
    if iszero(archid) # New form
        print(io, '.', lpad(arxiv.number, ifelse(year >= 15, 5, 4), '0'))
    else # Old form
        print(io, lpad(arxiv.number, 3, '0'))
    end
    if ver > 0
        print(io, 'v', ver)
    end
end

idcode(arxiv::ArXiv) =
    UInt64(arxiv.meta & 0xffffff00) << 32 +
    UInt64(arxiv.number) << 8 +
    arxiv_version(arxiv)

purlprefix(::Type{ArXiv}) = "https://arxiv.org/abs/"

function Base.print(io::IO, arxiv::ArXiv)
    get(io, :limit, false) === true && get(io, :compact, false) === true ||
        print(io, "arXiv:")
    shortcode(io, arxiv)
end
The new @defid implementation
@defid(ArXiv <: AcademicIdentifier,
       (choice(:format,
            :new => seq(:year(digits(2, pad=2)),
                        :month(digits(2, min=1, max=12, pad=2)),
                        ".", :num(digits(4:5, pad=4)),
                        optional("v", :ver(digits(max=1023)))),
            :old => seq(choice(:archive,
                :astro_ph => seq("astro-ph.", :class(choice("CO", "EP", "GA", "HE", "IM", "SR"))),
                :cond_mat => seq("cond-mat.", :class(choice("dis-nn", "mes-hall", "mtrl-sci", "other", "quant-gas", "soft", "stat-mech", "str-el", "supr-con"))),
                :cs       => seq("cs.", :class(choice("AI", "AR", "CC", "CE", "CG", "CL", "CR", "CV", "CY", "DB", "DC", "DL", "DM", "DS", "ET", "FL", "GL", "GR", "GT", "HC", "IR", "IT", "LG", "LO", "MA", "MM", "MS", "NA", "NI", "OH", "OS", "PF", "PL", "RO", "SC", "SD", "SE", "SI", "SY"))),
                :econ     => seq("econ.", :class(choice("EM", "GN", "TH"))),
                :eess     => seq("eess.", :class(choice("AS", "IV", "SP", "SY"))),
                :gr_qc    => "gr-qc",
                :hep_ex   => "hep-ex",
                :hep_lat  => "hep-lat",
                :hep_ph   => "hep-ph",
                :hep_th   => "hep-th",
                :math_ph  => "math-ph",
                :math     => seq("math.", :class(choice("AC", "AG", "AP", "AT", "CA", "CO", "CT", "CV", "DG", "DS", "FA", "GM", "GN", "GR", "GT", "HO", "IT", "KT", "LO", "MG", "MP", "NA", "NT", "OA", "OC", "PR", "QA", "RA", "RT", "SG", "SP", "ST"))),
                :nlin     => seq("nlin.", :class(choice("AO", "CD", "CG", "PS", "SI"))),
                :nucl_ex  => "nucl-ex",
                :nucl_th  => "nucl-th",
                :physics  => seq("physics.", :class(choice("acc-ph", "ao-ph", "app-ph", "atm-clus", "atom-ph", "bio-ph", "chem-ph", "class-ph", "comp-ph", "data-an", "ed-pn", "flu-dyn", "gen-ph", "geo-ph", "hist-ph", "ins-det", "med-ph", "optics", "plasm-ph", "pop-ph", "soc-ph", "space-ph"))),
                :q_bio    => seq("q-bio.", :class(choice("BM", "CB", "GN", "MN", "NC", "OT", "PE", "QM", "SC", "TO"))),
                :q_fin    => seq("q-fin.", :class(choice("CP", "EC", "GN", "MF", "PM", "PR", "RM", "ST", "SR"))),
                :quant_ph => "quant-ph",
                :stat     => seq("stat.", :class(choice("AP", "CO", "ME", "ML", "OT", "TH")))),
              "/", :year(digits(2, pad=2, exclude=8:90)),
              :month(digits(2, min=1, max=12, pad=2)),
              :num(digits(3:4, pad=3)),
              optional("v", :ver(digits(max=1023)))))),
       prefix="arXiv:", purlprefix="https://arxiv.org/abs/")

Along with the reduction in loc, the @defid implementation is ~9x faster to boot, parsing ArXiv IDs in 10-15ns :grinning:

While enjoying @defid, I also implemented an About.jl package extension that shows you how values are represented/packed, e.g.

along with a StyledStrings extension that’s used for 3-arg show, and error display:

AcademicIdentifiers.jl uses all of this to provide the following identifiers:

  • ArXiv: arXiv preprint identifiers
  • DOI: Digital Object Identifiers
  • EAN13: European Article Numbers (13-digit barcodes)
  • ISBN: International Standard Book Numbers
  • ISNI: International Standard Name Identifiers
  • ISSN: International Standard Serial Numbers
  • OCN: OCLC Control Numbers
  • OpenAlexID: OpenAlex entity identifiers
  • ORCID: Open Researcher and Contributor Identifiers
  • PMCID: PubMed Central Identifiers
  • PMID: PubMed Identifiers
  • RAiD: Research Activity Identifiers
  • ROR: Research Organization Registry identifiers
  • VIAF: Virtual International Authority File identifiers
  • Wikidata: Wikidata entity identifiers

Notably, ISBN also includes hyphenation rules:

julia> parse(ISBN, "9781718502765")
ISBN:978-1-7185-0276-5

This uses one of the tricks I’m rather enjoying as of late, data .jl files that can be executed as self-updating shell scripts, see: isbn-hyphenation.jl.
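The executable-.jl trick rests on a well-known sh/Julia polyglot header; here’s a minimal sketch (the EXAMPLE_RANGES payload is hypothetical, not the real ISBN data):

```julia
#!/bin/sh
#=
exec julia --startup-file=no "$0" "$@"
=#
# As a shell script, the `#=` line is a comment and `exec` re-runs the file
# under julia (where it can fetch fresh data and rewrite itself); as a Julia
# file, `#= ... =#` is a block comment and `include` just loads the data.
const EXAMPLE_RANGES = Dict("978-0" => 0:19, "978-1" => 0:9)  # hypothetical payload
```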

Over in biology land, we provide a litany of identifiers you might encounter:

Protein & Structure

  • AFDB: AlphaFold Database protein structure predictions
  • PDB: Protein Data Bank macromolecular structures
  • UniProt: Universal Protein Resource sequences
  • UniRef: UniProt Reference Clusters
  • IntAct: Molecular interaction database
  • InterPro: Protein family/domain classification
  • Pfam: Protein family HMM profiles
  • PXD: ProteomeXchange dataset identifiers

Genomics & Genetics

  • ENSG, ENST, ENSP, ENSE, ENSR, ENSF, ENSFM: Ensembl genome annotations
  • NCBIGene: NCBI Entrez Gene database
  • RefSeq: NCBI Reference Sequence database
  • HGNC: HUGO Gene Nomenclature Committee
  • INSDC: International Nucleotide Sequence Database Collaboration
  • OMIM: Online Mendelian Inheritance in Man

Variation & Clinical

  • CA: ClinGen Allele Registry
  • ClinVar: Clinical genomic variant database
  • dbSNP: Single Nucleotide Polymorphism database
  • dbVar: Genomic structural variation database
  • GWAS: NHGRI-EBI GWAS Catalog studies

Expression & Functional Genomics

  • ArrayExpress: Functional genomics data archive
  • GEO: Gene Expression Omnibus

Studies & Samples

  • BioProject: Biological research projects
  • BioSample: Biological sample metadata
  • ClinicalTrials: ClinicalTrials.gov registry
  • dbGaP: Database of Genotypes and Phenotypes
  • EGA: European Genome-phenome Archive
  • RRID: Research Resource Identifiers
  • SRA: Sequence Read Archive

Ontologies & Controlled Vocabularies

  • CL: Cell Ontology
  • DOID: Disease Ontology
  • ECO: Evidence & Conclusion Ontology
  • EFO: Experimental Factor Ontology
  • GO: Gene Ontology
  • HPO: Human Phenotype Ontology
  • MeSH: Medical Subject Headings
  • MONDO: Monarch Disease Ontology
  • MP: Mammalian Phenotype Ontology
  • PATO: Phenotype And Trait Ontology
  • SO: Sequence Ontology
  • UBERON: Uber-anatomy Ontology

Chemical & Metabolic

  • ChEBI: Chemical Entities of Biological Interest
  • ChEMBL: Bioactive compound database
  • DrugBank: Drug and pharmaceutical database
  • HMDB: Human Metabolome Database
  • KEGG: Kyoto Encyclopedia of Genes and Genomes
  • MetaboLights: Metabolomics study archive
  • PubChem: Chemical compound/substance/assay database

Networks & Interactions

  • BioGRID: Biological interaction datasets
  • Reactome: Curated biological pathways
  • WikiPathways: Community pathway database

Cell Lines & Model Organisms

  • Cellosaurus: Cell line registry
  • FlyBase: Drosophila gene database
  • MGI: Mouse Genome Informatics
  • NCBITaxon: NCBI Taxonomy database
  • SGD: Saccharomyces Genome Database
  • WormBase: Caenorhabditis gene database

Showing off

For fun, a look at current parsing/printing performance: for fixed bases, PackedParselets parses and prints numbers faster than Base.

julia> @b string(rand(UInt64)) s->parse(UInt64, s)
32.227 ns

julia> @b rand(UInt64) s->print(devnull, s)
20.071 ns (2 allocs: 80 bytes)

julia> @defid MyU64 digits(UInt64)

julia> @b string(rand(UInt64)) s->parse(MyU64, s)
8.401 ns

julia> @b reinterpret(MyU64, rand(UInt64)) s->print(devnull, s)
7.526 ns (1 allocs: 48 bytes)