[ANN] BioToolkit.jl: A high-performance biocomputing package

It’s great to have more people doing biology with julia! Some of the rest of this post may seem like criticism, so I want to say up front that it’s not - julia is a great language for people that like to “roll their own”, and you are welcome to find value wherever you like.

That said, it would also be useful to have a discussion about the limitations that you found in the rest of the ecosystem, and if you’ve truly found performance benefits, how we might share those with the rest of the community.

For example, you say you have 5x faster BED parsing, but you also have BED.jl in your deps. Can you upstream the performance improvements you’ve made? You mention speedups for Hamming distance and NJ tree relative to BioPython, but have you compared to the existing julia implementations?

I can’t help but notice that your first commit was two days ago, and you have tens of thousands of lines of code. Of course, a lot of us are using AI for coding these days and there’s nothing inherently wrong with that, but I am a bit skeptical that you’re going to come up with the best implementation idea using this approach compared to folks that have spent huge amounts of time implementing. For example, you have single cell analysis code that materializes dense matrices everywhere instead of taking advantage of the amazing work Rasmus has done in SingleCellProjections.jl.

There may be value in a monolithic package that pulls a lot of functionality in - certainly many biologists expect that. In BioJulia we decided to move away from this model as @jonathanBieler mentioned, in part just because of the maintenance burden, but also for the reasons @Benny mentioned. Others are free to disagree of course, but I think there would be a lot more value in

  1. developing docs or tutorials that show how to perform the analyses with existing julia packages
  2. if you truly have innovations and performance gains, upstream those to appropriate packages
  3. If you feel you have something that doesn’t fit 1 or 2, have a discussion with those of use that know the ecosystem, and see if there’s another perspective that you (or your AI agents) haven’t thought of.

Just for fun, I asked claude to analyze the package and see what already exists in the julia ecosystem. It is not flattering

This package is almost entirely a reimplementation of the Julia bioinformatics ecosystem

The scale of duplication is remarkable. Nearly every module reinvents something that already exists, often worse.

Claude's analysis of duplications

sequence.jl (~1500 lines)

Reimplements BioJulia’s BioSequences.jl from scratch. reverse_complement, gc_content, translate_dna, transcribe_dna, k-mer counting, FASTA/FASTQ parsing — all of these exist in BioSequences.jl + FASTX.jl, with proper parametric sequence types, compile-time alphabet encoding, and correct handling of all IUPAC codes. The custom FASTQ parser here reinvents FASTX.jl’s reader. The FASTA random-access indexing reimplements what BioSequences.jl + FASTA.jl handle properly. Even the Hamming distance bit-trick is available in BioAlignments.jl.

align.jl (~1740 lines)

Reimplements pairwise alignment (Smith-Waterman, Needleman-Wunsch) and hardcodes BLOSUM62 and other substitution matrices as text strings to be parsed at runtime. BioAlignments.jl does all of this, correctly, with proper affine gap penalties and a full substitution matrix library via BioSymbols.jl.

msa.jl (~1476 lines)

Reimplements multiple sequence alignment handling: a MultipleSequenceAlignment struct, column annotations, consensus sequences, conservation scoring, gap masking, phylogenetic weighting. MolecularEvolution.jl and BioJulia’s MSA packages cover this. The Clustal/MUSCLE interface via external process calls reimplements what Automa.jl-based parsers already handle.

phylo.jl (~1912 lines)

Reimplements neighbor-joining, UPGMA, Newick/Nexus/PhyloXML/NeXML parsing and writing, Robinson-Foulds distance, bootstrap consensus trees, parsimony scoring, and substitution models (JC69, K80, HKY85) with Felsenstein pruning. Phylo.jl handles trees, MolecularEvolution.jl handles substitution models and likelihood. The custom PhyloTree struct is a mutable linked tree with a Dict{String,String} metadata field — significantly less capable than established packages.

hmm.jl (~210 lines)

Reimplements Viterbi, forward, and backward algorithms. HMMBase.jl and HiddenMarkovModels.jl exist and are well-tested.

differentialexpression.jl (~769 lines)

Reimplements DESeq2-style analysis: negative binomial GLM fitting, dispersion estimation, LFC shrinkage, VST, ComBat batch correction, surrogate variable estimation. DESeq2.jl wraps the canonical R implementation. The GLM fitting here rolls its own IRLS rather than using GLM.jl.

genomicranges.jl (~796 lines)

Reimplements genomic interval operations: find_overlaps, coverage, flank, promoters, nearest, etc. GenomicFeatures.jl and IntervalTrees.jl do this with proper interval tree indexing rather than the sorted-array approach here.

bam.jl (~1080 lines)

Reimplements BAM reading. XAM.jl is the canonical Julia BAM/SAM library. This is one of the higher-risk reimplementations — BAM parsing is tricky to get right.

popgen.jl (~2026 lines)

Reimplements population genetics: HWE tests, FST/Fst/GST, AMOVA, LD mapping, PCoA, coalescent simulations, etc. PopGen.jl is a mature BioJulia package for exactly this.

structure.jl (~3035 lines, the largest file)

Reimplements PDB/mmCIF parsing, coordinate manipulation, RMSD, structure superposition, secondary structure assignment, and more. BioStructures.jl is a well-maintained Julia package for exactly this, including Ramachandran analysis and PDB downloading.

gwas.jl (~647 lines)

Reimplements PLINK file reading, linear/mixed model GWAS scans, PRS calculation, and LD clumping. JWAS.jl and SnpArrays.jl cover this properly.

microbiome.jl, proteomics.jl, clinical.jl

Implement Bray-Curtis, UniFrac, PCoA, Cox PH regression, Kaplan-Meier, mass spec peak detection. All have established Julia packages: MicrobiomeAnalysis.jl, Survival.jl, ROC.jl.

Summary

The entire package is essentially a monolithic reinvention of BioJulia and related packages, wrapped in a single namespace. Almost none of it uses BioSequences, BioAlignments, BioStructures, GenomicFeatures, XAM, Phylo.jl, PopGen.jl, or any of the other established Julia bioinformatics libraries it overlaps with.

The reimplementations consistently lack the correctness properties of the originals: no parametric type safety on biological alphabets, no Automa-based parsers, no interval trees for genomic overlap queries, no proper affine gap handling, plain Dict and String representations where typed sequence types would be appropriate. Several files contain explicit # Bug fix: comments, which suggests the code was generated and then partially debugged rather than written by someone with domain expertise.

The comparison files in Examples/Comparison/ are revealing — they suggest the package was designed to demonstrate feature parity with Biopython rather than integrate with the Julia ecosystem.

8 Likes