[ANN] BioToolkit.jl: A high-performance biocomputing package

I am pleased to announce the initial release of BioToolkit.jl, a bioinformatics toolkit for Julia. It provides a unified, Biopython-inspired API for sequence analysis, structural biology, phylogenetics, and population genetics.

While the Julia ecosystem has excellent modular packages, users often have to work with multiple dependencies (BioSequences, XAM, etc.). BioToolkit aims to provide a complete experience with a focus on performance and ease of use, and to be a drop-in solution for common bioinformatics pipelines. It was originally intended as a replacement for Biopython, but it also incorporates some features from R’s Bioconductor ecosystem. Its main advantage over Bioconductor is the use of multi-threading and CUDA for certain functions with O(N^2) complexity; this should make BioToolkit.jl faster, although that has not been benchmarked yet.

Performance Comparison

It features original implementations optimized for speed. Preliminary benchmarks show significant improvements over Python baselines and competitive performance within the Julia ecosystem:

  • Hamming Distance: Up to 328× faster than Biopython.
  • BED Parsing: 5× faster than existing Julia parsers.
  • Phylogenetics (NJ Tree): 168× faster than Biopython.
  • GPU Acceleration: Native CUDA support for k-mer counting and coverage histograms.

(More benchmarks, along with the code used to run them against Biopython, are available in the repository.)
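To give a flavor of the kind of operation being benchmarked, here is a minimal, dependency-free Hamming distance over byte sequences in plain Julia. This is an illustrative sketch only, not BioToolkit.jl’s actual implementation:

```julia
# Illustrative sketch only (NOT BioToolkit.jl's implementation): count
# mismatching positions between two equal-length byte sequences with a
# tight loop the compiler can vectorize.
function hamming(a::AbstractVector{UInt8}, b::AbstractVector{UInt8})
    length(a) == length(b) || throw(ArgumentError("sequences must have equal length"))
    d = 0
    @inbounds @simd for i in eachindex(a, b)
        d += a[i] != b[i]  # Bool promotes to Int when added
    end
    return d
end

hamming(codeunits("GATTACA"), codeunits("GACTATA"))  # 2 mismatches
```

Even a naive loop like this is far faster than a pure-Python character-by-character comparison, which is part of why such Biopython baselines are easy to beat.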

Feature Highlights

It provides APIs for:

  • Sequence: Transcription, translation, codon usage (CAI), k-mer analysis.
  • Structure: PDB/mmCIF parsing, superposition (Kabsch), SASA, contact maps.
  • Omics: Differential expression, single-cell workflows, GWAS, epigenomics.
  • I/O: Readers for FASTA, FASTQ, BAM, BED, GFF, VCF, and GenBank.

and many more.
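For a flavor of the sequence-level operations listed above, here is a minimal, dependency-free sketch of transcription and GC content in plain Julia. These function names are illustrative only and are not BioToolkit.jl’s actual API:

```julia
# Illustrative sketches only (NOT BioToolkit.jl's API).

# Transcription: DNA -> RNA is just T -> U on the coding strand.
transcribe(dna::AbstractString) = replace(uppercase(dna), 'T' => 'U')

# GC content: fraction of G/C bases in the sequence.
gc_content(seq::AbstractString) =
    count(c -> c in ('G', 'C', 'g', 'c'), seq) / length(seq)

transcribe("ATGC")   # "AUGC"
gc_content("ATGC")   # 0.5
```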

Installation

The package is currently unregistered but can be installed via:

julia
using Pkg
Pkg.add(url="https://github.com/Aditya747S/BioToolkit.jl.git")

Links

As this is an initial development release (v0.1.0-dev), there may be rough edges; all suggestions and criticisms are welcome.

9 Likes

FYI, Bio.jl also started as an all-in-one package, but was later split into multiple packages (maybe someone else can explain why).

In general it would be better to reuse existing libraries when possible (BAM handling → XAM.jl, HMMs → HiddenMarkovModels.jl, …). As you say:

This approach reduces the complexity of managing multiple dependencies and ensures seamless integration between different modules. Also, this means that users don’t need to change the formats/datatypes when switching between different functionalities, which can be a common source of errors and inefficiencies in multi-package ecosystems like R’s biocomputing library.

But ironically as a user I have to deal with your BamRecord and XAM’s BAM.Record.

Nice!

In PDBTools.jl we have a SASA implementation that appears to be faster than what you have there. It might be relatively straightforward to move it to an independent package so that the infrastructure can be shared, if you think that is worthwhile:

julia> s = BioToolkit.read_mmcif("./7KCR.cif")
Structure(7KCR, models=1)

julia> @time BioToolkit.structure_sasa(s)
  9.164858 seconds (270.15 M allocations: 10.008 GiB, 17.59% gc time)
4.419359635171172e6
julia> using PDBTools

julia> s = read_mmcif("7KCR.cif");

julia> @time sasa_particles(s)
  4.205989 seconds (8.50 M allocations: 1.133 GiB, 5.93% gc time)
PDBTools.SASA{3, Vector{Atom{Nothing}}}
    Number of particles: 900300
    Total SASA: 4.420882e6
    Output of dots: false 

A Julia package and any submodules are loaded all at once. For smaller packages, that could be fine, but larger packages will cause a noticeable import latency. If you only need one submodule or even just one function of it, importing all the others is just a waste of time. So, there’s a tendency for packages to be split into multiple packages when they reach a certain scale. Other languages can have different ways to partially load monolithic packages; for example, import scipy actually imports very little, and users are instructed to instead explicitly import the substantial submodules, or as of 1.9+, lazily load them by attribute name.
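The eager-loading point can be illustrated with a toy example (module names here are made up purely for illustration):

```julia
# Toy illustration: a Julia module evaluates ALL of its submodules when it
# is loaded, so `using` a monolithic package pays for everything at once.
module ToyToolkit
    module Sequences
        loaded() = true
    end
    module Structures
        loaded() = true
    end
end

# Bringing in ToyToolkit makes both submodules available; there is no
# built-in way to load only ToyToolkit.Sequences without also evaluating
# ToyToolkit.Structures.
ToyToolkit.Sequences.loaded() && ToyToolkit.Structures.loaded()  # true
```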

2 Likes

It’s great to have more people doing biology with julia! Some of the rest of this post may seem like criticism, so I want to say up front that it’s not - julia is a great language for people who like to “roll their own”, and you are welcome to find value wherever you like.

That said, it would also be useful to have a discussion about the limitations that you found in the rest of the ecosystem, and if you’ve truly found performance benefits, how we might share those with the rest of the community.

For example, you say you have 5x faster BED parsing, but you also have BED.jl in your deps. Can you upstream the performance improvements you’ve made? You mention speedups for Hamming distance and NJ tree relative to BioPython, but have you compared to the existing julia implementations?

I can’t help but notice that your first commit was two days ago, and you have tens of thousands of lines of code. Of course, a lot of us are using AI for coding these days and there’s nothing inherently wrong with that, but I am a bit skeptical that you’re going to come up with the best implementation idea using this approach compared to folks that have spent huge amounts of time implementing. For example, you have single cell analysis code that materializes dense matrices everywhere instead of taking advantage of the amazing work Rasmus has done in SingleCellProjections.jl.

There may be value in a monolithic package that pulls a lot of functionality in - certainly many biologists expect that. In BioJulia we decided to move away from this model as @jonathanBieler mentioned, in part just because of the maintenance burden, but also for the reasons @Benny mentioned. Others are free to disagree of course, but I think there would be a lot more value in

  1. developing docs or tutorials that show how to perform the analyses with existing julia packages
  2. if you truly have innovations and performance gains, upstream those to appropriate packages
  3. If you feel you have something that doesn’t fit 1 or 2, have a discussion with those of us who know the ecosystem, and see if there’s another perspective that you (or your AI agents) haven’t thought of.

Just for fun, I asked Claude to analyze the package and see what already exists in the julia ecosystem. It is not flattering.

This package is almost entirely a reimplementation of the Julia bioinformatics ecosystem

The scale of duplication is remarkable. Nearly every module reinvents something that already exists, often worse.

Claude's analysis of duplications

sequence.jl (~1500 lines)

Reimplements BioJulia’s BioSequences.jl from scratch. reverse_complement, gc_content, translate_dna, transcribe_dna, k-mer counting, FASTA/FASTQ parsing — all of these exist in BioSequences.jl + FASTX.jl, with proper parametric sequence types, compile-time alphabet encoding, and correct handling of all IUPAC codes. The custom FASTQ parser here reinvents FASTX.jl’s reader. The FASTA random-access indexing reimplements what BioSequences.jl + FASTA.jl handle properly. Even the Hamming distance bit-trick is available in BioAlignments.jl.

align.jl (~1740 lines)

Reimplements pairwise alignment (Smith-Waterman, Needleman-Wunsch) and hardcodes BLOSUM62 and other substitution matrices as text strings to be parsed at runtime. BioAlignments.jl does all of this, correctly, with proper affine gap penalties and a full substitution matrix library via BioSymbols.jl.

msa.jl (~1476 lines)

Reimplements multiple sequence alignment handling: a MultipleSequenceAlignment struct, column annotations, consensus sequences, conservation scoring, gap masking, phylogenetic weighting. MolecularEvolution.jl and BioJulia’s MSA packages cover this. The Clustal/MUSCLE interface via external process calls reimplements what Automa.jl-based parsers already handle.

phylo.jl (~1912 lines)

Reimplements neighbor-joining, UPGMA, Newick/Nexus/PhyloXML/NeXML parsing and writing, Robinson-Foulds distance, bootstrap consensus trees, parsimony scoring, and substitution models (JC69, K80, HKY85) with Felsenstein pruning. Phylo.jl handles trees, MolecularEvolution.jl handles substitution models and likelihood. The custom PhyloTree struct is a mutable linked tree with a Dict{String,String} metadata field — significantly less capable than established packages.

hmm.jl (~210 lines)

Reimplements Viterbi, forward, and backward algorithms. HMMBase.jl and HiddenMarkovModels.jl exist and are well-tested.

differentialexpression.jl (~769 lines)

Reimplements DESeq2-style analysis: negative binomial GLM fitting, dispersion estimation, LFC shrinkage, VST, ComBat batch correction, surrogate variable estimation. DESeq2.jl wraps the canonical R implementation. The GLM fitting here rolls its own IRLS rather than using GLM.jl.

genomicranges.jl (~796 lines)

Reimplements genomic interval operations: find_overlaps, coverage, flank, promoters, nearest, etc. GenomicFeatures.jl and IntervalTrees.jl do this with proper interval tree indexing rather than the sorted-array approach here.

bam.jl (~1080 lines)

Reimplements BAM reading. XAM.jl is the canonical Julia BAM/SAM library. This is one of the higher-risk reimplementations — BAM parsing is tricky to get right.

popgen.jl (~2026 lines)

Reimplements population genetics: HWE tests, FST/Fst/GST, AMOVA, LD mapping, PCoA, coalescent simulations, etc. PopGen.jl is a mature BioJulia package for exactly this.

structure.jl (~3035 lines, the largest file)

Reimplements PDB/mmCIF parsing, coordinate manipulation, RMSD, structure superposition, secondary structure assignment, and more. BioStructures.jl is a well-maintained Julia package for exactly this, including Ramachandran analysis and PDB downloading.

gwas.jl (~647 lines)

Reimplements PLINK file reading, linear/mixed model GWAS scans, PRS calculation, and LD clumping. JWAS.jl and SnpArrays.jl cover this properly.

microbiome.jl, proteomics.jl, clinical.jl

Implement Bray-Curtis, UniFrac, PCoA, Cox PH regression, Kaplan-Meier, mass spec peak detection. All have established Julia packages: MicrobiomeAnalysis.jl, Survival.jl, ROC.jl.

Summary

The entire package is essentially a monolithic reinvention of BioJulia and related packages, wrapped in a single namespace. Almost none of it uses BioSequences, BioAlignments, BioStructures, GenomicFeatures, XAM, Phylo.jl, PopGen.jl, or any of the other established Julia bioinformatics libraries it overlaps with.

The reimplementations consistently lack the correctness properties of the originals: no parametric type safety on biological alphabets, no Automa-based parsers, no interval trees for genomic overlap queries, no proper affine gap handling, plain Dict and String representations where typed sequence types would be appropriate. Several files contain explicit # Bug fix: comments, which suggests the code was generated and then partially debugged rather than written by someone with domain expertise.

The comparison files in Examples/Comparison/ are revealing — they suggest the package was designed to demonstrate feature parity with Biopython rather than integrate with the Julia ecosystem.

8 Likes

Oh, I also noticed that you have a custom license that’s mostly MIT, with the added:

  1. The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

  2. Publication Restriction: You may NOT publish scientific papers, articles,
    or technical reports that primarily describe the Software itself (e.g.,
    “meet the new BioToolkit.jl” or performance comparisons or similar “software notes”) without prior
    written permission from the copyright holder.

    You ARE permitted to publish scientific papers that use the Software to
    generate results or as a tool within a broader study, provided the Software
    is cited appropriately.

Not sure why, but this rubs me the wrong way :person_shrugging:

4 Likes

Does that mean lmiq’s SASA comparison unknowingly violated copyright?

Thank you, everyone, for your feedback.

Thank you, Leandro, for pointing out that BioToolkit.structure_sasa is slower than PDBTools’ implementation. I have now fixed it; the slowdown was caused by unnecessary reads and writes. Both implementations are now equally fast.

For the choice of monolithic design, I think I need to give a bit of a deep dive. I am not a bio major but a physics major, and in physics, when we need HPC-capable software, we usually use a monolithic design: although it is hard to maintain, it ultimately means the user has to write less boilerplate code.

For example, if someone wants to build an RNA-Seq pipeline in BioJulia, they can use XAM.jl and GenomicFeatures.jl, but those lack the statistical methods to perform the analysis, so people have to go back to Biopython or R, or use GLM.jl to build the models - and they will still lack the domain-specific normalizations (like TMM) and shrinkage estimators that BioToolkit.jl bundles. That is a lot of boilerplate code. As another example, BioJulia’s GeneticVariation.jl handles VCF/BCF parsing and basic statistics, but it has no association-testing engine: it cannot natively run a GWAS or calculate PRS, so users typically have to call out to PLINK or use R/Python packages. BioToolkit.jl bridges this gap entirely within Julia.

Kevin mentioned packages like SingleCellProjections.jl. These packages exist but are not as mature and integrated as Seurat, so using them is similar to writing your own implementation. The same goes for some epigenomics and PopGen workflows: if users want to use BioJulia, they have to write a lot of boilerplate code again and again, which may mean they don’t migrate at all. A monolithic package is hard to maintain and work with, but it means the maintainer is the one doing the boilerplate part.

I know the implementations in BioToolkit.jl are not perfect; that is why I made the license custom. If someone wants to publish proper scientific research in computation, it must be reproducible, and I cannot guarantee that with BioToolkit.jl as of now. Once I feel the package is good enough, I will change the license to GPL or MIT. For example, the BED implementation used in this package is faster, but it is also a lot simpler, so it is not a one-to-one replacement, and such benchmarks can be misleading if not done and reported carefully. I hope I was able to explain my choices properly.

Also, as Kevin’s Claude analysis notes, DESeq2.jl exists, but it is just a wrapper; ultimately, if we want HPC code, someone will have to rewrite it in Julia, and it is because of things like this that I haven’t used many pre-existing libraries. BED.jl is in the dependencies because I used it for testing; it is not used in the toolkit itself. Thanks for pointing that out - I will remove it.

For the rest of the queries, I will look into them and try to fix them. The reason this repo was created two days ago is that I was testing the code in a different directory and made the repo after the package was done. Yes, quite a lot of testing and refactoring was done using AI, which is why there are comments like #BugFixes; I left them because I wanted to keep track of all the changes it made. The analysis Kevin shared using Claude was very helpful as well, because I don’t have a Claude subscription and Claude does a good job at coding, so thank you, Kevin, for sharing that report. I will try to make the algorithms more robust and fix the problems you listed. I hope you all keep testing this, because, as I said earlier, I am a physics major and can only do synthetic and light testing.

Cheers,
Aditya

1 Like

I do hope the critiques aren’t too discouraging - it’s great to get involved in open source, and hopping on a forum like this can be daunting, especially when you’re an undergrad and a lot of folks on here have been doing this stuff professionally for decades.

This community as a whole is very supportive, and if you’re interested in learning and can handle some push back, you will find a wealth of opportunities here to grow as a developer.

This is an important use case - I do a lot of HPC work as well, though I’m not sure a monorepo as a julia package is the right level to solve it. You can have a julia project (e.g. a Project.toml and Manifest.toml) without trying to make a package, and you get a lot of the features that the various python packaging systems have tried to bolt on over decades.
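For reference, setting up such a project environment (rather than a package) is only a few lines of Pkg commands. The directory and package names below are placeholders for whatever a given pipeline needs:

```julia
# Sketch of creating a standalone Julia project environment (not a package).
# This generates Project.toml and Manifest.toml in ./my_pipeline; the package
# names are placeholders.
using Pkg
Pkg.activate("my_pipeline")           # create/use ./my_pipeline as an environment
Pkg.add(["XAM", "GenomicFeatures"])   # recorded in Project.toml; exact versions pinned in Manifest.toml
Pkg.status()                          # list the environment's dependencies
```

Anyone with the same Project.toml/Manifest.toml can then reproduce the environment with `Pkg.instantiate()`, which addresses much of the reproducibility concern without a monolithic package.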

But that’s really up to you, as I said. There are some who would argue for a monolith. At the same time, I think re-implementing a lot of the algorithms is the wrong approach. If you’re going to incur the loading time @Benny mentioned anyway, you might as well get all of the performance benefits that folks have spent lots of time implementing. For example, the suggestion that using SingleCellProjections.jl would be similar to writing your own implementation

is not really true. If you used SCP in your package for all the transformation and normalization steps, but just wrote your interface wrapper around them, you could get the benefits of the package without losing the integration you’re looking for. SCP is not a Seurat clone. Since you found it helpful before, here’s Claude’s verdict:

SingleCellProjections.jl (BioJulia) is a sophisticated, production-grade package. Its distinguishing design is that it never materializes large dense matrices — instead, it stores data as lazy MatrixExpressions (e.g., A + B₁B₂B₃ for SCTransform output). It has a proper ProjectionModel abstraction so every transformation step is replayable and composable, enabling first-class projection of new data onto existing fits. The code is carefully threading-aware, handles sparse arithmetic natively in the SCTransform and log-normalize steps (operating on nzrange/nonzeros directly), and wraps everything in a DataMatrix struct with typed variable/observation annotation tables.

BioToolkit.jl singlecell.jl is a much simpler, more monolithic implementation. It stores a SingleCellExperiment with a plain SparseMatrixCSC, Vector{String} gene/cell IDs, and Dicts for reductions and clusters. All pipelines eagerly materialize dense Matrix{Float64}. There are no projection models, no reusable fit objects, and no lazy matrix expressions.

Like I said, if you’re open to other approaches, asking here on this forum is a great way to learn. If you just want to do your own thing, that’s fine too, but I would encourage you to look at what other folks are doing and try to learn from them. In many cases, there’s a good reason for the choices that were made.

6 Likes