Optimizing performance with FASTX I/O stream and Codon Counting

jakobnissen · April 26, 2023, 4:57am

Exciting to see another bioinformatician interested in kmers!
Indeed, your code can be made much much faster. We ought to have a convenient kmer library in BioJulia, but unfortunately we don’t at the moment (see the my other post).
The three major inefficiencies I can see in your code is:

You convert the FASTA records to LongDNA{2} whereas the kmers could be computed from the raw FASTA sequence
You use findfirst to find the correct codon.

A better approach would be

Use codeunits(sequence(record)) to get a byte vector of each FASTA record
Represent your kmer as an UInt or similar
Iterate over the bytes. For every byte, incrementally construct the next kmer.
Use the kmer directly as an index in the matrix

Do something like this (you’d need to define TABLE as an NTuple{256, UInt8} lookup table with the bitwise encoding of each nucleotide and 0xff for ambiguous/invalid nucs.

open(FASTA.Reader, "aln.fna"; copy=false) do reader
           mask = UInt((1 << 12) - 1)
           for (seqno, record) in enumerate(reader)
               bytes = codeunits(sequence(record))
               kmer = UInt(0)
               remaining = 3
               for byte in bytes
                   value = TABLE[byte + 0x01]
                   remaining = ifelse(value == 0xff, 3, remaining-1)
                   remaining ≥ 0 && continue
                   kmer = ((kmer << 2) | value) & mask
                   matrix[kmer + 0x01, seqno] += 1
               end
           end
           matrix
       end

Again, this should just be in Kmers.jl.

For performance, I implemented something similar to this in an upcoming package of mine, and it processes a 400 MB file in 1.4 seconds.

Topic		Replies	Views
Developing a kmer counter in Julia Biology, Health, and Medicine question	7	2410	September 19, 2019
Optimizing dinucleotides count in a DNA sequence type `LongDNA` Performance question , performance , strings , biology	21	842	July 17, 2023
How to improve a Generator to be more memory efficient when it is collected? Performance question , regex , optimization , iterators	21	980	January 23, 2023
For loop performance vs "functional" performance Performance advent-of-code	17	519	January 4, 2025
Julia programs now shown on benchmarks game website Community announcement	144	13745	December 3, 2019

Optimizing performance with FASTX I/O stream and Codon Counting

Related topics