Optimizing performance with FASTX I/O stream and Codon Counting

Hey @gus-pendleton

Exciting to see another bioinformatician interested in kmers!
Indeed, your code can be made much much faster. We ought to have a convenient kmer library in BioJulia, but unfortunately we don’t at the moment (see the my other post).
The three major inefficiencies I can see in your code is:

  • You convert the FASTA records to LongDNA{2} whereas the kmers could be computed from the raw FASTA sequence
  • You use findfirst to find the correct codon.

A better approach would be

  • Use codeunits(sequence(record)) to get a byte vector of each FASTA record
  • Represent your kmer as an UInt or similar
  • Iterate over the bytes. For every byte, incrementally construct the next kmer.
  • Use the kmer directly as an index in the matrix

Do something like this (you’d need to define TABLE as an NTuple{256, UInt8} lookup table with the bitwise encoding of each nucleotide and 0xff for ambiguous/invalid nucs.

open(FASTA.Reader, "aln.fna"; copy=false) do reader
           mask = UInt((1 << 12) - 1)
           for (seqno, record) in enumerate(reader)
               bytes = codeunits(sequence(record))
               kmer = UInt(0)
               remaining = 3
               for byte in bytes
                   value = TABLE[byte + 0x01]
                   remaining = ifelse(value == 0xff, 3, remaining-1)
                   remaining ≥ 0 && continue
                   kmer = ((kmer << 2) | value) & mask
                   matrix[kmer + 0x01, seqno] += 1
               end
           end
           matrix
       end

Again, this should just be in Kmers.jl.

For performance, I implemented something similar to this in an upcoming package of mine, and it processes a 400 MB file in 1.4 seconds.

2 Likes