Hey @gus-pendleton
Exciting to see another bioinformatician interested in kmers!
Indeed, your code can be made much much faster. We ought to have a convenient kmer library in BioJulia, but unfortunately we don’t at the moment (see the my other post).
The three major inefficiencies I can see in your code is:
- You convert the FASTA records to
LongDNA{2}
whereas the kmers could be computed from the raw FASTA sequence - You use
findfirst
to find the correct codon.
A better approach would be
- Use
codeunits(sequence(record))
to get a byte vector of each FASTA record - Represent your kmer as an
UInt
or similar - Iterate over the bytes. For every byte, incrementally construct the next kmer.
- Use the kmer directly as an index in the matrix
Do something like this (you’d need to define TABLE as an NTuple{256, UInt8}
lookup table with the bitwise encoding of each nucleotide and 0xff for ambiguous/invalid nucs.
open(FASTA.Reader, "aln.fna"; copy=false) do reader
mask = UInt((1 << 12) - 1)
for (seqno, record) in enumerate(reader)
bytes = codeunits(sequence(record))
kmer = UInt(0)
remaining = 3
for byte in bytes
value = TABLE[byte + 0x01]
remaining = ifelse(value == 0xff, 3, remaining-1)
remaining ≥ 0 && continue
kmer = ((kmer << 2) | value) & mask
matrix[kmer + 0x01, seqno] += 1
end
end
matrix
end
Again, this should just be in Kmers.jl.
For performance, I implemented something similar to this in an upcoming package of mine, and it processes a 400 MB file in 1.4 seconds.