Optimizing performance with FASTX I/O stream and Codon Counting

Hello,

I am working to write code for codon usage bias, and my slowest step by far is calculating counts of individual codons for each gene in a fasta sequence of a genome.

The general idea is that I’m using FASTX to stream along each sequences in the fasta file, and then two for loops to count the occurence of each 3-letter codon. The output is a 64 x n matrix, where n is the number of sequences in the fasta file.

My function takes three inputs - the filepath to the fasta file (fna in this case, since it’s open reading frames), a vector of codons, and the number of sequences in the file. The codons and n look goofy here, but they’re generated from previous code that isn’t performance intensive, so I’ve just included them for the simplest example.

Operating on a six-sequence fasta takes ~500us, while whole genomes tend to take ~90ms - I feel like this could be a lot faster?

I’ve tried to improve performance with @inbounds, @views, and pre-allocating my matrix, and have only managed to shave ~4 seconds off. I would appreciate any other advice!

Here is my code, and I’ve included the fasta file as plain text below. Thanks!


# Dependencies
using BioSequences
using FASTX
using BenchmarkTools

# Setting up arguments
# The codons vector looks goofy I know - this is just for the example
fasta_seq = pwd()*"path/to/fasta.fasta" # I shared this file
codons = LongSequence{DNAAlphabet{2}}.(["AAA", "AAC", "AAG", "AAT", "ACA", "ACC", "ACG", "ACT", "AGA", "AGC", "AGG", "AGT", "ATA", "ATC", "ATG", "ATT", "CAA", "CAC", "CAG", "CAT", "CCA", "CCC", "CCG", "CCT", "CGA", "CGC", "CGG", "CGT", "CTA", "CTC", "CTG", "CTT", "GAA", "GAC", "GAG", "GAT", "GCA", "GCC", "GCG", "GCT", "GGA", "GGC", "GGG", "GGT", "GTA", "GTC", "GTG", "GTT", "TAA", "TAC", "TAG", "TAT", "TCA", "TCC", "TCG", "TCT", "TGA", "TGC", "TGG", "TGT", "TTA", "TTC", "TTG", "TTT"])


function minimal_count_codon_matrix(filepath, codons, n)
    seq_reader = FASTA.Reader(open(filepath, "r"),copy = false) # Open the io
    matrix = zeros(Int64, 64, n) # Initialize a matrix of zeros
    @inbounds for (j,sequence) in zip(1:n,seq_reader)
        dna = LongSequence{DNAAlphabet{2}}(FASTX.FASTA.sequence(sequence)) # Read actual DNA sequence
        len = length(dna)
        rem = len % 3 # Codons come in threes - so we'll use this to trim off excess nucleotides
        num_aa = Int64((len - rem)//3) # Find number of codons (i.e. amino acids)
            for i in 1:num_aa
                cod_i = (3i-2):3i # Define range of sequence for 1 codon
                codon = @views dna[cod_i] # Extract the DNA sequence of that codon
                pos = findfirst(x->x==codon, codons) # Find where in the codons vector the codon falls
                matrix[pos, j] += 1 # Add a "count" for that codon, for this sequence
            end
        end
        return matrix
end

minimal_count_codon_matrix(fasta_seq, codons, 6)
@benchmark minimal_count_codon_matrix(fasta_seq, codons, 6)
>a
TTCGGCCATTGGGAGGCCGACAGCGTCATCGGAGTGGGATGCAACCTGCACACCGAGGTGGAGAGGAGGACCAGGTTCCT
CATGGCCCGCATCGTTCCGGACAAGACCGCCGGCGAGAGCGTCGGGGCGCAGCTGGACATGTTCTCGCCGCTGCCGGCCG
GCGCGCGCGTCAGCGTCACGCACGACAACGGCACCGAGTTCGCCCACCACGAAAGGCTGCGCGACGGGCTGGGCATGGCC
ACGTACTTCGCCGACCCGTACTCCAGCTGGCAGCGGGGAAGCAACGAGAACCGCAACGGCATGATCCGCCGGTATCTGCC
CAAACGCTGCGAGATCCGGATGGACATGGCAAAGGAGGTCAGGGAGATCGTCGACGAGATCAACAACCGCCCCATGCGCG
TGCTCGGCTACCGCACGCCCGCCGAGGCGTTCGCCGACGAACTGTTAGAATTACAGGACCAACAAGGGTGTTGCACTTCT
AAATAG
>b
TTGCCTACTATTGAACAGCTCGTCCGTAAGGGACGTCAGGCAAAGCCGAAGAAGTCCAAGACTTTGGCCCTGAAGGGCAG
CCCGCTGCGTCGCGGCGTGTGCACCCGTGTTTACACCACCACCCCGAAGAAGCCGAACTCGGCTCTGCGTAAGGTTGCTC
GTGTGCGCCTGTCCTCGGGCATCGAAGTCACCGCCTACATTCCGGGCGAAGGCCACAATCTGCAGGAGCACTCCATCGTG
CTCGTGCGCGGCGGCCGTGTGAAGGATCTCCCGGGTGTGCGTTACCACATCGTGCGTGGCGCGCTCGATACCCAGGGTGT
CAAGGACCGTAAGCAGGGTCGTTCCCTGTATGGAGCAAAGAAGGCGAAGTAA
>c
ATGTCACGTAAGGGACCTTCCAAGAAGCACGTCGTGCTGCCCGATCCGATCTACGGTTCCACCGTGGTGGCGCAGCTCAT
CAACAAGATTCTGCTTGACGGCAAGAAGTCGATCGCCGAAGACATCGTCTACTCCGCGCTCGATATGGTCAAGGAGAAGT
CCGACCAGGAGCCGGTGGCTGTGCTCAAGCGCGCCCTGGACAACATCCGTCCGTCCCTCGAGGTTCGCTCCCGCCGTGTC
GGTGGCGCCACCTACCAGGTGCCGGTCGAGGTCAAGCCGAACCGCGCCAACACCCTGTCGCTGCGCTGGCTGACCGATTT
CTCCCGCGCCCGTCGTGAGAAGACCATGGCTGAGCGTCTCGCCAACGAGATCCTGGATGCCTCCAACGGCCTCGGTGCTT
CTGTCAAGCGCCGCGAGGATACTCATAAGATGGCAGAGGCTAACAAGGCCTTCGCTCATTACCGCTGGTAA
>d
ATGGCTGAAGAGATTTCGGACCTCCACGACGTCCGCAATATCGGCATCATGGCCCACATTGATGCCGGCAAGACCACCAC
TACCGAGCGTATTCTGTTCTACACCGGTAAGAACTACAAGATCGGCGAGACCCACGACGGCGCCTCCACCATGGACTTCA
TGGCGCAGGAGCAGGAACGTGGTATCACCATCCAGTCCGCTGCTACCACCTGCTTCTGGAGCCGTCAGTCCCACGACACC
AAGGACAAGTTCCAGATCAACATCATCGATACCCCTGGCCACGTGGACTTCACGGCCGAGGTGGAGCGCTCCCTGCGTGT
GCTCGATGGTGCCGTTGCCGTGTTCGACGGCAAGGAAGGCGTGGAGCCGCAGTCCGAAACCGTGTGGCGTCAGGCTGACA
AGTACGGCGTTCCGCGTATCTGCTTCATCAACAAGATGGACAAGCTGGGCGCGAACTTCTACTACTCCGTCGACACCATC
AAGGAGAAGCTGGGTGCTACTCCGATCGTGATGCAGCTGCCGATTGGTTCCGAGAACGACTTCACTGGCGTTGTCGACCT
GGTTGAGATGCAGGCTTACGTGTGGAACGGCACCGAGGAGCTCGGCGCCAAGTACGACACCACTGAAATCCCGGACGACC
TCAAGGACAAGGCTCAGGAGTACCACGAGAAGCTCGTGGAAGCCGCTGCCGAAGCCGACGACGACCTCATGAACAAGTTC
TTCGAGGACGGCGACCTGTCCAAGGAAGACATCCGCGCTGGCGTGCGTAAGCTCACCATCGCCAAGGAAGCCTTCCCGAT
CTTCTGCGGTTCCGCCTTCAAGGACAAGGGCGTTCAGCCGATGCTGGACGGCGTTGTCGATTACCTGCCGTCTCCTGAAG
ACGTGCCGGCCATCAAGGGCTACAAGCCTGGCGATGAGTCCGTCGAGATCGACCGTCACCCGGTCAAGTCCGATCCGTTC
GCGGCTCTGGTCTTCAAGATCTCCACCCACCCGTTCTACGGCAAGCTCGTGTTCGTGCGCGTCTACTCCGGCTCCGTTGT
GCCGGGCGACTCCGTGCTCGACTCCACCCGCGAGAAGAAGGAACGTATCGGCAAGATCTTCCAGATGCACGCCGATAAGG
AAAACCCGATGGACCGCGCTGACGCCGGTAACATCTACACCTTCGTGGGCCTGAAGAACGTGACCACCGGTGACACCCTG
TGCGCCATCGACGATCCGATCACCCTCGACTCCATGACCTTCCCGGATCCTGTGATCCAGGTGGCCGTCGAACCGAAGAC
CAAGGCCGACCAGGAGAAGATGGGCATCGCCCTGTCCAAGCTGGCCGAAGAGGATCCGACCTTCCAGGTCACGACCGACG
AAGAGTCCGGCCAGACCCTGATCGCCGGCATGGGCGAGCTCCAGCTCGACATCATCGTCGACCGTATGCGCCGCGAGTTC
AAGGTTGAGTGCAACCAGGGCAAGCCGCAGGTCGCCTACCGCGAGACCATCCGCAAGGCCGTCATGGACCAGGGCTACAC
CCACAAGAAGCAGACCGGTGGTTCCGGCCAGTTCGCAAAGGTCCTGATGAACTTCGAGCCGCTCGACACCACCGAGGGCA
AGACCTTCGAGTTCGAGAATAAGGTCACCGGCGGCCACATCTCCGCCGAATTCATCGGCCCCATCGAGGCCGGTGTCAAG
GAGGCCATGGAGTCCGGTGTTCTCGCCGGCTTCCCGGTCGTGGGCGTCAAGGCCACCGTCACTGACGGCCAGATGCACCC
GGTCGATTCCTCCGAAATGGCCTTCAAGCTCGCAGGTTCCATGTGCTTCAAGGAAGCTGCCCCGAAGGCCAAGCCGGTCA
TTCTCGAACCGATCATGAAGGTCGAAGTCCGTACGCCGGAAGAGTACATGGGCGAAGTCATCGGCGATCTGAACCAGCGC
CGTGGCAACATCCAGTCCATGACCGACGGTGTCGGCGTCAAGGTCATCGACGCCAAGGTCCCGCTGTCCGAGATGTTCGG
CTACATCGGCGACCTGCGTTCCAAGACCCAGGGTCGTGCCATGTTCACCATGGAGATGGACTCCTACGACGAGGTGCCGA
AGAGCGTCTCCGAGGAGATCATCAAGGCCCAGCGCGGCGAGTGA
>e
ATGGCAAAGGAAAAGTACGAGCGTACTAAGCCGCACGTTAACATCGGTACCATCGGCCACGTCGACCACGGTAAGACTAC
CCTGACTGCCGCCATCTCCAAGGTGCTGCACGAGGAGTTCCCGGATGTCAACCCGGAGTACGACTTCAACCAGATCGATT
CCGCTCCGGAAGAGGCCGCCCGCGGTATCACCATCAACATCGCCCACATCGAGTACCAGACCGAGAAGCGTCACTACGCT
CACGTCGACTGCCCGGGCCACGCCGACTTCGTGAAGAACATGATTACCGGTGCTGCCCAGATGGATGGCGCTATCCTCGT
TGTGGCCGCCACCGACGGCCCGATGGCCCAGACTCGCGAGCACGTGCTGCTCGCCCGTCAGGTTGGCGTTCCGAAGATCC
TCGTCGCCCTGAACAAGTGCGACATGGTCGACGATGAAGAGCTCATCGAGCTCGTCGAAGAAGAGGTCCGCGACCTCCTC
GACGAGAACGGCTTCGACCGTGACTGCCCGGTCATCCACACCTCCGCTTACGGTGCTCTGCACGACGACGCTCCGGACCA
CGAGAAGTGGGTCCAGTCCGTTAAGGACCTCATGGACGCTGTCGACGACTACATCCCGACCCCGGTTCACGACTTGGACA
AGCCGTTCCTGATGCCGATCGAGGACGTCTTCACCATCTCCGGCCGTGGTACCGTTGTCACCGGTCGTGTCGAGCGTGGC
CAGCTGGCCGTCAACACCCCGGTCGAGATCGTTGGTATCCGTCCGACCCAGCAGACCACCGTCACCTCCATCGAGACCTT
CCACAAGACCATGGACGCCTGCGAGGCTGGCGACAACACCGGTCTGCTTCTGCGTGGTCTCGGCCGTGACGATGTCGAGC
GTGGCCAGGTTGTGGCCAAGCCGGGCTCCGTCACCCCGCACACCAAGTTCGAGGGCGAAGTCTACGTGCTGACCAAGGAC
GAAGGCGGCCGTCACTCGCCGTTCTTCTCCAACTACCGTCCGCAGTTCTACTTCCGCACCACCGACGTCACCGGCGTCAT
CGAGCTGCCGGAAGGCGTCGAGATGGTTCAGCCGGGCGACCACGCTACCTTCACCGTTGAGCTGATTCAGCCCATCGCTA
TGGAGGAAGGCCTGACCTTCGCTGTGCGTGAAGGTGGCCACACCGTCGGCTCCGGTCGTGTGACCAAGATCCTCGCCTGA
>f
ATGATGACTGGTGCACAGGCCGCTCACTGTGGTTCTGTATCCGCAATTTCGCTGGGACTGCCCGTTTCCACGGCGATTCC
CGAAGCCAAAGGCACACTGCCGAAAGCCTTGTTCGTAGGCAAAGCGCCGATTTCCGGCAAACTCAAGCAGCGATTGGTGA
ACGAGATCGAATCCATCACCATGCTGGCGTTGGCCCGGTCGGCCAACACGGGGCTTGCGGACGGCAAACTGATCCCGGAA
GTGCTGGTCATCGGACTGAAGCTTACGGCCAAAGCCACCGGCATACCGAACGAAGTCATCGACCTGATCGCCGGACAGCG
CAAATCGGGCATCGTGTTCGTGTGCGTACGCGACGTGCCGTTCGAGGGCAGCACGCGTGAGGAATGCGCGTTCGCCGTGC
GCCGCGCCATACCGGGGCGTGCCGGGCACACGCCGATCTATCAGGTATATGCCGGCGACTGGCAGCCGGCGGGGGAGGCC
CAGCTGGAGCTGGCCGGCTCTACTATTGATGAGCTGTGGGCATCGCTGTGCTCACAGACCATTTTGGGTACGCCGGAGGT
CGAGAATCTGGATGCGCGCATCATTCGCCATACTGAAATTGCTCGGCTCGAATCCGAAGTGGACAAGCTCACTCGTGACC
ATCAGCGCGTCAAGAACCCGGCCCAGCGCAACGAGATCTATGCCAAACTCCACAAGGCCAAGGCGCAGCTCGCAAAACTG
CGCGAGGCATAG
1 Like

Hey @gus-pendleton

Exciting to see another bioinformatician interested in kmers!
Indeed, your code can be made much much faster. We ought to have a convenient kmer library in BioJulia, but unfortunately we don’t at the moment (see the my other post).
The three major inefficiencies I can see in your code is:

  • You convert the FASTA records to LongDNA{2} whereas the kmers could be computed from the raw FASTA sequence
  • You use findfirst to find the correct codon.

A better approach would be

  • Use codeunits(sequence(record)) to get a byte vector of each FASTA record
  • Represent your kmer as an UInt or similar
  • Iterate over the bytes. For every byte, incrementally construct the next kmer.
  • Use the kmer directly as an index in the matrix

Do something like this (you’d need to define TABLE as an NTuple{256, UInt8} lookup table with the bitwise encoding of each nucleotide and 0xff for ambiguous/invalid nucs.

open(FASTA.Reader, "aln.fna"; copy=false) do reader
           mask = UInt((1 << 12) - 1)
           for (seqno, record) in enumerate(reader)
               bytes = codeunits(sequence(record))
               kmer = UInt(0)
               remaining = 3
               for byte in bytes
                   value = TABLE[byte + 0x01]
                   remaining = ifelse(value == 0xff, 3, remaining-1)
                   remaining ≥ 0 && continue
                   kmer = ((kmer << 2) | value) & mask
                   matrix[kmer + 0x01, seqno] += 1
               end
           end
           matrix
       end

Again, this should just be in Kmers.jl.

For performance, I implemented something similar to this in an upcoming package of mine, and it processes a 400 MB file in 1.4 seconds.

2 Likes

Couldn’t help myself: The code below is an attempt of doing a similar thing for arbitrary K.
It can still be made faster:

  • The if/else spam in the hot loop could be micro-optimised to reduce branch mispredictions
  • Make use of @inbounds
  • Preallocate larger parts of the buffer in chunks to prevent more reallocations of the vector during append!
  • Count with Int32 instead of Int to reduce memory footprint

Nonetheless, this counts a 32 MB in about 200 ms on my computer.

using FASTX: FASTAReader, sequence

const TABLE = let
    table = fill(0xff, 2^8)
    for (n, v) in [('A', 0), ('C', 1), ('G', 2), ('T', 3), ('U', 3)]
        table[UInt8(n) + 1] = v
        table[UInt8(lowercase(n)) + 1] = v
    end
    for n in "SWKYMRBDHVN"
        table[UInt8(n) + 1] = 0xf0
        table[UInt8(lowercase(n)) + 1] = 0xf0
    end
    Tuple(table)
end

function count_kmers!(
    vector::AbstractVector{<:Integer},
    seq::AbstractString,
    ::Val{K}
) where K
    fill!(vector, 0)
    mask = UInt(1 << 2K - 1)
    kmer = UInt(0)
    remaining = K
    for codeunit in codeunits(seq)
        value = TABLE[codeunit + 0x01]
        if value == 0xff
            throw(DomainError(codeunit, "Cannot interpret as nucleotide"))
        elseif value == 0xf0
            remaining = 4
        else
            remaining -= 1
        end
        if remaining < 1
            kmer = (kmer << 2 | value) & mask
            vector[kmer + 1] += 1
        end
    end
end

function count_kmers(reader::FASTAReader, ::Val{K}) where K
    buffer = zeros(Int, 4^K)
    result = Int[]
    for record in reader
        count_kmers!(buffer, sequence(record), Val{K}())
        append!(result, buffer)
    end
    reshape(result, 4^K, :)
end
            
function count_kmers(path::AbstractString, ::Val{K}) where K
    open(FASTAReader, path; copy=false) do reader
        count_kmers(reader, Val{K}())
    end
end
4 Likes

WOW - this is amazing! I’m going to chew my way through this code and I’ll follow-up with those other recommended optimizations :slight_smile:

I will definitely defer to Jakob on the optimizations here, and ditto the excitement about your interest in doing sequence analysis in julia! I just wanted to mention two trivial style optimizations.

  1. When dealing with paths, I highly recommend using joinpath() rather than string concatenation. In other words, I’d re-write your file assignment as fastq_seq = joinpath(pwd(), "path", "to", "fasta.fasta"). The primary benefits are that joinpath() handles any os-specific weirdness (in case you ever care to make your code work on Windows for example), and it’s easier to extend.
  2. For writing your codons, I know you said you copied it from elsewhere, and also Jakob is telling you to operate on bytes instead of Strings (which is good advice), but it’s an opportunity to tell you about Iterators (which has a lot of nice stuff, and is loaded by default). In particular:
julia> codons = prod.(Iterators.product("ACGT", "ACGT", "ACGT"))
4×4×4 Array{String, 3}:
[:, :, 1] =
 "AAA"  "ACA"  "AGA"  "ATA"
 "CAA"  "CCA"  "CGA"  "CTA"
 "GAA"  "GCA"  "GGA"  "GTA"
 "TAA"  "TCA"  "TGA"  "TTA"

[:, :, 2] =
 "AAC"  "ACC"  "AGC"  "ATC"
 "CAC"  "CCC"  "CGC"  "CTC"
 "GAC"  "GCC"  "GGC"  "GTC"
 "TAC"  "TCC"  "TGC"  "TTC"

[:, :, 3] =
 "AAG"  "ACG"  "AGG"  "ATG"
 "CAG"  "CCG"  "CGG"  "CTG"
 "GAG"  "GCG"  "GGG"  "GTG"
 "TAG"  "TCG"  "TGG"  "TTG"

[:, :, 4] =
 "AAT"  "ACT"  "AGT"  "ATT"
 "CAT"  "CCT"  "CGT"  "CTT"
 "GAT"  "GCT"  "GGT"  "GTT"
 "TAT"  "TCT"  "TGT"  "TTT"

(and you could always wrap that in vec() if you really need a Vector rather than 3D array). Using the stuff in Iterators can be really great for simple performance boosters, because you don’t need to allocate any arrays. Eg, the above could be written as

julia> itr = Iterators.map(prod, Iterators.product("ACGT", "ACGT", "ACGT"))
Base.Generator{Base.Iterators.ProductIterator{Tuple{String, String, String}}, typeof(prod)}(prod, Base.Iterators.ProductIterator{Tuple{String, String, String}}(("ACGT", "ACGT", "ACGT")))
1 Like

Very helpful!

Yes I’m switching to bytes now, much faster! The joinpath() is also very useful, I am trying to decide whether the final function will expect to user to provide full filepaths, or operate within the working directory.

Jakob, your code worked lovely, with a few tweaks in the count_kmers!() function. I reset remaining to 3 after every codon cycle, and moved the kmer re-definition into the first if block. I also sprinkled in some @inbounds, which shave a few ms off. I’ll keep working on the if-else and the preallocation!

using FASTX: FASTAReader, sequence
const TABLE = let
    table = fill(0xff, 2^8)
    for (n, v) in [('A', 0), ('C', 1), ('G', 2), ('T', 3), ('U', 3)] 
        table[UInt8(n) + 1] = v 
        table[UInt8(lowercase(n)) + 1] = v 
    end
    for n in "SWKYMRBDHVN"
        table[UInt8(n) + 1] = 0xf0
        table[UInt8(lowercase(n)) + 1] = 0xf0
    end
    Tuple(table)
end

function count_kmers!(
    vector::AbstractVector{<:Integer},
    seq::AbstractString, 
    ::Val{K} 
) where K 
    fill!(vector, 0) 
    mask = UInt(1 << 2K - 1) 
    remaining = K
    kmer = UInt8(0)
    for codeunit in codeunits(seq)
        value = TABLE[codeunit + 0x01] 
        if value == 0xff
            throw(DomainError(codeunit, "Cannot interpret as nucleotide"))
        elseif value == 0xf0
            remaining = K # This was K + 1, but since the kmer definition is now in the final else, can just be K
        else
            remaining -= 1
            kmer = (kmer << 2 | value) & mask # This needed to be moved outside of the second if block
        end
        if remaining < 1
            @inbounds vector[kmer + 1] += 1
            remaining = K # This needed to be reset per codon "cycle"
        end
    end
end

function count_kmers(reader::FASTAReader, ::Val{K}) where K
    buffer = zeros(Int, 4^K) 
    result = Int32[]
    for record in reader
        count_kmers!(buffer, sequence(record), Val{K}())
        @inbounds append!(result, buffer)
    end
    @inbounds reshape(result, 4^K, :)
end

function count_kmers(path::AbstractString, ::Val{K}) where K
    open(FASTAReader, path; copy=false) do reader
        count_kmers(reader, Val{K}())
    end
end