Hello,
I am working to write code for codon usage bias, and my slowest step by far is calculating counts of individual codons for each gene in a fasta sequence of a genome.
The general idea is that I’m using FASTX to stream along each sequences in the fasta file, and then two for loops to count the occurence of each 3-letter codon. The output is a 64 x n matrix, where n is the number of sequences in the fasta file.
My function takes three inputs - the filepath to the fasta file (fna in this case, since it’s open reading frames), a vector of codons, and the number of sequences in the file. The codons and n look goofy here, but they’re generated from previous code that isn’t performance intensive, so I’ve just included them for the simplest example.
Operating on a six-sequence fasta takes ~500us, while whole genomes tend to take ~90ms - I feel like this could be a lot faster?
I’ve tried to improve performance with @inbounds, @views, and pre-allocating my matrix, and have only managed to shave ~4 seconds off. I would appreciate any other advice!
Here is my code, and I’ve included the fasta file as plain text below. Thanks!
# Dependencies
using BioSequences
using FASTX
using BenchmarkTools
# Setting up arguments
# The codons vector looks goofy I know - this is just for the example
fasta_seq = pwd()*"path/to/fasta.fasta" # I shared this file
codons = LongSequence{DNAAlphabet{2}}.(["AAA", "AAC", "AAG", "AAT", "ACA", "ACC", "ACG", "ACT", "AGA", "AGC", "AGG", "AGT", "ATA", "ATC", "ATG", "ATT", "CAA", "CAC", "CAG", "CAT", "CCA", "CCC", "CCG", "CCT", "CGA", "CGC", "CGG", "CGT", "CTA", "CTC", "CTG", "CTT", "GAA", "GAC", "GAG", "GAT", "GCA", "GCC", "GCG", "GCT", "GGA", "GGC", "GGG", "GGT", "GTA", "GTC", "GTG", "GTT", "TAA", "TAC", "TAG", "TAT", "TCA", "TCC", "TCG", "TCT", "TGA", "TGC", "TGG", "TGT", "TTA", "TTC", "TTG", "TTT"])
function minimal_count_codon_matrix(filepath, codons, n)
seq_reader = FASTA.Reader(open(filepath, "r"),copy = false) # Open the io
matrix = zeros(Int64, 64, n) # Initialize a matrix of zeros
@inbounds for (j,sequence) in zip(1:n,seq_reader)
dna = LongSequence{DNAAlphabet{2}}(FASTX.FASTA.sequence(sequence)) # Read actual DNA sequence
len = length(dna)
rem = len % 3 # Codons come in threes - so we'll use this to trim off excess nucleotides
num_aa = Int64((len - rem)//3) # Find number of codons (i.e. amino acids)
for i in 1:num_aa
cod_i = (3i-2):3i # Define range of sequence for 1 codon
codon = @views dna[cod_i] # Extract the DNA sequence of that codon
pos = findfirst(x->x==codon, codons) # Find where in the codons vector the codon falls
matrix[pos, j] += 1 # Add a "count" for that codon, for this sequence
end
end
return matrix
end
minimal_count_codon_matrix(fasta_seq, codons, 6)
@benchmark minimal_count_codon_matrix(fasta_seq, codons, 6)
>a
TTCGGCCATTGGGAGGCCGACAGCGTCATCGGAGTGGGATGCAACCTGCACACCGAGGTGGAGAGGAGGACCAGGTTCCT
CATGGCCCGCATCGTTCCGGACAAGACCGCCGGCGAGAGCGTCGGGGCGCAGCTGGACATGTTCTCGCCGCTGCCGGCCG
GCGCGCGCGTCAGCGTCACGCACGACAACGGCACCGAGTTCGCCCACCACGAAAGGCTGCGCGACGGGCTGGGCATGGCC
ACGTACTTCGCCGACCCGTACTCCAGCTGGCAGCGGGGAAGCAACGAGAACCGCAACGGCATGATCCGCCGGTATCTGCC
CAAACGCTGCGAGATCCGGATGGACATGGCAAAGGAGGTCAGGGAGATCGTCGACGAGATCAACAACCGCCCCATGCGCG
TGCTCGGCTACCGCACGCCCGCCGAGGCGTTCGCCGACGAACTGTTAGAATTACAGGACCAACAAGGGTGTTGCACTTCT
AAATAG
>b
TTGCCTACTATTGAACAGCTCGTCCGTAAGGGACGTCAGGCAAAGCCGAAGAAGTCCAAGACTTTGGCCCTGAAGGGCAG
CCCGCTGCGTCGCGGCGTGTGCACCCGTGTTTACACCACCACCCCGAAGAAGCCGAACTCGGCTCTGCGTAAGGTTGCTC
GTGTGCGCCTGTCCTCGGGCATCGAAGTCACCGCCTACATTCCGGGCGAAGGCCACAATCTGCAGGAGCACTCCATCGTG
CTCGTGCGCGGCGGCCGTGTGAAGGATCTCCCGGGTGTGCGTTACCACATCGTGCGTGGCGCGCTCGATACCCAGGGTGT
CAAGGACCGTAAGCAGGGTCGTTCCCTGTATGGAGCAAAGAAGGCGAAGTAA
>c
ATGTCACGTAAGGGACCTTCCAAGAAGCACGTCGTGCTGCCCGATCCGATCTACGGTTCCACCGTGGTGGCGCAGCTCAT
CAACAAGATTCTGCTTGACGGCAAGAAGTCGATCGCCGAAGACATCGTCTACTCCGCGCTCGATATGGTCAAGGAGAAGT
CCGACCAGGAGCCGGTGGCTGTGCTCAAGCGCGCCCTGGACAACATCCGTCCGTCCCTCGAGGTTCGCTCCCGCCGTGTC
GGTGGCGCCACCTACCAGGTGCCGGTCGAGGTCAAGCCGAACCGCGCCAACACCCTGTCGCTGCGCTGGCTGACCGATTT
CTCCCGCGCCCGTCGTGAGAAGACCATGGCTGAGCGTCTCGCCAACGAGATCCTGGATGCCTCCAACGGCCTCGGTGCTT
CTGTCAAGCGCCGCGAGGATACTCATAAGATGGCAGAGGCTAACAAGGCCTTCGCTCATTACCGCTGGTAA
>d
ATGGCTGAAGAGATTTCGGACCTCCACGACGTCCGCAATATCGGCATCATGGCCCACATTGATGCCGGCAAGACCACCAC
TACCGAGCGTATTCTGTTCTACACCGGTAAGAACTACAAGATCGGCGAGACCCACGACGGCGCCTCCACCATGGACTTCA
TGGCGCAGGAGCAGGAACGTGGTATCACCATCCAGTCCGCTGCTACCACCTGCTTCTGGAGCCGTCAGTCCCACGACACC
AAGGACAAGTTCCAGATCAACATCATCGATACCCCTGGCCACGTGGACTTCACGGCCGAGGTGGAGCGCTCCCTGCGTGT
GCTCGATGGTGCCGTTGCCGTGTTCGACGGCAAGGAAGGCGTGGAGCCGCAGTCCGAAACCGTGTGGCGTCAGGCTGACA
AGTACGGCGTTCCGCGTATCTGCTTCATCAACAAGATGGACAAGCTGGGCGCGAACTTCTACTACTCCGTCGACACCATC
AAGGAGAAGCTGGGTGCTACTCCGATCGTGATGCAGCTGCCGATTGGTTCCGAGAACGACTTCACTGGCGTTGTCGACCT
GGTTGAGATGCAGGCTTACGTGTGGAACGGCACCGAGGAGCTCGGCGCCAAGTACGACACCACTGAAATCCCGGACGACC
TCAAGGACAAGGCTCAGGAGTACCACGAGAAGCTCGTGGAAGCCGCTGCCGAAGCCGACGACGACCTCATGAACAAGTTC
TTCGAGGACGGCGACCTGTCCAAGGAAGACATCCGCGCTGGCGTGCGTAAGCTCACCATCGCCAAGGAAGCCTTCCCGAT
CTTCTGCGGTTCCGCCTTCAAGGACAAGGGCGTTCAGCCGATGCTGGACGGCGTTGTCGATTACCTGCCGTCTCCTGAAG
ACGTGCCGGCCATCAAGGGCTACAAGCCTGGCGATGAGTCCGTCGAGATCGACCGTCACCCGGTCAAGTCCGATCCGTTC
GCGGCTCTGGTCTTCAAGATCTCCACCCACCCGTTCTACGGCAAGCTCGTGTTCGTGCGCGTCTACTCCGGCTCCGTTGT
GCCGGGCGACTCCGTGCTCGACTCCACCCGCGAGAAGAAGGAACGTATCGGCAAGATCTTCCAGATGCACGCCGATAAGG
AAAACCCGATGGACCGCGCTGACGCCGGTAACATCTACACCTTCGTGGGCCTGAAGAACGTGACCACCGGTGACACCCTG
TGCGCCATCGACGATCCGATCACCCTCGACTCCATGACCTTCCCGGATCCTGTGATCCAGGTGGCCGTCGAACCGAAGAC
CAAGGCCGACCAGGAGAAGATGGGCATCGCCCTGTCCAAGCTGGCCGAAGAGGATCCGACCTTCCAGGTCACGACCGACG
AAGAGTCCGGCCAGACCCTGATCGCCGGCATGGGCGAGCTCCAGCTCGACATCATCGTCGACCGTATGCGCCGCGAGTTC
AAGGTTGAGTGCAACCAGGGCAAGCCGCAGGTCGCCTACCGCGAGACCATCCGCAAGGCCGTCATGGACCAGGGCTACAC
CCACAAGAAGCAGACCGGTGGTTCCGGCCAGTTCGCAAAGGTCCTGATGAACTTCGAGCCGCTCGACACCACCGAGGGCA
AGACCTTCGAGTTCGAGAATAAGGTCACCGGCGGCCACATCTCCGCCGAATTCATCGGCCCCATCGAGGCCGGTGTCAAG
GAGGCCATGGAGTCCGGTGTTCTCGCCGGCTTCCCGGTCGTGGGCGTCAAGGCCACCGTCACTGACGGCCAGATGCACCC
GGTCGATTCCTCCGAAATGGCCTTCAAGCTCGCAGGTTCCATGTGCTTCAAGGAAGCTGCCCCGAAGGCCAAGCCGGTCA
TTCTCGAACCGATCATGAAGGTCGAAGTCCGTACGCCGGAAGAGTACATGGGCGAAGTCATCGGCGATCTGAACCAGCGC
CGTGGCAACATCCAGTCCATGACCGACGGTGTCGGCGTCAAGGTCATCGACGCCAAGGTCCCGCTGTCCGAGATGTTCGG
CTACATCGGCGACCTGCGTTCCAAGACCCAGGGTCGTGCCATGTTCACCATGGAGATGGACTCCTACGACGAGGTGCCGA
AGAGCGTCTCCGAGGAGATCATCAAGGCCCAGCGCGGCGAGTGA
>e
ATGGCAAAGGAAAAGTACGAGCGTACTAAGCCGCACGTTAACATCGGTACCATCGGCCACGTCGACCACGGTAAGACTAC
CCTGACTGCCGCCATCTCCAAGGTGCTGCACGAGGAGTTCCCGGATGTCAACCCGGAGTACGACTTCAACCAGATCGATT
CCGCTCCGGAAGAGGCCGCCCGCGGTATCACCATCAACATCGCCCACATCGAGTACCAGACCGAGAAGCGTCACTACGCT
CACGTCGACTGCCCGGGCCACGCCGACTTCGTGAAGAACATGATTACCGGTGCTGCCCAGATGGATGGCGCTATCCTCGT
TGTGGCCGCCACCGACGGCCCGATGGCCCAGACTCGCGAGCACGTGCTGCTCGCCCGTCAGGTTGGCGTTCCGAAGATCC
TCGTCGCCCTGAACAAGTGCGACATGGTCGACGATGAAGAGCTCATCGAGCTCGTCGAAGAAGAGGTCCGCGACCTCCTC
GACGAGAACGGCTTCGACCGTGACTGCCCGGTCATCCACACCTCCGCTTACGGTGCTCTGCACGACGACGCTCCGGACCA
CGAGAAGTGGGTCCAGTCCGTTAAGGACCTCATGGACGCTGTCGACGACTACATCCCGACCCCGGTTCACGACTTGGACA
AGCCGTTCCTGATGCCGATCGAGGACGTCTTCACCATCTCCGGCCGTGGTACCGTTGTCACCGGTCGTGTCGAGCGTGGC
CAGCTGGCCGTCAACACCCCGGTCGAGATCGTTGGTATCCGTCCGACCCAGCAGACCACCGTCACCTCCATCGAGACCTT
CCACAAGACCATGGACGCCTGCGAGGCTGGCGACAACACCGGTCTGCTTCTGCGTGGTCTCGGCCGTGACGATGTCGAGC
GTGGCCAGGTTGTGGCCAAGCCGGGCTCCGTCACCCCGCACACCAAGTTCGAGGGCGAAGTCTACGTGCTGACCAAGGAC
GAAGGCGGCCGTCACTCGCCGTTCTTCTCCAACTACCGTCCGCAGTTCTACTTCCGCACCACCGACGTCACCGGCGTCAT
CGAGCTGCCGGAAGGCGTCGAGATGGTTCAGCCGGGCGACCACGCTACCTTCACCGTTGAGCTGATTCAGCCCATCGCTA
TGGAGGAAGGCCTGACCTTCGCTGTGCGTGAAGGTGGCCACACCGTCGGCTCCGGTCGTGTGACCAAGATCCTCGCCTGA
>f
ATGATGACTGGTGCACAGGCCGCTCACTGTGGTTCTGTATCCGCAATTTCGCTGGGACTGCCCGTTTCCACGGCGATTCC
CGAAGCCAAAGGCACACTGCCGAAAGCCTTGTTCGTAGGCAAAGCGCCGATTTCCGGCAAACTCAAGCAGCGATTGGTGA
ACGAGATCGAATCCATCACCATGCTGGCGTTGGCCCGGTCGGCCAACACGGGGCTTGCGGACGGCAAACTGATCCCGGAA
GTGCTGGTCATCGGACTGAAGCTTACGGCCAAAGCCACCGGCATACCGAACGAAGTCATCGACCTGATCGCCGGACAGCG
CAAATCGGGCATCGTGTTCGTGTGCGTACGCGACGTGCCGTTCGAGGGCAGCACGCGTGAGGAATGCGCGTTCGCCGTGC
GCCGCGCCATACCGGGGCGTGCCGGGCACACGCCGATCTATCAGGTATATGCCGGCGACTGGCAGCCGGCGGGGGAGGCC
CAGCTGGAGCTGGCCGGCTCTACTATTGATGAGCTGTGGGCATCGCTGTGCTCACAGACCATTTTGGGTACGCCGGAGGT
CGAGAATCTGGATGCGCGCATCATTCGCCATACTGAAATTGCTCGGCTCGAATCCGAAGTGGACAAGCTCACTCGTGACC
ATCAGCGCGTCAAGAACCCGGCCCAGCGCAACGAGATCTATGCCAAACTCCACAAGGCCAAGGCGCAGCTCGCAAAACTG
CGCGAGGCATAG