Hi,
Thank you for using our Bio.jl package.
I tried two modifications to improve the performance of k-mer counting in Julia:
- Specifying the k parameter as a type parameter.
- Using a naive array instead of a dictionary.
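To illustrate improvement 1 in isolation, here is a minimal sketch in plain Base Julia (modern syntax, no Bio.jl; the function names are made up for this example). When k arrives as a runtime value the compiler cannot infer the result type, but when it arrives as a type parameter, the function is specialized for each k:

```julia
# k passed as a runtime value: the tuple length, and hence the return
# type, is unknown at compile time, so the call is type-unstable.
make_tuple_runtime(k::Int) = ntuple(i -> i, k)

# k passed as a type parameter via Val: the compiler specializes the
# method for each k, so the return type is fully inferable.
make_tuple_static(::Val{k}) where {k} = ntuple(i -> i, Val(k))

make_tuple_runtime(3)       # (1, 2, 3), but with an abstract inferred type
make_tuple_static(Val(3))   # (1, 2, 3), with a concrete inferred type
```

The DNAKmer{k} signature in count_kmers2 below exploits the same mechanism.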
In the following code, count_kmers is your implementation, count_kmers2 uses improvement 1, and count_kmers3 uses both of them:
using Bio.Seq

function count_kmers(k::Int64, filenames::Array{String, 1})::Dict{DNAKmer{k}, Int64}
    record = FASTASeqRecord{BioSequence{DNAAlphabet{2}}}()
    result = Dict{DNAKmer{k}, Int64}()
    for filename in filenames
        open(FASTAReader{BioSequence{DNAAlphabet{2}}}, filename) do reader
            while !eof(reader)
                read!(reader, record)
                for (pos, kmer) in each(DNAKmer{k}, record.seq)
                    count = get(result, kmer, 0)
                    result[kmer] = count + 1
                end
            end
        end
    end
    return result
end
function count_kmers2{k}(::Type{DNAKmer{k}}, filenames)
    record = FASTASeqRecord{BioSequence{DNAAlphabet{2}}}()
    result = Dict{DNAKmer{k}, Int64}()
    for filename in filenames
        open(FASTAReader{BioSequence{DNAAlphabet{2}}}, filename) do reader
            while !eof(reader)
                read!(reader, record)
                for (pos, kmer) in each(DNAKmer{k}, record.seq)
                    count = get(result, kmer, 0)
                    result[kmer] = count + 1
                end
            end
        end
    end
    return result
end
function count_kmers3{k}(::Type{DNAKmer{k}}, filenames)
    record = FASTASeqRecord{BioSequence{DNAAlphabet{2}}}()
    result = zeros(Int64, 4^k)
    for filename in filenames
        open(FASTAReader{BioSequence{DNAAlphabet{2}}}, filename) do reader
            while !eof(reader)
                read!(reader, record)
                for (pos, kmer) in each(DNAKmer{k}, record.seq)
                    @inbounds result[reinterpret(Int64, kmer)+1] += 1
                end
            end
        end
    end
    return result
end
count_kmers(12, ["out.fa"])
@time count_kmers(12, ["out.fa"])
count_kmers2(DNAKmer{12}, ["out.fa"])
@time count_kmers2(DNAKmer{12}, ["out.fa"])
count_kmers3(DNAKmer{12}, ["out.fa"])
@time count_kmers3(DNAKmer{12}, ["out.fa"])
I tested the performance with 10,000 random DNA sequences of 10,000 bp:
~/.j/v/Bio (master↓2|…) $ julia kmercount.jl
53.595571 seconds (577.54 M allocations: 18.597 GB, 4.33% gc time)
24.425677 seconds (30.23 k allocations: 1.060 GB, 1.51% gc time)
4.528298 seconds (30.05 k allocations: 128.771 MB, 0.34% gc time)
Roughly speaking, improvement 1 halved the elapsed time, and improvements 1+2 together were ~12x faster than the original function, though for large k the array requires much more memory than a dictionary.
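The memory cost is easy to quantify: the Int64 count array takes 4^k * 8 bytes, and the ~128 MB allocated in the k = 12 run above is essentially this array.

```julia
# Memory for the flat count array of Int64 counters.
array_bytes(k) = 4^k * sizeof(Int64)

array_bytes(12) / 2^20  # 128 MiB, matching the k = 12 timing above
array_bytes(16) / 2^30  # 32 GiB: impractical, a dictionary wins here
```

So the array approach only pays off while 4^k stays comfortably below the k-mer count a dictionary would actually hold.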
Parallelizing the k-mer counter should also be possible. The simplest way would be to use the @parallel macro over the filenames and then merge the per-file counts. AFAIK there is no lock-free hash table implementation in Julia, so if you need still more performance, you may have to write a specialized concurrent dictionary for that purpose.
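The merge step itself is cheap for both representations. A hedged sketch (plain Base Julia; the helper names are illustrative): count vectors as returned by count_kmers3 combine elementwise with +, and dictionaries combine with merge given + as the combiner.

```julia
# Combine per-file count vectors (count_kmers3-style results).
merge_array_counts(results) = reduce(+, results)

# Combine per-file count dictionaries (count_kmers/count_kmers2-style results).
merge_dict_counts(results) = merge(+, results...)

merge_array_counts([[1, 0, 2], [0, 3, 1]])  # → [1, 3, 3]
merge_dict_counts([Dict("AC" => 1), Dict("AC" => 2, "GT" => 1)])
```

Because both merges are associative, the per-file results can be reduced in any order, which is exactly what a parallel map-then-reduce over filenames needs.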
Another option I came up with is using a binary file format for the input files. I’m not sure whether parsing FASTA files is a major bottleneck here, but in general parsing text is much slower than parsing binary, so BAM or a similar format might improve the performance further.