Developing a kmer counter in Julia

Hi,
I was looking at some data using the Jellyfish program (a fast parallel k-mer counter written in C++) and I was wondering how close I could get to its performance in Julia. Just so you know where I’m coming from, my input data is long-read PacBio: for E. coli I have 3 FASTA files of about 350 MB each, 105,451 reads, with an average read length of about 9,610 bp. I wrote this naive counter:

using Bio.Seq

function count_kmers(k::Int64, filenames::Array{String, 1})::Dict{DNAKmer{k}, Int64}
    record = FASTASeqRecord{BioSequence{DNAAlphabet{2}}}()
    result = Dict{DNAKmer{k}, Int64}()
    for filename in filenames
        open(FASTAReader{BioSequence{DNAAlphabet{2}}}, filename) do reader
            while !eof(reader)
                read!(reader, record)
                for (pos, kmer) in each(DNAKmer{k}, record.seq)
                    count = get(result, kmer, 0)
                    result[kmer] = count + 1
                end
            end
        end
    end
    return result
end

It performs decently; the obvious optimizations (e.g. two-bit nucleotides and encoding the k-mer as an integer) come “for free” with Bio.jl.

What would be some next steps toward a better-performing k-mer counter? I’m not really familiar with the literature beyond the Jellyfish paper, which uses a multithreaded, lock-free hash table to exploit parallelism while reading.


Hi,

Thank you for using our Bio.jl package :wink:.

I tried two modifications to improve the performance of k-mer counting in Julia:

  1. Specifying the k parameter as a type parameter.
  2. Using a naive array instead of a dictionary.

In the following code, count_kmers is your implementation, count_kmers2 uses improvement 1, and count_kmers3 uses both:

using Bio.Seq

function count_kmers(k::Int64, filenames::Array{String, 1})::Dict{DNAKmer{k}, Int64}
    record = FASTASeqRecord{BioSequence{DNAAlphabet{2}}}()
    result = Dict{DNAKmer{k}, Int64}()
    for filename in filenames
        open(FASTAReader{BioSequence{DNAAlphabet{2}}}, filename) do reader
            while !eof(reader)
                read!(reader, record)
                for (pos, kmer) in each(DNAKmer{k}, record.seq)
                    count = get(result, kmer, 0)
                    result[kmer] = count + 1
                end
            end
        end
    end
    return result
end

function count_kmers2{k}(::Type{DNAKmer{k}}, filenames)
    record = FASTASeqRecord{BioSequence{DNAAlphabet{2}}}()
    result = Dict{DNAKmer{k}, Int64}()
    for filename in filenames
        open(FASTAReader{BioSequence{DNAAlphabet{2}}}, filename) do reader
            while !eof(reader)
                read!(reader, record)
                for (pos, kmer) in each(DNAKmer{k}, record.seq)
                    count = get(result, kmer, 0)
                    result[kmer] = count + 1
                end
            end
        end
    end
    return result
end

function count_kmers3{k}(::Type{DNAKmer{k}}, filenames)
    record = FASTASeqRecord{BioSequence{DNAAlphabet{2}}}()
    result = zeros(Int64, 4^k)
    for filename in filenames
        open(FASTAReader{BioSequence{DNAAlphabet{2}}}, filename) do reader
            while !eof(reader)
                read!(reader, record)
                for (pos, kmer) in each(DNAKmer{k}, record.seq)
                    @inbounds result[reinterpret(Int64, kmer)+1] += 1
                end
            end
        end
    end
    return result
end

count_kmers(12, ["out.fa"])
@time count_kmers(12, ["out.fa"])

count_kmers2(DNAKmer{12}, ["out.fa"])
@time count_kmers2(DNAKmer{12}, ["out.fa"])

count_kmers3(DNAKmer{12}, ["out.fa"])
@time count_kmers3(DNAKmer{12}, ["out.fa"])

I tested the performance with 10,000 random DNA sequences of 10,000 bp:

~/.j/v/Bio (master↓2|…) $ julia kmercount.jl
 53.595571 seconds (577.54 M allocations: 18.597 GB, 4.33% gc time)
 24.425677 seconds (30.23 k allocations: 1.060 GB, 1.51% gc time)
  4.528298 seconds (30.05 k allocations: 128.771 MB, 0.34% gc time)

Roughly speaking, improvement 1 halved the elapsed time, and 1+2 together was ~13x faster than the original function, though using an array requires much more memory than a dictionary for large k.

Parallelizing the k-mer counter should also be possible. The simplest way would be to use the @parallel macro or pmap over the filenames and then merge the per-file counts. AFAIK there is no lock-free hash table implementation in Julia, so if you still need more performance, you may need to write a specialized dictionary for that purpose.
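For example, here is a minimal, untested sketch of that per-file approach using pmap and Base’s merge with a combiner function; the file names are placeholders, and count_kmers2 from above is assumed to be defined on every worker:

addprocs(2)                      # one worker per input file, adjust as needed
@everywhere using Bio.Seq
# @everywhere include("kmercount.jl")     # make count_kmers2 available on the workers

filenames = ["reads1.fa", "reads2.fa"]    # placeholder file names
partial = pmap(f -> count_kmers2(DNAKmer{12}, [f]), filenames)
total = merge(+, partial...)     # sum counts for k-mers that occur in several files

Note that each worker builds its own dictionary, so peak memory grows with the number of workers.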

Another option I came up with is using a binary file format for the input files. I’m not sure whether parsing FASTA files is a major bottleneck here, but in general parsing text files is much slower than parsing binary files, so maybe BAM or something similar could improve the performance further.


Thank you for developing BioJulia! It’s a testament to its good design that I was able to immediately write the naive code and it handled the basic optimizations for me.

It’s terrific that specifying k as a type parameter is another quick win. I’m still getting used to Julia; I should have seen that one. Thanks so much!

I thought about using an array, but I’m looking at k-mers with k roughly 14-18 right now. A 12-mer count array takes about (4 ^ 12) * 8 == 134_217_728 bytes (~128 MiB), but an 18-mer array would need 549_755_813_888 bytes (~512 GiB).
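A quick back-of-the-envelope helper (hypothetical, just to make the scaling explicit):

dense_array_bytes(k) = 4^k * sizeof(Int64)   # one Int64 counter per possible k-mer

dense_array_bytes(12)  # 134217728      (~128 MiB)
dense_array_bytes(18)  # 549755813888   (~512 GiB)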

Your simple suggestion for parallelizing makes sense; I’ll give it a shot. My worry, of course, is that memory gets swamped by having n tables.

I think that an implementation using lock-free queues and tables à la Jellyfish would need some low-level atomic CAS instruction, which I don’t think exists in Julia; I looked and found nothing. Any hope for something along these lines in a future Julia?

In Julia v0.5 we do have support for atomics with Base.Threads.Atomic.
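For instance, a minimal sketch of those primitives (the variable names are just illustrative):

using Base.Threads

counter = Atomic{Int64}(0)
atomic_add!(counter, 1)            # atomic increment, returns the old value
old = atomic_cas!(counter, 1, 42)  # compare-and-swap: store 42 only if the current value is 1
counter[]                          # read the current value, here 42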


Thanks, I don’t know how I missed that!

Maybe the 2bit format would help here?

Yes, maybe. But the 2bit file format is specialized for genome sequences and may not be the best tool for storing millions of short/medium-length sequences. BAM or other PacBio file formats would be a better choice here.

This is an old thread, but I thought I would mention that reads can now be stored in binary on disk, and accessed randomly or buffered in a stream, thanks to BioJulia/ReadDatastores.jl.
