ANN - Towards a (Bio)Julia powered Genome Graphs framework

Was that in a fresh new project? Can I see what happens upon Pkg add? GenomeGraphs currently doesen’t depend on BioAlignments so I’m a bit confused without seeing the console msges.

It works in a fresh environment yes, and it’s indeed BioAlignments that is causing an incompatibility (I wanted to get reads from a bam file).

Hello lovely fellow Julians!

I haven’t left an update since January so I thought I had better fill you in on progress on the genome graph framework.

ReadDatastores has gotten an update and a new release pending which allows you to choose to create stores for LongSequence{DNAAlphabet{2}}s or LongSequence{DNAAlphabet{4}}s now, rather than forcing you to use only DNAAlphabet{4} - so if you know your reads dont have IUPAC ambiguities, this will likely save you a lot more disk-space and may be faster (haven’t benchmarked yet).

A bug fix has also made it’s way into ReadDatastores which should reduce the number of buffer refreshes that occur, when streaming sequences from a buffered datastore.

Now ReadDatastores is in a state I like a bit more, I’m going to first update GenomeGraphs to use the new ReadDatastores and begin breaking out some of the Kmer counter data-structures, as the ability to project kmer counts and coverage over sequences in general - not just graphs - is - I’m finding - an insanely useful thing to do!

4 Likes

Hey guys, some progress and updates on the genome assembly and genome graph analysis framework:

I took the time to start breaking out the kmer counting functionality - barebones as it was - into its own package MerCounting.jl.

Currently the serial and all in RAM method of counting kmers that is used for GenomeGraphs.jl’s deBruijn graph builder made it in, but I wanted a more serious kmer counter algorithm, or rather a more serious choice.

So I’m thinking of 3 counter methods to begin with: 1). The serial-in-RAM one (called serial_mem) that already exists, for when the problem is trivial or memory isn’t and issue - I have access to a cluster configured for large memory requirements rather than speed, and sometimes its quicker and easier to just throw a big vector at the thing rather than spend time on something more clever. 2). A distributed-in-RAM one that uses “Distributed” (called dist_mem) - again not so concerned with memory, but speeds up serial_mem by cutting the ReadDatastores into equal-ish chunks and processes them according to serial_mem and then combines the result. I guess if your julia processes were spread across multiple machines in a cluster or such, this could help with memory. 3). A distributed counter that uses disk batches. This one does a pre-analysis of a ReadDatastore to look at the distribution of minimizers, and will allocate N “bins” that the minimizers are distributed across as evenly as possible. Then multiple processes will divide up the reads, kmerize them, and throw the kmers into RemoteChannels (bins) based on their minimizer. These channels are read by a series of writer processes that send the kmer to a disk file.
These bin files can then be processes individually and in parallel, and unlike the other two counting methods, which return a vector of sroted counts, this will probably return a KmerCountDatastore or some other disk backed store of the kmer counts (similar to the ReadDatastores I guess).

Anyway here’s a preview of testing that dist_mem works:

julia> using Distributed, BioSequences, ReadDatastores

julia> addprocs(4, exeflags="--project=.")
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> @everywhere using MerCounting

julia> ds = buffer(@openreads "ecoli-pe.prseq")
Buffered Paired Read Datastore 'ecoli-pe': 20 reads (10 pairs)

julia> o = MerCounting.Counters.dist_mem(DNAMer{31}, ds, MerCounting.CANONICAL)
[ Info: Splitting all reads in ecoli-pe, across 5 processes and counting kmers
[ Info: [Process: 1] Collecting kmers from 4 reads in read datastore ecoli-pe
[ Info: [Process: 1] Collapsing collected kmers into counts
[ Info: [Process: 2] Collecting kmers from 4 reads in read datastore ecoli-pe
[ Info: [Process: 2] Collapsing collected kmers into counts
[ Info: [Process: 3] Collecting kmers from 4 reads in read datastore ecoli-pe
[ Info: [Process: 3] Collapsing collected kmers into counts
[ Info: [Process: 4] Collecting kmers from 4 reads in read datastore ecoli-pe
[ Info: [Process: 4] Collapsing collected kmers into counts
[ Info: [Process: 5] Collecting kmers from 4 reads in read datastore ecoli-pe
[ Info: [Process: 5] Collapsing collected kmers into counts
[ Info: Merging all counts from 5 processes
4916-element Array{MerCount{Mer{DNAAlphabet{2},31}},1}:
 AAAAAAAAGTAATCGTCGGCATGTCCGGCGG occurs 1 times
 AAAAAAAGTAATCGTCGGCATGTCCGGCGGT occurs 1 times
 AAAAAAATAAAATTTAGTGCTGTACAGAGCG occurs 1 times
 AAAAAAGTAATCGTCGGCATGTCCGGCGGTG occurs 1 times
 AAAAAATAAAATTTAGTGCTGTACAGAGCGC occurs 1 times
 AAAAAATCCACCAGCAGCAGGTTGATGTCAG occurs 1 times
 AAAAAATCGACCAGGAGCGGTTGGATGACAG occurs 1 times
 AAAAAATCGTTGCGCTTTTTCTCGCCGCCGG occurs 1 times
 AAAAACATCGCGGGTGTTGAAGAAAAATTAA occurs 1 times
 AAAAACCAACGTTTACCGAACGGGTTAATAA occurs 2 times
 AAAAACTGGTTACTGACACCTGGAATCTCCA occurs 1 times
 AAAAACTTCCTCAAACCAATGCTGATTGTGT occurs 1 times
 AAAAAGCACTATTTCATCCTTGTTGCTGAAG occurs 1 times
 AAAAAGCGCAACGATATTTTGCAAATGGCGG occurs 1 times
 AAAAAGTAATCGTCGGCATGTCCGGCGGTGT occurs 1 times
 AAAAATAAAATTTAGTGCTGTACAGAGCGCG occurs 1 times
 AAAAATCCACCAGCAGCAGGTTGATGTCAGC occurs 1 times
 AAAAATCGACCAGGAGCGGTTGGATGACAGC occurs 1 times
 AAAAATCGTTGCGCTTTTTCTCGCCGCCGGA occurs 1 times
 AAAAATTAACCTATACCGATACCTACGCGCA occurs 1 times
 AAAACAATCCCTTCTATCAATTATATCGGCT occurs 1 times
 AAAACACCGTTATTCATTGCTTCTACCCGTG occurs 1 times
 AAAACAGCGTATTGAAAGCCTTCATGTAAAA occurs 1 times
 AAAACATCGCGGGTGTTGAAGAAAAATTAAC occurs 1 times
 AAAACCAACGTTTACCGAACGGGTTAATAAA occurs 2 times
 ⋮                                             
 TGTTAAGGACTATCTTGTTAAATGCTCGGAA occurs 1 times
 TGTTTATTCGCACCGGCGACGTAGACGGCAA occurs 2 times
 TTAACCTATACCGATACCTACGCGCAGGAAA occurs 1 times
 TTAAGAAGTTGTTAAGGACTATCTTGTTAAA occurs 1 times
 TTACACCGGCGGTGACCAAAAACTTCCTCAA occurs 1 times
 TTACGACCTTGTGTAAACCAGTGGACGCTAA occurs 1 times
 TTACGTGGTGATCACCATTACCCCGAGCGAA occurs 1 times
 TTACTAAGGCCCGTTTTGATGCCGTCGCCAA occurs 1 times
 TTACTGACGTATCATTATCTGGTAGAAAAAA occurs 1 times
 TTAGGGGCCAACGTCGATTGTGACAGCACAA occurs 1 times
 TTATCTGTGTAACCCTTTTGCGTTGATAGAA occurs 1 times
 TTATTCCCCATGCTTCAGCAACAAGGATGAA occurs 1 times
 TTCAATGTCGTGAGTGATCCAATGTCTGAAA occurs 1 times
 TTCAATTTGCCTCCATTGGTGCAACCACCAA occurs 1 times
 TTCCTCCGTTTCTTCCTGGCTGTAGCAACAA occurs 1 times
 TTCGATTCTTCTTTGTCACCGCAGCCAGCAA occurs 1 times
 TTCGGCGTGCGACCGGCTTTATATTCGGCAA occurs 1 times
 TTCTGGCTGAAGAGGTGGTGGAAATTCCCAA occurs 1 times
 TTCTTGCTTAAGAGGTGGTGGAAATTCCCAA occurs 1 times
 TTGCCATTACCTATGTCTACAAAGGTGACAA occurs 1 times
 TTGCGAACAATTATTCCCCATGCTTCAGCAA occurs 1 times
 TTGCGCTTTTTCTCGCCGCCGGAAAAACCAA occurs 2 times
 TTGGCAGCATCTTCTTTGGTGGTTGCACCAA occurs 1 times
 TTGTCCCAACACATGTGCCAGCCGATAAAAA occurs 1 times
 TTTGATTTTCAGGATTTGATGGAAGAGAAAA occurs 1 times

The third more complex counter is on its way - I have code already for the minimizer analysis for deciding on the disk bins.

3 Likes

Hi All,

I’ve been fairly quite for a while so I thought an update might be appropriate.

One of the components that is super important for an assembler / contigger that is truly generally haplotype specific is the ability to project kmer coverage and counts into the graph. Doing this allows the boundaries of assembly sub-problems to be defined and solved iteratively to resolve the genome and simplify the graph.

But projecting kmer coverage is a generally useful tool as well, as are other analyses for example computing kmer spectra.

So KmerCounting.jl is now KmerAnalysis.jl, and will have the following to mark an initial release:

Kmer counter types, that count kmers in an input, and behave somewhat like functors. Here’s an example of the serial and in memory counter now:

struct SerialMemCounter{M<:AbstractMer,C<:CountMode} <: Function
    mer_buffer::Vector{M}
    count_buffer::Vector{MerCount{M}}
end

serial_mem(::Type{M}, count_mode::C) where {M<:AbstractMer,C<:CountMode} = SerialMemCounter{M,C}(Vector{M}(), Vector{MerCount{M}}())

function count!(smc::SerialMemCounter{M,C}, input, args...) where {M<:AbstractMer,C<:CountMode}
    _serial_mem!(smc.count_buffer, smc.mer_buffer, C(), input, args...)
end

function (smc::SerialMemCounter{M,C})(input, args...) where {M<:AbstractMer,C}
    count!(smc, input, args...)
    return copy(smc.count_buffer)
end

function _serial_mem!(count_buffer::Vector{MerCount{M}}, mer_buffer::Vector{M}, count_mode::CountMode, input, args...) where {M<:AbstractMer}
    collect_mers!(mer_buffer, count_mode, input, args...)
    collapse_into_counts!(count_buffer, mer_buffer)
end

I think defining kmer counters this way has a few advantages. First they can be passed around and treated like functions that customize the behaviour of an analysis. For example to customize how the counting is done during a kmer spectra analysis:

c = serial_mem(DNAMer{31}, CANONICAL)
speccy = spectra(c, my_reads)

Where the spectra method is something like:

function spectra(counter::C, input, min_count::Integer = 0) where {C<:AbstractKmerCounter}
    return spectra(counter(input), min_count)
end

These counter types could also edit the behaviour of code they are passed through simply by altering dispatch too, not just by being called as a function.

I think this is quite a nice API for this kind of thing, but I know it is possible to produce something the compiler finds hard to optimize when passing functions about so any reviews of the package when it’s released are most welcome.

As I hinted the package will also provide functionality to compute kmer spectra. These are useful to check the distribution of kmer frequencies in the reads before attempting a genome assembly. For example:

A plot of a kmer spectra from reads like this shows me the main sampling distribution and the errors in the 1 column are very distinct cutting off the kmer count at a threshold of 2 or three for DBG construction should retain most of the information, and discard most error.

You can also have 2D kmer spectra. They are useful for producing spectra copy number plots
which should be your first check that an assembly, at any stage is any good. Indeed we tell people at our institute “Spectra CN or your assembly didn’t happen”

spec = spectra(kc_reads, kc_genom)

# Basic recipe for cn plot
heights = cumsum(KmerAnalysis._find_plottable_subset(spec); dims = 2)
s = barplot([Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,7])], color = :orange)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,6])], color = :yellow)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,5])], color = :blue)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,4])], color = :green)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,3])], color = :purple)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,2])], color = :red)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,1])], color = :black)
xlabel!(s, "K-mer multiplicity")
ylabel!(s, "Number of distinct k-mers")

It needs a legend but the colours basically correspond to how many times a kmer appears in the genome, (0 = black, red = 1, orange = 2 etc…) but this CN plot tells me that since the first (and only) peak (which corresponds to the first (and only) haplotype) is coloured red, the kmers from that peak appear once in the genome. Then the part of the spectra corresponding to sequencing errors is black - they appear 0 times in the genome, which is great because before you consider contiguity or N50 or whatever, you should first make sure you assembly has the correct motifs, the right number of times, and in the right order. The cn plot validates the first two criteria.

So the spectra and cn plots will be important for people working with the GenomeGraphs API to check as they go that they haven’t lost any genome content or even made some up which does happen!

Genomes with more than one haplotype, diploids, triploids and so on, get a bit more complex but the cn plot is simple enough to interpret - every peak corresponds to a level of coverage 1x, 2x and so on, and so a plot from a good haplotype-specific assembly will colour each of the peaks the correct colour.

Here’s an example from a decent diploid assembly where the assembler collapses bubbles (heterozyogous content) and basically chooses one of the two variants. Note the black in the first peak is the bubble halfs the assembler is discarding.

Whereas a correct haplotype specific assembler would look something like this:

So the ability to do these checks is provided in KmerAnalysis.jl. This will be the first package I’ve done where I’ve had to really try to do some nice visualisation, and I’m liking Makie so for, there are a few things regarding recipe’s I don’t get or can’t do - I’ve found I can edit x and y labels in plots, I don’t know if I can add colour legends yet, because recipes work on plots, not scenes.

The other datatype I’ve been working on is IndexedCounts.
So the idea with indexed counts is you first construct one from some sequences you want to use as an index (normally a genome assembly or other reference genome). Then you add many counts from different data sources - short reads, long reads, reads from different individuals and so on.

julia> KmerAnalysis.index!(ikc, genome) # TODO make this happen during construction.
[ Info: Indexing kmer counts of reference sequences
[ Info: Populating index with reference sequence mers
[ Info: Sorting the index
[ Info: Counting and collapsing mers in index
31-mer Indexed Counts Datastore 'ecoli': 1 stored 31-mer counts

julia> # Add a count called :pe_reads to the indexed counts collection.
       add_count!(ikc, :pe_reads, reads)

You can then project the counts from an IndexedCount onto some new query sequence.
Here I project the counts from the reads, back onto the first sequence of the genome:

lines(project_count(ikc, :pe_reads, genome[1]))

Most of the sequence is covered by the read kmers between 10 and 20 times, and there are some very high portions, overrepresented or very common motifs perhaps, and some regions clearly more covered than others. Projecting kmer count data over genome graph sequences in this way, and also - more importantly - it’s transformed KCI or Kmer Count Index equivalent (more on that later) is a central and critical tool to both do and describe and assess a haplotype-specific assembly graph.

Anyway KmerAnalysis.jl is comming given its utility in GenomeGraphs.jl and elsewhere I hope people end up taking a look and don’t hesitate to improve it or point out things to be improved.

8 Likes

Hi, I’m testing GenomeGraphs v0.2.0 for a very small and simple sequence assembly, however, cannot view assembled sequences.

using GenomeGraphs
using ReadDatastores
using FASTX
using CodecZlib

fwq = FASTQ.Reader(GzipDecompressorStream(open("TB08-Pos-OnBead_S7_L001_R1_001.fastq.gz")))
rvq = FASTQ.Reader(GzipDecompressorStream(open("TB08-Pos-OnBead_S7_L001_R2_001.fastq.gz")))

#Prepare the sequencing reads & build a datastore.
ds = PairedReads{DNAAlphabet{4}}(fwq, rvq, "test-paired", "my-test", 50, 200, 0, FwRv)
#Add the read datastore to a WorkSpace.
ws = WorkSpace()
add_paired_reads!(ws, ds)
#Run the dbg process.
dbg!(ws, BigDNAMer{63}, 10, Symbol("my-test"))
#Run the tip removal process.
remove_tips!(ws, 200)

n = node(ws, 3)

All commands above are working, except the last.

help?> node
search: NodeView ncodeunits QuoteNode isunordered transcode parentmodule RoundingMode CanonicalIndexError

Couldn't find node
Perhaps you meant nor, name, one, Core, Some, mod, mod1, modf, ntoh, invoke, names, nand, new, angle, ones or undef
  No documentation found.

  Binding node does not exist.

Your help is highly appreciated! I look forward to seeing one Julia tool can work on one of the most challenging tasks in bioinformatics! Thank you!