Was that in a fresh new project? Can I see what happens upon Pkg add? GenomeGraphs currently doesen’t depend on BioAlignments so I’m a bit confused without seeing the console msges.
It works in a fresh environment yes, and it’s indeed BioAlignments that is causing an incompatibility (I wanted to get reads from a bam file).
Hello lovely fellow Julians!
I haven’t left an update since January so I thought I had better fill you in on progress on the genome graph framework.
ReadDatastores has gotten an update and a new release pending which allows you to choose to create stores for LongSequence{DNAAlphabet{2}}
s or LongSequence{DNAAlphabet{4}}
s now, rather than forcing you to use only DNAAlphabet{4}
- so if you know your reads dont have IUPAC ambiguities, this will likely save you a lot more disk-space and may be faster (haven’t benchmarked yet).
A bug fix has also made it’s way into ReadDatastores which should reduce the number of buffer refreshes that occur, when streaming sequences from a buffered datastore.
Now ReadDatastores is in a state I like a bit more, I’m going to first update GenomeGraphs to use the new ReadDatastores and begin breaking out some of the Kmer counter data-structures, as the ability to project kmer counts and coverage over sequences in general - not just graphs - is - I’m finding - an insanely useful thing to do!
Hey guys, some progress and updates on the genome assembly and genome graph analysis framework:
I took the time to start breaking out the kmer counting functionality - barebones as it was - into its own package MerCounting.jl.
Currently the serial and all in RAM method of counting kmers that is used for GenomeGraphs.jl’s deBruijn graph builder made it in, but I wanted a more serious kmer counter algorithm, or rather a more serious choice.
So I’m thinking of 3 counter methods to begin with: 1). The serial-in-RAM one (called serial_mem) that already exists, for when the problem is trivial or memory isn’t and issue - I have access to a cluster configured for large memory requirements rather than speed, and sometimes its quicker and easier to just throw a big vector at the thing rather than spend time on something more clever. 2). A distributed-in-RAM one that uses “Distributed” (called dist_mem) - again not so concerned with memory, but speeds up serial_mem by cutting the ReadDatastores into equal-ish chunks and processes them according to serial_mem and then combines the result. I guess if your julia processes were spread across multiple machines in a cluster or such, this could help with memory. 3). A distributed counter that uses disk batches. This one does a pre-analysis of a ReadDatastore to look at the distribution of minimizers, and will allocate N “bins” that the minimizers are distributed across as evenly as possible. Then multiple processes will divide up the reads, kmerize them, and throw the kmers into RemoteChannels (bins) based on their minimizer. These channels are read by a series of writer processes that send the kmer to a disk file.
These bin files can then be processes individually and in parallel, and unlike the other two counting methods, which return a vector of sroted counts, this will probably return a KmerCountDatastore or some other disk backed store of the kmer counts (similar to the ReadDatastores I guess).
Anyway here’s a preview of testing that dist_mem works:
julia> using Distributed, BioSequences, ReadDatastores
julia> addprocs(4, exeflags="--project=.")
4-element Array{Int64,1}:
2
3
4
5
julia> @everywhere using MerCounting
julia> ds = buffer(@openreads "ecoli-pe.prseq")
Buffered Paired Read Datastore 'ecoli-pe': 20 reads (10 pairs)
julia> o = MerCounting.Counters.dist_mem(DNAMer{31}, ds, MerCounting.CANONICAL)
[ Info: Splitting all reads in ecoli-pe, across 5 processes and counting kmers
[ Info: [Process: 1] Collecting kmers from 4 reads in read datastore ecoli-pe
[ Info: [Process: 1] Collapsing collected kmers into counts
[ Info: [Process: 2] Collecting kmers from 4 reads in read datastore ecoli-pe
[ Info: [Process: 2] Collapsing collected kmers into counts
[ Info: [Process: 3] Collecting kmers from 4 reads in read datastore ecoli-pe
[ Info: [Process: 3] Collapsing collected kmers into counts
[ Info: [Process: 4] Collecting kmers from 4 reads in read datastore ecoli-pe
[ Info: [Process: 4] Collapsing collected kmers into counts
[ Info: [Process: 5] Collecting kmers from 4 reads in read datastore ecoli-pe
[ Info: [Process: 5] Collapsing collected kmers into counts
[ Info: Merging all counts from 5 processes
4916-element Array{MerCount{Mer{DNAAlphabet{2},31}},1}:
AAAAAAAAGTAATCGTCGGCATGTCCGGCGG occurs 1 times
AAAAAAAGTAATCGTCGGCATGTCCGGCGGT occurs 1 times
AAAAAAATAAAATTTAGTGCTGTACAGAGCG occurs 1 times
AAAAAAGTAATCGTCGGCATGTCCGGCGGTG occurs 1 times
AAAAAATAAAATTTAGTGCTGTACAGAGCGC occurs 1 times
AAAAAATCCACCAGCAGCAGGTTGATGTCAG occurs 1 times
AAAAAATCGACCAGGAGCGGTTGGATGACAG occurs 1 times
AAAAAATCGTTGCGCTTTTTCTCGCCGCCGG occurs 1 times
AAAAACATCGCGGGTGTTGAAGAAAAATTAA occurs 1 times
AAAAACCAACGTTTACCGAACGGGTTAATAA occurs 2 times
AAAAACTGGTTACTGACACCTGGAATCTCCA occurs 1 times
AAAAACTTCCTCAAACCAATGCTGATTGTGT occurs 1 times
AAAAAGCACTATTTCATCCTTGTTGCTGAAG occurs 1 times
AAAAAGCGCAACGATATTTTGCAAATGGCGG occurs 1 times
AAAAAGTAATCGTCGGCATGTCCGGCGGTGT occurs 1 times
AAAAATAAAATTTAGTGCTGTACAGAGCGCG occurs 1 times
AAAAATCCACCAGCAGCAGGTTGATGTCAGC occurs 1 times
AAAAATCGACCAGGAGCGGTTGGATGACAGC occurs 1 times
AAAAATCGTTGCGCTTTTTCTCGCCGCCGGA occurs 1 times
AAAAATTAACCTATACCGATACCTACGCGCA occurs 1 times
AAAACAATCCCTTCTATCAATTATATCGGCT occurs 1 times
AAAACACCGTTATTCATTGCTTCTACCCGTG occurs 1 times
AAAACAGCGTATTGAAAGCCTTCATGTAAAA occurs 1 times
AAAACATCGCGGGTGTTGAAGAAAAATTAAC occurs 1 times
AAAACCAACGTTTACCGAACGGGTTAATAAA occurs 2 times
⋮
TGTTAAGGACTATCTTGTTAAATGCTCGGAA occurs 1 times
TGTTTATTCGCACCGGCGACGTAGACGGCAA occurs 2 times
TTAACCTATACCGATACCTACGCGCAGGAAA occurs 1 times
TTAAGAAGTTGTTAAGGACTATCTTGTTAAA occurs 1 times
TTACACCGGCGGTGACCAAAAACTTCCTCAA occurs 1 times
TTACGACCTTGTGTAAACCAGTGGACGCTAA occurs 1 times
TTACGTGGTGATCACCATTACCCCGAGCGAA occurs 1 times
TTACTAAGGCCCGTTTTGATGCCGTCGCCAA occurs 1 times
TTACTGACGTATCATTATCTGGTAGAAAAAA occurs 1 times
TTAGGGGCCAACGTCGATTGTGACAGCACAA occurs 1 times
TTATCTGTGTAACCCTTTTGCGTTGATAGAA occurs 1 times
TTATTCCCCATGCTTCAGCAACAAGGATGAA occurs 1 times
TTCAATGTCGTGAGTGATCCAATGTCTGAAA occurs 1 times
TTCAATTTGCCTCCATTGGTGCAACCACCAA occurs 1 times
TTCCTCCGTTTCTTCCTGGCTGTAGCAACAA occurs 1 times
TTCGATTCTTCTTTGTCACCGCAGCCAGCAA occurs 1 times
TTCGGCGTGCGACCGGCTTTATATTCGGCAA occurs 1 times
TTCTGGCTGAAGAGGTGGTGGAAATTCCCAA occurs 1 times
TTCTTGCTTAAGAGGTGGTGGAAATTCCCAA occurs 1 times
TTGCCATTACCTATGTCTACAAAGGTGACAA occurs 1 times
TTGCGAACAATTATTCCCCATGCTTCAGCAA occurs 1 times
TTGCGCTTTTTCTCGCCGCCGGAAAAACCAA occurs 2 times
TTGGCAGCATCTTCTTTGGTGGTTGCACCAA occurs 1 times
TTGTCCCAACACATGTGCCAGCCGATAAAAA occurs 1 times
TTTGATTTTCAGGATTTGATGGAAGAGAAAA occurs 1 times
The third more complex counter is on its way - I have code already for the minimizer analysis for deciding on the disk bins.
Hi All,
I’ve been fairly quite for a while so I thought an update might be appropriate.
One of the components that is super important for an assembler / contigger that is truly generally haplotype specific is the ability to project kmer coverage and counts into the graph. Doing this allows the boundaries of assembly sub-problems to be defined and solved iteratively to resolve the genome and simplify the graph.
But projecting kmer coverage is a generally useful tool as well, as are other analyses for example computing kmer spectra.
So KmerCounting.jl
is now KmerAnalysis.jl
, and will have the following to mark an initial release:
Kmer counter types, that count kmers in an input, and behave somewhat like functors. Here’s an example of the serial and in memory counter now:
struct SerialMemCounter{M<:AbstractMer,C<:CountMode} <: Function
mer_buffer::Vector{M}
count_buffer::Vector{MerCount{M}}
end
serial_mem(::Type{M}, count_mode::C) where {M<:AbstractMer,C<:CountMode} = SerialMemCounter{M,C}(Vector{M}(), Vector{MerCount{M}}())
function count!(smc::SerialMemCounter{M,C}, input, args...) where {M<:AbstractMer,C<:CountMode}
_serial_mem!(smc.count_buffer, smc.mer_buffer, C(), input, args...)
end
function (smc::SerialMemCounter{M,C})(input, args...) where {M<:AbstractMer,C}
count!(smc, input, args...)
return copy(smc.count_buffer)
end
function _serial_mem!(count_buffer::Vector{MerCount{M}}, mer_buffer::Vector{M}, count_mode::CountMode, input, args...) where {M<:AbstractMer}
collect_mers!(mer_buffer, count_mode, input, args...)
collapse_into_counts!(count_buffer, mer_buffer)
end
I think defining kmer counters this way has a few advantages. First they can be passed around and treated like functions that customize the behaviour of an analysis. For example to customize how the counting is done during a kmer spectra analysis:
c = serial_mem(DNAMer{31}, CANONICAL)
speccy = spectra(c, my_reads)
Where the spectra method is something like:
function spectra(counter::C, input, min_count::Integer = 0) where {C<:AbstractKmerCounter}
return spectra(counter(input), min_count)
end
These counter types could also edit the behaviour of code they are passed through simply by altering dispatch too, not just by being called as a function.
I think this is quite a nice API for this kind of thing, but I know it is possible to produce something the compiler finds hard to optimize when passing functions about so any reviews of the package when it’s released are most welcome.
As I hinted the package will also provide functionality to compute kmer spectra. These are useful to check the distribution of kmer frequencies in the reads before attempting a genome assembly. For example:
A plot of a kmer spectra from reads like this shows me the main sampling distribution and the errors in the 1 column are very distinct cutting off the kmer count at a threshold of 2 or three for DBG construction should retain most of the information, and discard most error.
You can also have 2D kmer spectra. They are useful for producing spectra copy number plots
which should be your first check that an assembly, at any stage is any good. Indeed we tell people at our institute “Spectra CN or your assembly didn’t happen”
spec = spectra(kc_reads, kc_genom)
# Basic recipe for cn plot
heights = cumsum(KmerAnalysis._find_plottable_subset(spec); dims = 2)
s = barplot([Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,7])], color = :orange)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,6])], color = :yellow)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,5])], color = :blue)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,4])], color = :green)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,3])], color = :purple)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,2])], color = :red)
barplot!(s, [Point2f0(i - 1, j) for (i, j) in enumerate(heights[:,1])], color = :black)
xlabel!(s, "K-mer multiplicity")
ylabel!(s, "Number of distinct k-mers")
It needs a legend but the colours basically correspond to how many times a kmer appears in the genome, (0 = black, red = 1, orange = 2 etc…) but this CN plot tells me that since the first (and only) peak (which corresponds to the first (and only) haplotype) is coloured red, the kmers from that peak appear once in the genome. Then the part of the spectra corresponding to sequencing errors is black - they appear 0 times in the genome, which is great because before you consider contiguity or N50 or whatever, you should first make sure you assembly has the correct motifs, the right number of times, and in the right order. The cn plot validates the first two criteria.
So the spectra and cn plots will be important for people working with the GenomeGraphs API to check as they go that they haven’t lost any genome content or even made some up which does happen!
Genomes with more than one haplotype, diploids, triploids and so on, get a bit more complex but the cn plot is simple enough to interpret - every peak corresponds to a level of coverage 1x, 2x and so on, and so a plot from a good haplotype-specific assembly will colour each of the peaks the correct colour.
Here’s an example from a decent diploid assembly where the assembler collapses bubbles (heterozyogous content) and basically chooses one of the two variants. Note the black in the first peak is the bubble halfs the assembler is discarding.
Whereas a correct haplotype specific assembler would look something like this:
So the ability to do these checks is provided in KmerAnalysis.jl
. This will be the first package I’ve done where I’ve had to really try to do some nice visualisation, and I’m liking Makie so for, there are a few things regarding recipe’s I don’t get or can’t do - I’ve found I can edit x and y labels in plots, I don’t know if I can add colour legends yet, because recipes work on plots, not scenes.
The other datatype I’ve been working on is IndexedCounts.
So the idea with indexed counts is you first construct one from some sequences you want to use as an index (normally a genome assembly or other reference genome). Then you add many counts from different data sources - short reads, long reads, reads from different individuals and so on.
julia> KmerAnalysis.index!(ikc, genome) # TODO make this happen during construction.
[ Info: Indexing kmer counts of reference sequences
[ Info: Populating index with reference sequence mers
[ Info: Sorting the index
[ Info: Counting and collapsing mers in index
31-mer Indexed Counts Datastore 'ecoli': 1 stored 31-mer counts
julia> # Add a count called :pe_reads to the indexed counts collection.
add_count!(ikc, :pe_reads, reads)
You can then project the counts from an IndexedCount onto some new query sequence.
Here I project the counts from the reads, back onto the first sequence of the genome:
lines(project_count(ikc, :pe_reads, genome[1]))
Most of the sequence is covered by the read kmers between 10 and 20 times, and there are some very high portions, overrepresented or very common motifs perhaps, and some regions clearly more covered than others. Projecting kmer count data over genome graph sequences in this way, and also - more importantly - it’s transformed KCI or Kmer Count Index equivalent (more on that later) is a central and critical tool to both do and describe and assess a haplotype-specific assembly graph.
Anyway KmerAnalysis.jl is comming given its utility in GenomeGraphs.jl and elsewhere I hope people end up taking a look and don’t hesitate to improve it or point out things to be improved.
Hi, I’m testing GenomeGraphs v0.2.0 for a very small and simple sequence assembly, however, cannot view assembled sequences.
using GenomeGraphs
using ReadDatastores
using FASTX
using CodecZlib
fwq = FASTQ.Reader(GzipDecompressorStream(open("TB08-Pos-OnBead_S7_L001_R1_001.fastq.gz")))
rvq = FASTQ.Reader(GzipDecompressorStream(open("TB08-Pos-OnBead_S7_L001_R2_001.fastq.gz")))
#Prepare the sequencing reads & build a datastore.
ds = PairedReads{DNAAlphabet{4}}(fwq, rvq, "test-paired", "my-test", 50, 200, 0, FwRv)
#Add the read datastore to a WorkSpace.
ws = WorkSpace()
add_paired_reads!(ws, ds)
#Run the dbg process.
dbg!(ws, BigDNAMer{63}, 10, Symbol("my-test"))
#Run the tip removal process.
remove_tips!(ws, 200)
n = node(ws, 3)
All commands above are working, except the last.
help?> node
search: NodeView ncodeunits QuoteNode isunordered transcode parentmodule RoundingMode CanonicalIndexError
Couldn't find node
Perhaps you meant nor, name, one, Core, Some, mod, mod1, modf, ntoh, invoke, names, nand, new, angle, ones or undef
No documentation found.
Binding node does not exist.
Your help is highly appreciated! I look forward to seeing one Julia tool can work on one of the most challenging tasks in bioinformatics! Thank you!