ANN - Towards a (Bio)Julia powered Genome Graphs framework

Hello everyone, the GSoC period for the year is nearly over, and for anyone who was unaware, BioJulia had a student this year: Arda Akdemir has been helping the project under my supervision, working on a de-Bruijn graph constructor from a list of canonical kmers. Arda has also been working on error correction of such graphs in the form of bubble popping and tip-clipping.

Some people who have watched my JuliaCon2018 talk might remember I ended my rant by making noises about genome graph data structures being the natural representation of a genome assembly, and that we should be moving an iterative model of bioinformatics of accumulating more and more de-novo assembled individuals, and revisiting the analyses, models, and hypotheses every time a new genome is added (no one genome is more privileged) - cf. the hierarchical, linear, reference-based, pipeline-like workflows that are common in genomics presently.

Well, as Arda’s GSoC comes to a close, the group I work in at EI has also submitted the paper for our latest C++ library/framework we use for working with complex polyploid genome graphs, to F1000.

With these two occurrences, the stuff I talked about at the 2018 con is about to get a lot more concrete - actual julia code exists now and more is on the way. Julia is getting its own native flavour of the framework. So I thought I would open this thread where I can make some announcements, and keep anyone who is interested up to date.

16 Likes

One of my first announcements on this thread is that the first component nessecery for the framework is complete, and will be documented with a manual, and then released on the BioJulia package registry within the next week.

ReadDatastores.jl provides types that efficiently index and store different types of sequencing reads on disk. These datastores are important because read datasets can be really quite large. Many tools get around this problem by streaming FASTQ files. But this strategy gets complex and undesirable when you want to do certain things with the reads.

Currently ReadDatastores has 3 kinds of datastore: LinkedReads, LongReads, and PairedReads. Each datastore stores and indexes the reads based on their purpose and the kinds of information those reads contain. For example paired reads are stored such that each paired read is situated next to its counter part, for LinkedReads (10x) the pairs are also stored grouped by their tags. This is because for such types of sequencing read, you normally are not interested in just one read, but pairs or groups that together help you resolve the structure of your genome.

Iterating over the sequences in one of these datastores is typically faster that iterating over a FASTA or FASTQ file using FASTX.jl or BioSequences.jl, as the sequence data is stored compressed using the bit encodings defined by BioSequences.jl.

Sequential iteration over these datastores is also made quicker still by sacrificing some RAM to wrap a buffer around the datastore for seq in buffer(ds) ,where ds is your datastore. You can also load data from a datastore into pre-allocated sequence objects if you want to minimise mutable type instance constructions and re-use your variables.

Anyway, take a look, the basic tests are now all passing, all that remains is some more testing, and some documentation before a first release can be cut.

9 Likes

ReadDatastores has been released as a v0.1.

So, here’s the plan, and also a question for you:

The repository currently named BioSequenceGraphs.jl is going to be the new core package of this framework.

It’s structure is going to be something like the following:

BioSequenceGraphs defines a Workspace (name up for debate) type representing the state of an assembled graph genome, it will be saveable and loadable to/from disk, and will bind together:

  • A SequenceDistanceGraph of the genome - done - GSoC.
  • A number of ReadDatastores - v0.1 done.
  • A number of ReadMappers - ToDo
  • A KmerCount store that stores indexed kmer counts from the graph, and the read datastores - ToDo
  • Zero or more extra distance graphs that contain extra linkage information between nodes in the SequenceDistanceGraph - Todo.

A NodeView and a LinkView type will allow for a single-entry point for node-centric analyses.
A NodeView from either a Workspace is a wrapper containing a ref to the underlying SequenceDistanceGraph and a node id. It will provide access to its nodes’ previous and next linked nodes, as well as info about the node’s mapped reads and/or k-mer coverage. A user with good understanding of the NodeView type should be able to access most information in the WorkSpace through it, making it the default choice for analysing the graph.

This framework forms a springboard for a whole ton of bioinformatics applications from assembly scaffolding to gene finding to RNASeq analysis. All brought together under one consistent genome graph framework.

I was going to rename BioSequenceGraphs.jl as it’s a bit of a mouthful. I was going to call it GenomeGraphs.jl, but wondered if that might a). Make some think it is a plotting package, b). confuse people as no GenomeGraph named type is actually exported from the package (and people shouldn’t be touching the SequenceDistanceGraph inside the workspace too much unless they’re trying to do something really custom - most interaction should be at the Workspace level). Yes, the package is about doing analysis in a framework in which the genome assemblies analysed are in their natural graph representation, but it’s not just about graphs: graphs are only one component/player in this system, which is much more about correctly projecting biological information over the assembly’s natural representation, getting away from the hierarchical reference bias prone, collapsed, linear reference-based analysis designs. So… renaming suggestions for BioSequenceGraphs.jl are welcome. In the meantime, I’m off to get some kmer counting implemented!

2 Likes

I think GenomeGraphs.jl is fine - LightGraphs.jl and family aren’t confused with Plotting stuff, right? It’s a bit against convention if there’s no GenomeGraph type, but I don’t think it’s confusing really. Plots.jl doesn’t export Plots.

This however changes my mind a bit. If the package isn’t actually about graphs, the package shouldn’t have a name that suggests it is. What about just BioGenomes.jl or simply Genomes.jl? Then again, it sounds like the framework overall actually spans multiple packages, and if the principle contribution of what is currently BioSequenceGraphs.jl is the mechanisms for working with the graph representation of genome assemblies, I think GenomeGraphs.jl is still fine.

2 Likes

At the moment my rule of thumb is actually to just develop the components in BioSequenceGraphs.jl, until the components and their API have been refined to a point of generality, justifying their own package - ReadDatastores was obviously a seperate package from the outset for me, since it was clearly generally useful, and not tied the the SequenceDistanceGraph type. I think mappers and kmer coverage utilities will get to that point eventually, but will start their v0.1 until v1.0 stage of life out as graph specific (and become less so as 1.0 is approached) as I get early versions of the whole system together in a useful state.

Made some nice progress today on kmer counting from a paired reads datastore, to get the canonical kmer specra required to build the first de-bruijn graph of a new Workspace.

I’ve designed the basic kernel for the job being a serial function that accepts a datastore name, a UnitInterval of read ids, and a “batch size”. The kernel gives a buffer vector of kmers a sizehint equal to the batch_size. Then, whilst the length of the vector is < batch size, a loop gets a read, and put’s all its kmers in the vector, gets the next read, and so on.

Once the batch is big enough, those kmers are sorted and collapsed into a vector of (kmer, count) pairs.
This vector of pairs is then merged into another vector of pairs that is destined to be the output, growing it.

This basic kernel function offers quite a few possiblities from there. For parallelism, a datastore can have it’s n reads divied up equally over j processes, and the (kmer,count) lists produced from each could then be merged together. Using such a strategy all counting is done in memory, but across multiple processes. If memory is limiting, another strategy could be for the processes to use this kernel function to make a series of (kmer, count) lists that are dumped to disk, and then once the worker processes are done, these dumps can be merged on the master process.

Another alternative could be to use RemoteChannels to send (kmer, count) list batches to the master process for merging periodically, although too frequent communication might be quite the overhead, the question is whether it is as much overhead as writing dumps to disk and then merging them.

Whatever the strategy, it should be fairly simple to achieve by dividing the reads in a datastore up, and composing the kernel function I drafted today, in some configuration across processes.

4 Likes

The framework is already very close to being able to build a basic assembly for smaller genomes: Here I try single process, serial kmer counting of a paired end ecoli library:

julia> using BioSequenceGraphs
[ Info: Recompiling stale cache file /Users/bward/.julia/compiled/v1.1/BioSequenceGraphs/YUHPz.ji for BioSequenceGraphs [3aa97884-90e5-11e8-2134-af657ec4f203]

julia> ds = open(PairedReads, "ecoli-pe-test.prseq")
Paired Read Datastore 'ecoli-pe-test': 988980 reads (494490 pairs)

julia> @time merlist = BioSequenceGraphs.build_freq_list(DNAKmer, buffer(ds), 1:988980)
 59.412060 seconds (774.65 k allocations: 1.951 GiB, 0.57% gc time)
49411919-element Array{BioSequenceGraphs.MerFreq{Mer{BioSequences.DNAAlphabet{2},31}},1}:
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA occurs 106 times
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC occurs 5 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT occurs 3 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAACA occurs 5 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAACC occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAACT occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAATA occurs 3 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAATT occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAACAA occurs 5 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAACCA occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAACTA occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAGAA occurs 2 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAATAA occurs 3 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAATTA occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAACAAA occurs 4 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAACAAC occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAACACA occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAACCAA occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAACTAC occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAAGAAA occurs 3 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAATAAA occurs 3 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAAATTAA occurs 1 times  
 AAAAAAAAAAAAAAAAAAAAAAAAAACAAAA occurs 4 times  
 ⋮                                               
 TTTTTTTTATTTTAAATTTAATGAAAAAAAA occurs 1 times  
 TTTTTTTTATTTTTTAAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTGTTAAGTAAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTAAAAACAATAAGAAAAAAAAAA occurs 1 times  
 TTTTTTTTTAATAAAAAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTAATACAAAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTAATTATTTTAAATAAAAAAAAA occurs 1 times  
 TTTTTTTTTACAAACAAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTGTTAAGTAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTAATAAAAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTAATACAAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTACAAACAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTGTTAAGTAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTAATAAAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTAATACAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTACAAACAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTACTTAACAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTGTTAAGTAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTGTTTGTAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTTAATAAAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTTAATACAAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTTACAAACAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTTACTTAACAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTTGTTTGTAAAAAAAAAAAAA occurs 1 times  
 TTTTTTTTTTTTTAATACAAAAAAAAAAAAA occurs 1 times  

julia> using Makie

julia> freq_hist = hist(merlist, 2) # Only include the kmers that occured 2 times or more (plotting y axis is riddiculous otherwise)
BioSequenceGraphs.MerFreqHist(UInt64[0x0000000000000000, 0x000000000011dc2d, 0x0000000000017064, 0x0000000000003522, 0x0000000000000f5f, 0x00000000000006df, 0x00000000000002f0, 0x0000000000000257, 0x0000000000000195, 0x00000000000000fd  …  0x0000000000000028, 0x000000000000001d, 0x0000000000000025, 0x0000000000000016, 0x000000000000001a, 0x0000000000000020, 0x000000000000001d, 0x000000000000001b, 0x0000000000001e8c, 0x0000000000000000], 0x02)

julia> barplot(freq_hist.data[1:100])

Your basic frequency histogram: X axis is count of kmer in the reads, Y axis is the number of kmers in the reads with said count

Fairly nice seperation of erroneous motifs (spike near x=1) from the main sampling distribution there, so a first de-bruijn graph built from this kmer list, with a cutoff of the kmers that appeared at least 10-15 times should make a nice starting assembly.

This could be fed to the functions built during the GSoC. But rather than do it manually step by step with datastores and so on, eventually just using a so-called “process method” on the workspace called “dbg” will go through all the steps to give you a workspace with your first graph.

2 Likes

Here’s a taste of what the dbg process method will feel like.

In this framework, the docs and API will describe some functions/methods as “process” functions or simply just “processes”. These methods are basically as high level as you can get in the framework. They implement some sequence/pipeline of operations on the workspace that are incredibly common (and so rather than script them every time, you probably just want to call a single command and then go for a coffee!.. or go home, some processes might be long running).

dbg will accept a kmer type, a minimum frequency cutoff, and the filename of a paired-read datastore. As output, it will give you a WorkSpace, with an initial compressed de-bruijn graph (also called a unitig graph) of all the kmers in the reads that had a frequency exceeding the minimum frequency cutoff.

Here’s an illustration of how that might look:

julia> using BioSequenceGraphs

julia> ws = dbg(DNAKmer, 15, "ecoli-pe-test.prseq")
[ Info: Counting kmers in datastore
[ Info: Constructing compressed de-bruijn graph from 4554902 31-mers
[ Info: Constructing unitigs from 4554902 31-mers
┌ Warning: Some kmers have not been incorporated into unitigs. This may be a case of the circle problem
│   kmerlist[(!).(used_kmers)] =
│    1-element Array{Mer{BioSequences.DNAAlphabet{2},31},1}:
│     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
└ @ BioSequenceGraphs ~/repos/software/BioJulia/SequenceGraphs/src/processes/dbg.jl:182
[ Info: Constructed 2280 unitigs
[ Info: Identifying the 30bp (K - 1) overlaps between 2280 unitigs
[ Info: Linking 2280 unitigs by their 30bp (K - 1) overlaps
[ Info: Done constructing compressed de-bruijn graph from 4554902 31-mers
Graph Genome workspace
 Sequence distance graph (2280 nodes)
 Paired Read Datastores (1):
  'ecoli-pe-test': 988980 reads (494490 pairs)
 Long Read Datastores (0):
 Linked Read Datastores (0):


julia> const BANDAGE_BIN = "/Applications/Bandage.app/Contents/MacOS/Bandage"
"/Applications/Bandage.app/Contents/MacOS/Bandage"

julia> function show_graph(gr)
           filename = tempname()
           BioSequenceGraphs.dump_to_gfa1(gr, filename)
           run(`$BANDAGE_BIN image $filename.gfa $filename.png --height 500`)
           run(`open $filename.png`)
       end
show_graph (generic function with 1 method)

julia> show_graph(ws.sdg)
Process(`open /var/folders/h0/fmczc7390c7f_ddk7yxc47h801tjlx/T/juliaKicOIh.png`, ProcessExited(0))


This ecoli graph is a little messier than what I’m used to seeing out of the w2rap-contigger, which we still use currently to give us a base graph. But you know what, just going by eye, it’s not bad for a raw graph! Without any error correction, tip-clipping or additional contigging, I can already make out the very long loops I’m used to seeing every time I run w2rap on an Ecoli dataset. The dbg process does the equivalent of running the first 2 steps (…well, the first step and a bit of the second step) of w2rap (and there are 8 contigging steps, so yeah, of course dbg makes a messier graph).

Processes may also have a ! version which edits a workspace:

julia> using BioSequenceGraphs
[ Info: Recompiling stale cache file /Users/bward/.julia/compiled/v1.1/BioSequenceGraphs/YUHPz.ji for BioSequenceGraphs [3aa97884-90e5-11e8-2134-af657ec4f203]

julia> ws = WorkSpace()
Graph Genome workspace
 Sequence distance graph (0 nodes)
 Paired Read Datastores (0):
 Long Read Datastores (0):
 Linked Read Datastores (0):


julia> add_paired_reads!(ws, "ecoli-pe-test.prseq")
Graph Genome workspace
 Sequence distance graph (0 nodes)
 Paired Read Datastores (1):
  'ecoli-pe-test': 988980 reads (494490 pairs)
 Long Read Datastores (0):
 Linked Read Datastores (0):


julia> BioSequenceGraphs.dbg!(DNAKmer, 15, ws, "ecoli-pe-test")
[ Info: Counting kmers in datastore
[ Info: Constructing compressed de-bruijn graph from 4554902 31-mers
[ Info: Constructing unitigs from 4554902 31-mers
┌ Warning: Some kmers have not been incorporated into unitigs. This may be a case of the circle problem
│   kmerlist[(!).(used_kmers)] =
│    1-element Array{Mer{BioSequences.DNAAlphabet{2},31},1}:
│     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
└ @ BioSequenceGraphs ~/repos/software/BioJulia/SequenceGraphs/src/processes/dbg.jl:157
[ Info: Constructed 2280 unitigs
[ Info: Identifying the 30bp (K - 1) overlaps between 2280 unitigs
[ Info: Linking 2280 unitigs by their 30bp (K - 1) overlaps
[ Info: Done constructing compressed de-bruijn graph from 4554902 31-mers
Graph Genome workspace
 Sequence distance graph (2280 nodes)
 Paired Read Datastores (1):
  'ecoli-pe-test': 988980 reads (494490 pairs)
 Long Read Datastores (0):
 Linked Read Datastores (0):
5 Likes

If you use a BigDNAKmer from BioSequences, allowing a K of 63, you get a much less “hairbally” graph as expected:

1 Like

So, I’ve been working on the functionality that allows you to project kmer counts (kmer coverage) across nodes of a graph (and other sequences), and want to know what you guys think about this issue of type stability.

So in our C++ framework, the k of a kmer is a runtime variable of a kmer class. In BioJulia, the K parameter is a parameter of the kmer type. This makes sense and some advantages - functions can be written such that two sets of kmers input can’t be of different k, vectors of kmers can enforce k is the same for all elements, many calculations can be worked out and make a constant at compile time.

The type that will store kmer counts indexed against the graph is called MerCounts{M<:AbstractMer}. It looks like this:

struct MerCounts{M<:AbstractMer}
    mer_index::Vector{M}            # Ordered list of kmer that contain counts
    count_names::Vector{Symbol}     # Names of the count vectors
    counts::Vector{Vector{UInt16}}  # Count vectors, each contains an entry per mer in the index
    name::Symbol                    # The name of this MerCounts
    counting_mode::KmerCountMode
end

Storing them in the WorkSpace in a vector however, means mer_count_stores::Vector{MerCounts}, which is an abstract type. And so, getting a specific mer count store from the workspace results in a type uncertainty:

function mer_counts(ws::WorkSpace, name::Symbol)
    for mc in ws.mer_count_stores
        if mc.name == name
            return mc
        end
    end
    error(string("WorkSpace has no mer count datastore datastore called ", name))
end
julia> @code_warntype mer_counts(ws, :testcounts)
Body::MerCounts
1 ── %1  = (Base.getfield)(ws, :mer_count_stores)::Array{MerCounts,1}
│    %2  = (Base.arraylen)(%1)::Int64
│    %3  = (Base.sle_int)(0, %2)::Bool
│    %4  = (Base.bitcast)(UInt64, %2)::UInt64
│    %5  = (Base.ult_int)(0x0000000000000000, %4)::Bool
│    %6  = (Base.and_int)(%3, %5)::Bool
└───       goto #3 if not %6
2 ── %8  = (Base.arrayref)(false, %1, 1)::MerCounts
└───       goto #4
3 ──       goto #4
4 ┄─ %11 = φ (#2 => false, #3 => true)::Bool
│    %12 = φ (#2 => %8)::MerCounts
│    %13 = φ (#2 => 2)::Int64
└───       goto #5
5 ── %15 = (Base.not_int)(%11)::Bool
└───       goto #13 if not %15
6 ┄─ %17 = φ (#5 => %12, #12 => %35)::MerCounts
│    %18 = φ (#5 => %13, #12 => %36)::Int64
│    %19 = (Base.getfield)(%17, :name)::Symbol
│    %20 = (%19 === name)::Bool
└───       goto #8 if not %20
7 ──       return %17
8 ── %23 = (Base.bitcast)(UInt64, %18)::UInt64
│    %24 = (Base.sub_int)(%23, 0x0000000000000001)::UInt64
│    %25 = (Base.arraylen)(%1)::Int64
│    %26 = (Base.sle_int)(0, %25)::Bool
│    %27 = (Base.bitcast)(UInt64, %25)::UInt64
│    %28 = (Base.ult_int)(%24, %27)::Bool
│    %29 = (Base.and_int)(%26, %28)::Bool
└───       goto #10 if not %29
9 ── %31 = (Base.arrayref)(false, %1, %18)::MerCounts
│    %32 = (Base.add_int)(%18, 1)::Int64
└───       goto #11
10 ─       goto #11
11 ┄ %35 = φ (#9 => %31)::MerCounts
│    %36 = φ (#9 => %32)::Int64
│    %37 = φ (#9 => false, #10 => true)::Bool
│    %38 = (Base.not_int)(%37)::Bool
└───       goto #13 if not %38
12 ─       goto #6
13 ┄ %41 = invoke Base.print_to_string("WorkSpace has no mer count datastore datastore called "::String, _3::Vararg{Any,N} where N)::String
│          invoke BioSequenceGraphs.error(%41::String)
└───       $(Expr(:unreachable))

I’m not sure if I should do anything about this, presumably if I seperate functions actually doing work i.e.:

function myfun(ws::WorkSpace, name::Symbol)
    counts = mer_counts(ws, name)
    do_the_thing(counts)
end

Then the type uncertainty should be contained and the do_the_thing will be compiled and efficient.

Another solution could be to remove the parameter from the type:

struct MerCounts{M<:AbstractMer}
    mer_type::DataType
    mer_index::Vector{UInt64}            # Ordered list of kmer that contain counts
    count_names::Vector{Symbol}     # Names of the count vectors
    counts::Vector{Vector{UInt16}}  # Count vectors, each contains an entry per mer in the index
    name::Symbol                    # The name of this MerCounts
    counting_mode::KmerCountMode
end

But that’s not overly great as I’d have to make sure that mer_type was checked quite a lot.

Anyway, if anyone has feedback or ideas on limiting the type uncertainty as much as possible (I think what I have now will be fine with function barriers), let me know!

2 Likes

This week has been a bit more hectic than usual, but I did still manage to get some coding done.
This week I’ve created a type called the UniqueMerIndex.

This index takes a graph, and generates all the kmers in the graph, it indexes every kmer that appears in the graph only once, storing for each kmer, the node (with strand orientation), and starting position of the kmer.

julia> using BioSequences, BioSequenceGraphs

julia> ws = dbg(BigDNAMer{60}, 10, "ecoli-pe-test.prseq")
[ Info: Counting kmers in datastore
[ Info: Constructing compressed de-bruijn graph from 4567865 60-mers
[ Info: Constructing unitigs from 4567865 60-mers
[ Info: Constructed 894 unitigs
[ Info: Identifying the 59bp (K - 1) overlaps between 894 unitigs
[ Info: Linking 894 unitigs by their 59bp (K - 1) overlaps
[ Info: Done constructing compressed de-bruijn graph from 4567865 60-mers
Graph Genome workspace
 Sequence distance graph (894 nodes)
 Paired Read Datastores (1):
  'ecoli-pe-test': 988980 reads (494490 pairs)
 Long Read Datastores (0):
 Linked Read Datastores (0):
 Mer Count Datastores (0):


julia> idx = BioSequenceGraphs.UniqueMerIndex{DNAKmer}(ws.sdg)
BioSequenceGraphs.UniqueMerIndex{Mer{DNAAlphabet{2},31}}(Dict(ACATTTAATATTTAATTATGAGTTCCTCACC=>(Node: 482, Base position: 19161),ATCAAGAAATCCTCACTGATCCTTCCTATTC=>(Node: 625, Base position: 9261),ATTCGGCATTATCGTTGCCCTGTTTACGCTG=>(Node: 274, Base position: 25673),CAGTGCCTCGCAGGTAAGCGCCTGGCCTATG=>(Node: -274, Base position: 30764),GTCGTTGAGTTTGATAAAAACGGTACGGCAA=>(Node: 714, Base position: 59650),ACCAGTTGCCGCCAGCGGTCCGAAACGCGTA=>(Node: -36, Base position: 31793),GGCCTGTTTTGCGTTTGTTTCTGTCTCAAGA=>(Node: 316, Base position: 7510),CGAACAAGAAACAGATAACCACGGGCACCAG=>(Node: -37, Base position: 26733),CGGTGATCACGCCAGCGCTATTAATCTGCCA=>(Node: 394, Base position: 61574),TCTTTAAACTTCTTTACATTAGGTTATGTAA=>(Node: -77, Base position: 68986)…), UInt64[0x0000000000000000, 0x0000000000000000, 0x0000000000000030, 0x000000000000029a, 0x0000000000000000, 0x0000000000004388, 0x00000000000000bc, 0x0000000000000255, 0x000000000000005d, 0x0000000000000000  …  0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000], UInt64[0x0000000000000021, 0x000000000000002a, 0x000000000000006a, 0x00000000000002d4, 0x000000000000003a, 0x00000000000043c2, 0x0000000000000103, 0x0000000000000294, 0x000000000000009c, 0x0000000000000023  …  0x000000000000001e, 0x000000000000639e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e])

julia> BioSequenceGraphs.find_unique_mer(idx, mer"ATCAAGAAATCCTCTCTGATCCTTGGTATTC"dna)
(false, (Node: 0, Base position: 0))

julia> BioSequenceGraphs.find_unique_mer(idx, mer"ATCAAGAAATCCTCACTGATCCTTCCTATTC"dna)
(true, (Node: 625, Base position: 9261))

This will be the basis of a reliable, but somewhat strict paired-end read mapper. Paired-end reads are highly accurate in base composition and so mapping using the unique kmers in a graph is a simple method that we’ve found gives pretty good results, and often, any reads that don’t neatly sit inside a node of the graph can be thrown away without discarding too much good information. But of course other kmer indexes for mappers will be included, and chaining of paired-end read mappings should also be doable, we’ve just not had reason to do it yet. Long reads are a different matter.

I’ve also started plans for visualizing graphs. Currently this has no code, as I’m planning how we should view and vizualize GenomeGraphs in julia - drafting any pseudocode with pen and paper until I’m happy with the overall plan.

2 Likes

Ok, just to say some documentation is beginning to appear now over at https://biojulia.net/GenomeGraphs.jl/latest/

Also the package has been renamed from BioSequenceGraphs to GenomeGraphs.

The priority for me now is to get Arda’s GSoC work merged into master and to massage it into the WorkSpace framework. Then we will have de-Bruijn graph construction, bubble-popping and tip-cleaning graph clean-up transformations. Which is probably a nice point to cut a rough v0.1. From there we can implement more areas of the WorkSpace or make perf improvements.

I’m currently testing the basics with E.coli so expect some performance nasties with bigger datasets (especially during kmer counting), unless you’re on a chunky machine with RAM to spare.

1 Like

Alrighty! We have a tip clipping process! Merging GSoC branches and getting tip clipping ready revealed several really stupid bugs of mine! Including writing to GFA and reversing paths through the genome graph, which explain why the previous images of the genome I posted looked a bit weird to me.

Let’s see it. First we’ll get a basic unitig graph / compressed de-Bruijn graph:

julia> using GenomeGraphs

julia> ws = dbg(BigDNAMer{63}, 5, "ecoli-test-paired.prseq")
[ Info: Counting kmers in datastore
[ Info: Constructing compressed de-bruijn graph from 4572542 63-mers
[ Info: Constructing unitigs from 4572542 63-mers
[ Info: Constructed 1391 unitigs
[ Info: Identifying the 62bp (K - 1) overlaps between 1391 unitigs
[ Info: Linking 1391 unitigs by their 62bp (K - 1) overlaps
[ Info: Done constructing compressed de-bruijn graph from 4572542 63-mers
Graph Genome workspace
 Sequence distance graph (1391 nodes)
 Paired Read Datastores (1):
  'my-ecoli': 988980 reads (494490 pairs)
 Long Read Datastores (0):
 Linked Read Datastores (0):
 Mer Count Datastores (0):

julia> GenomeGraphs.write_to_gfa1(ws.sdg, "ecoli-test-dbg")
[ Info: Saving graph to ecoli-test-dbg

Ok let’s see how this looks:

Ok, let’s zoom in on an area of the genome graph that would clearly be more contiguous and improved by some tip clipping:

Ok, these red nodes should be a single unitig, but they’re not because of these two blue “tip nodes”. For anyone unfamiliar with DNA sequencing, tip nodes are the result of sequencer errors, which are typically distributed towards the ends of the sequencing reads - the sequencing by synthesis becomes more error-prone over time as the reagents are consumed and the reaction progresses, hence errors occur more often at the end of reads. Erroneous nodes generated by erroneous kmers at the ends of reads don’t typically have forward de-bruijn neighbours - hence tips. The N50 of the graph (a stat that measures the contiguity (NOT quality!) of the assembly) in this state is 35,172bp.

Ok let’s get rid of these things using the tip clipping:

julia> GenomeGraphs.remove_tips!(ws, 200)
[ Info: Beginning tip removal process
[ Info: Pass number: 1
[ Info: Found 336 tips
[ Info: Removed 336 tips
[ Info: Collapsing any resulting transient paths
[ Info: Pass number: 2
[ Info: Found 20 tips
[ Info: Removed 20 tips
[ Info: Collapsing any resulting transient paths
[ Info: Pass number: 3
[ Info: Found 1 tips
[ Info: Removed 1 tips
[ Info: Collapsing any resulting transient paths
[ Info: Pass number: 4
[ Info: No more tips in graph
[ Info: Finished tip removal process in 4 passes
Graph Genome workspace
 Sequence distance graph (1493 nodes)
 Paired Read Datastores (1):
  'my-ecoli': 988980 reads (494490 pairs)
 Long Read Datastores (0):
 Linked Read Datastores (0):
 Mer Count Datastores (0):

julia> GenomeGraphs.write_to_gfa1(ws.sdg, "ecoli-test-notip")
[ Info: Saving graph to ecoli-test-notip

Awesome, no more tips, there are some bubbles and repetitions and some other structures - we’re not at the fully resolved circular ecoli genome yet, but the N50 has already gone up to 60,782bp, and this genome is already useful for many downstream purposes.

2 Likes

This looks awesome! Great job!

Thanks! Our GSoC student for this year Arda Akdemir, deserves most credit for tip clipping. I just merged and re-factored: the origonal GSoC design didn’t make use of the graph links and worked with just the kmers (unless I didn’t understand it!), I refactored and made use of the links info in the graph and made it fit as an iterative step in this GenomeGraphs dbg -> tip_clipping / bubble_pop -> K200 read pathing -> “yet more scaffolding tm” workflow that’s intended.