ANN - Towards a (Bio)Julia powered Genome Graphs framework

Ward9250 · August 26, 2019, 11:07pm

Here’s a taste of what the dbg process method will feel like.

In this framework, the docs and API will describe some functions/methods as “process” functions or simply just “processes”. These methods are basically as high level as you can get in the framework. They implement some sequence/pipeline of operations on the workspace that are incredibly common (and so rather than script them every time, you probably just want to call a single command and then go for a coffee!.. or go home, some processes might be long running).

dbg will accept a kmer type, a minimum frequency cutoff, and the filename of a paired-read datastore. As output, it will give you a WorkSpace, with an initial compressed de-bruijn graph (also called a unitig graph) of all the kmers in the reads that had a frequency exceeding the minimum frequency cutoff.

Here’s an illustration of how that might look:

julia> using BioSequenceGraphs

julia> ws = dbg(DNAKmer, 15, "ecoli-pe-test.prseq")
[ Info: Counting kmers in datastore
[ Info: Constructing compressed de-bruijn graph from 4554902 31-mers
[ Info: Constructing unitigs from 4554902 31-mers
┌ Warning: Some kmers have not been incorporated into unitigs. This may be a case of the circle problem
│   kmerlist[(!).(used_kmers)] =
│    1-element Array{Mer{BioSequences.DNAAlphabet{2},31},1}:
│     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
└ @ BioSequenceGraphs ~/repos/software/BioJulia/SequenceGraphs/src/processes/dbg.jl:182
[ Info: Constructed 2280 unitigs
[ Info: Identifying the 30bp (K - 1) overlaps between 2280 unitigs
[ Info: Linking 2280 unitigs by their 30bp (K - 1) overlaps
[ Info: Done constructing compressed de-bruijn graph from 4554902 31-mers
Graph Genome workspace
 Sequence distance graph (2280 nodes)
 Paired Read Datastores (1):
  'ecoli-pe-test': 988980 reads (494490 pairs)
 Long Read Datastores (0):
 Linked Read Datastores (0):


julia> const BANDAGE_BIN = "/Applications/Bandage.app/Contents/MacOS/Bandage"
"/Applications/Bandage.app/Contents/MacOS/Bandage"

julia> function show_graph(gr)
           filename = tempname()
           BioSequenceGraphs.dump_to_gfa1(gr, filename)
           run(`$BANDAGE_BIN image $filename.gfa $filename.png --height 500`)
           run(`open $filename.png`)
       end
show_graph (generic function with 1 method)

julia> show_graph(ws.sdg)
Process(`open /var/folders/h0/fmczc7390c7f_ddk7yxc47h801tjlx/T/juliaKicOIh.png`, ProcessExited(0))

This ecoli graph is a little messier than what I’m used to seeing out of the w2rap-contigger, which we still use currently to give us a base graph. But you know what, just going by eye, it’s not bad for a raw graph! Without any error correction, tip-clipping or additional contigging, I can already make out the very long loops I’m used to seeing every time I run w2rap on an Ecoli dataset. The dbg process does the equivalent of running the first 2 steps (…well, the first step and a bit of the second step) of w2rap (and there are 8 contigging steps, so yeah, of course dbg makes a messier graph).

Processes may also have a ! version which edits a workspace:

julia> using BioSequenceGraphs
[ Info: Recompiling stale cache file /Users/bward/.julia/compiled/v1.1/BioSequenceGraphs/YUHPz.ji for BioSequenceGraphs [3aa97884-90e5-11e8-2134-af657ec4f203]

julia> ws = WorkSpace()
Graph Genome workspace
 Sequence distance graph (0 nodes)
 Paired Read Datastores (0):
 Long Read Datastores (0):
 Linked Read Datastores (0):


julia> add_paired_reads!(ws, "ecoli-pe-test.prseq")
Graph Genome workspace
 Sequence distance graph (0 nodes)
 Paired Read Datastores (1):
  'ecoli-pe-test': 988980 reads (494490 pairs)
 Long Read Datastores (0):
 Linked Read Datastores (0):


julia> BioSequenceGraphs.dbg!(DNAKmer, 15, ws, "ecoli-pe-test")
[ Info: Counting kmers in datastore
[ Info: Constructing compressed de-bruijn graph from 4554902 31-mers
[ Info: Constructing unitigs from 4554902 31-mers
┌ Warning: Some kmers have not been incorporated into unitigs. This may be a case of the circle problem
│   kmerlist[(!).(used_kmers)] =
│    1-element Array{Mer{BioSequences.DNAAlphabet{2},31},1}:
│     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
└ @ BioSequenceGraphs ~/repos/software/BioJulia/SequenceGraphs/src/processes/dbg.jl:157
[ Info: Constructed 2280 unitigs
[ Info: Identifying the 30bp (K - 1) overlaps between 2280 unitigs
[ Info: Linking 2280 unitigs by their 30bp (K - 1) overlaps
[ Info: Done constructing compressed de-bruijn graph from 4554902 31-mers
Graph Genome workspace
 Sequence distance graph (2280 nodes)
 Paired Read Datastores (1):
  'ecoli-pe-test': 988980 reads (494490 pairs)
 Long Read Datastores (0):
 Linked Read Datastores (0):

Topic		Replies	Views
[GSoC] Parallelism in Bio.jl Community	7	1174	April 3, 2017
ANN: ForestBiometrics.jl Community	9	1240	September 6, 2017
On the performance and design of BioSequences compared to the Seq language Community	0	404	January 25, 2020
Genomic Annotation Package? Biology, Health, and Medicine	1	327	May 4, 2023
Interest in RNA-seq specific convenience package based on BioJulia? Biology, Health, and Medicine	16	1395	July 17, 2023

ANN - Towards a (Bio)Julia powered Genome Graphs framework

Related topics