Here’s a taste of what the dbg
process method will feel like.
In this framework, the docs and API will describe some functions/methods as “process” functions or simply just “processes”. These methods are basically as high level as you can get in the framework. They implement some sequence/pipeline of operations on the workspace that are incredibly common (and so rather than script them every time, you probably just want to call a single command and then go for a coffee!.. or go home, some processes might be long running).
dbg
will accept a kmer type, a minimum frequency cutoff, and the filename of a paired-read datastore. As output, it will give you a WorkSpace
, with an initial compressed de-bruijn graph (also called a unitig graph) of all the kmers in the reads that had a frequency exceeding the minimum frequency cutoff.
Here’s an illustration of how that might look:
julia> using BioSequenceGraphs
julia> ws = dbg(DNAKmer, 15, "ecoli-pe-test.prseq")
[ Info: Counting kmers in datastore
[ Info: Constructing compressed de-bruijn graph from 4554902 31-mers
[ Info: Constructing unitigs from 4554902 31-mers
┌ Warning: Some kmers have not been incorporated into unitigs. This may be a case of the circle problem
│ kmerlist[(!).(used_kmers)] =
│ 1-element Array{Mer{BioSequences.DNAAlphabet{2},31},1}:
│ AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
└ @ BioSequenceGraphs ~/repos/software/BioJulia/SequenceGraphs/src/processes/dbg.jl:182
[ Info: Constructed 2280 unitigs
[ Info: Identifying the 30bp (K - 1) overlaps between 2280 unitigs
[ Info: Linking 2280 unitigs by their 30bp (K - 1) overlaps
[ Info: Done constructing compressed de-bruijn graph from 4554902 31-mers
Graph Genome workspace
Sequence distance graph (2280 nodes)
Paired Read Datastores (1):
'ecoli-pe-test': 988980 reads (494490 pairs)
Long Read Datastores (0):
Linked Read Datastores (0):
julia> const BANDAGE_BIN = "/Applications/Bandage.app/Contents/MacOS/Bandage"
"/Applications/Bandage.app/Contents/MacOS/Bandage"
julia> function show_graph(gr)
filename = tempname()
BioSequenceGraphs.dump_to_gfa1(gr, filename)
run(`$BANDAGE_BIN image $filename.gfa $filename.png --height 500`)
run(`open $filename.png`)
end
show_graph (generic function with 1 method)
julia> show_graph(ws.sdg)
Process(`open /var/folders/h0/fmczc7390c7f_ddk7yxc47h801tjlx/T/juliaKicOIh.png`, ProcessExited(0))
This ecoli graph is a little messier than what I’m used to seeing out of the w2rap-contigger
, which we still use currently to give us a base graph. But you know what, just going by eye, it’s not bad for a raw graph! Without any error correction, tip-clipping or additional contigging, I can already make out the very long loops I’m used to seeing every time I run w2rap on an Ecoli dataset. The dbg
process does the equivalent of running the first 2 steps (…well, the first step and a bit of the second step) of w2rap (and there are 8 contigging steps, so yeah, of course dbg
makes a messier graph).
Processes may also have a !
version which edits a workspace:
julia> using BioSequenceGraphs
[ Info: Recompiling stale cache file /Users/bward/.julia/compiled/v1.1/BioSequenceGraphs/YUHPz.ji for BioSequenceGraphs [3aa97884-90e5-11e8-2134-af657ec4f203]
julia> ws = WorkSpace()
Graph Genome workspace
Sequence distance graph (0 nodes)
Paired Read Datastores (0):
Long Read Datastores (0):
Linked Read Datastores (0):
julia> add_paired_reads!(ws, "ecoli-pe-test.prseq")
Graph Genome workspace
Sequence distance graph (0 nodes)
Paired Read Datastores (1):
'ecoli-pe-test': 988980 reads (494490 pairs)
Long Read Datastores (0):
Linked Read Datastores (0):
julia> BioSequenceGraphs.dbg!(DNAKmer, 15, ws, "ecoli-pe-test")
[ Info: Counting kmers in datastore
[ Info: Constructing compressed de-bruijn graph from 4554902 31-mers
[ Info: Constructing unitigs from 4554902 31-mers
┌ Warning: Some kmers have not been incorporated into unitigs. This may be a case of the circle problem
│ kmerlist[(!).(used_kmers)] =
│ 1-element Array{Mer{BioSequences.DNAAlphabet{2},31},1}:
│ AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
└ @ BioSequenceGraphs ~/repos/software/BioJulia/SequenceGraphs/src/processes/dbg.jl:157
[ Info: Constructed 2280 unitigs
[ Info: Identifying the 30bp (K - 1) overlaps between 2280 unitigs
[ Info: Linking 2280 unitigs by their 30bp (K - 1) overlaps
[ Info: Done constructing compressed de-bruijn graph from 4554902 31-mers
Graph Genome workspace
Sequence distance graph (2280 nodes)
Paired Read Datastores (1):
'ecoli-pe-test': 988980 reads (494490 pairs)
Long Read Datastores (0):
Linked Read Datastores (0):