This week has been a bit more hectic than usual, but I did still manage to get some coding done.
This week I’ve created a type called the UniqueMerIndex
.
This index takes a graph, and generates all the kmers in the graph, it indexes every kmer that appears in the graph only once, storing for each kmer, the node (with strand orientation), and starting position of the kmer.
julia> using BioSequences, BioSequenceGraphs
julia> ws = dbg(BigDNAMer{60}, 10, "ecoli-pe-test.prseq")
[ Info: Counting kmers in datastore
[ Info: Constructing compressed de-bruijn graph from 4567865 60-mers
[ Info: Constructing unitigs from 4567865 60-mers
[ Info: Constructed 894 unitigs
[ Info: Identifying the 59bp (K - 1) overlaps between 894 unitigs
[ Info: Linking 894 unitigs by their 59bp (K - 1) overlaps
[ Info: Done constructing compressed de-bruijn graph from 4567865 60-mers
Graph Genome workspace
Sequence distance graph (894 nodes)
Paired Read Datastores (1):
'ecoli-pe-test': 988980 reads (494490 pairs)
Long Read Datastores (0):
Linked Read Datastores (0):
Mer Count Datastores (0):
julia> idx = BioSequenceGraphs.UniqueMerIndex{DNAKmer}(ws.sdg)
BioSequenceGraphs.UniqueMerIndex{Mer{DNAAlphabet{2},31}}(Dict(ACATTTAATATTTAATTATGAGTTCCTCACC=>(Node: 482, Base position: 19161),ATCAAGAAATCCTCACTGATCCTTCCTATTC=>(Node: 625, Base position: 9261),ATTCGGCATTATCGTTGCCCTGTTTACGCTG=>(Node: 274, Base position: 25673),CAGTGCCTCGCAGGTAAGCGCCTGGCCTATG=>(Node: -274, Base position: 30764),GTCGTTGAGTTTGATAAAAACGGTACGGCAA=>(Node: 714, Base position: 59650),ACCAGTTGCCGCCAGCGGTCCGAAACGCGTA=>(Node: -36, Base position: 31793),GGCCTGTTTTGCGTTTGTTTCTGTCTCAAGA=>(Node: 316, Base position: 7510),CGAACAAGAAACAGATAACCACGGGCACCAG=>(Node: -37, Base position: 26733),CGGTGATCACGCCAGCGCTATTAATCTGCCA=>(Node: 394, Base position: 61574),TCTTTAAACTTCTTTACATTAGGTTATGTAA=>(Node: -77, Base position: 68986)…), UInt64[0x0000000000000000, 0x0000000000000000, 0x0000000000000030, 0x000000000000029a, 0x0000000000000000, 0x0000000000004388, 0x00000000000000bc, 0x0000000000000255, 0x000000000000005d, 0x0000000000000000 … 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000], UInt64[0x0000000000000021, 0x000000000000002a, 0x000000000000006a, 0x00000000000002d4, 0x000000000000003a, 0x00000000000043c2, 0x0000000000000103, 0x0000000000000294, 0x000000000000009c, 0x0000000000000023 … 0x000000000000001e, 0x000000000000639e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e, 0x000000000000001e])
julia> BioSequenceGraphs.find_unique_mer(idx, mer"ATCAAGAAATCCTCTCTGATCCTTGGTATTC"dna)
(false, (Node: 0, Base position: 0))
julia> BioSequenceGraphs.find_unique_mer(idx, mer"ATCAAGAAATCCTCACTGATCCTTCCTATTC"dna)
(true, (Node: 625, Base position: 9261))
This will be the basis of a reliable, but somewhat strict paired-end read mapper. Paired-end reads are highly accurate in base composition and so mapping using the unique kmers in a graph is a simple method that we’ve found gives pretty good results, and often, any reads that don’t neatly sit inside a node of the graph can be thrown away without discarding too much good information. But of course other kmer indexes for mappers will be included, and chaining of paired-end read mappings should also be doable, we’ve just not had reason to do it yet. Long reads are a different matter.
I’ve also started plans for visualizing graphs. Currently this has no code, as I’m planning how we should view and vizualize GenomeGraphs in julia - drafting any pseudocode with pen and paper until I’m happy with the overall plan.