When I used Julia’s package called XAM.jl,I was convinced that it was extraordinarily fast.
But when I aligned the reads to the genome and then generated a bam file, how do I restore sequenced reads to undeleted and uninserted reads from the reference genome sequence by the cigar value(like the picture below) in the bam file?Are there any functions in XAM.jl?If I input (cigar,reads,ref),how can i return the the sequence with no deletions and no insertions?
You can use BAM.alignement(read) to get a alignement object from BioAlignements.
I think you can just iterate over it in your case, but you can also use BioAlignements.ref2seq or BioAlignements.seq2ref to query the alignement from the ref or seq perspective.
using BioAlignments, BioSequences
seq = dna"ACAAACCTTTGGG"
ref = dna"ACTAACCCCCCCTTTTTGGG"
scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1);
res = pairalign(GlobalAlignment(), seq, ref, scoremodel)
aln = alignment(res)
aln_seq = [a[2] for a in aln if a[1] != DNA_Gap]
julia> aln
PairwiseAlignment{LongSequence{DNAAlphabet{4}}, LongSequence{DNAAlphabet{4}}}:
seq: 1 ACAAACC-------TTTGGG 13
|| |||| ||||||
ref: 1 ACTAACCCCCCCTTTTTGGG 20
julia> LongDNA{4}(aln_seq)
13nt DNA Sequence:
ACTAACCTTTGGG
Thanks.But my problem is that I have the BAM file now, and I want to change the sequence of reads into reads without deletion and insertions by reference and cigar instead of using BioAlignment package to align the reads
Simply ,if I input the cigar,reads and seq,how can i return the sequence
ACAAAT CCCCCTTTTTG GGG(in this returned sequence I inserted T and G and deleted double C by the ref and cigar)? Thank you.
BAM.alignment(record) returns a BioAlignments.Alignment struct. The printed output that you see is generated by BioAlignments’ show method. It is true that this struct is essentially a representation of the cigar string. But as @jonathanBieler suggested, the representation can be used with PairwiseAlignment to map to both sequence and reference positions.
# Setup example.
cigar = "5M1D5M2I5M1D3M" # cigar = BAM.cigar(record).
seq = dna"ACAAACCCCCCCTTTTTGGG" # seq = BAM.sequence(record).
ref = dna"AAAAATCCCCCTTTTTGGGG" # This is the reference sequence used when generating the alignment. This should be available, but if not maybe explore https://hgdownload.soe.ucsc.edu/goldenPath/).
target = dna"ACAAATCCCCCTTTTTGGGG"
# Retrieve alignment.
aln = Alignment(cigar) # aln = BAM.alignment(record)
# Construct pairwise alignment.
paln = PairwiseAlignment(AlignedSequence(seq, aln.anchors), ref)
# Process alignments.
itr = Base.Generator(paln) do a
# Unpack a.
(sym_seq, sym_ref) = a
# Implement logic...
if (sym_seq == DNA_Gap)
return sym_ref
end
if (sym_ref != DNA_Gap)
return sym_seq
end
return nothing
end
# Filter out nothings.
itr = Iterators.filter(!isnothing, itr)
# Collect result.
result = LongDNASeq(collect(itr))
# Compare the result with the target sequence.
result == target