Questions about XAM.jl

When I used Julia’s package called XAM.jl,I was convinced that it was extraordinarily fast.
But when I aligned the reads to the genome and then generated a bam file, how do I restore sequenced reads to undeleted and uninserted reads from the reference genome sequence by the cigar value(like the picture below) in the bam file?Are there any functions in XAM.jl?If I input (cigar,reads,ref),how can i return the the sequence with no deletions and no insertions?

cigar="5M1D5M2I5M1D3M"
reads="ACAAACCCCCCCTTTTTGGG"
ref="AAAAATCCCCCTTTTTGGGG"

I want to return ACAAATCCCCCTTTTTGGGG,like no deletions and no insertions but allowing mismatches.Thank you for helping me!

@jakobnissen I guess

You can use BAM.alignement(read) to get a alignement object from BioAlignements.

I think you can just iterate over it in your case, but you can also use BioAlignements.ref2seq or BioAlignements.seq2ref to query the alignement from the ref or seq perspective.

using BioAlignments, BioSequences

seq = dna"ACAAACCTTTGGG"
ref = dna"ACTAACCCCCCCTTTTTGGG"

scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1);
res = pairalign(GlobalAlignment(), seq, ref, scoremodel)
aln = alignment(res)

aln_seq = [a[2] for a in aln if a[1] != DNA_Gap]

julia> aln
PairwiseAlignment{LongSequence{DNAAlphabet{4}}, LongSequence{DNAAlphabet{4}}}:
  seq:  1 ACAAACC-------TTTGGG 13
          || ||||       ||||||
  ref:  1 ACTAACCCCCCCTTTTTGGG 20

julia> LongDNA{4}(aln_seq)
13nt DNA Sequence:
ACTAACCTTTGGG
1 Like

Thanks.But my problem is that I have the BAM file now, and I want to change the sequence of reads into reads without deletion and insertions by reference and cigar instead of using BioAlignment package to align the reads

Use BAM.alignement to get the alignement from the read, I just used pairalign for the example.

I am so sorry ,I can’t understand your meaning.When I used the BAM.alignement ,it can only return cigar not the sequence .

Simply ,if I input the cigar,reads and seq,how can i return the sequence
ACAAAT CCCCCTTTTTG GGG(in this returned sequence I inserted T and G and deleted double C by the ref and cigar)? Thank you.

BAM.alignment(record) returns a BioAlignments.Alignment struct. The printed output that you see is generated by BioAlignments’ show method. It is true that this struct is essentially a representation of the cigar string. But as @jonathanBieler suggested, the representation can be used with PairwiseAlignment to map to both sequence and reference positions.

# Setup example.
cigar = "5M1D5M2I5M1D3M" # cigar = BAM.cigar(record).
seq = dna"ACAAACCCCCCCTTTTTGGG" # seq = BAM.sequence(record).
ref = dna"AAAAATCCCCCTTTTTGGGG" # This is the reference sequence used when generating the alignment. This should be available, but if not maybe explore https://hgdownload.soe.ucsc.edu/goldenPath/).
target = dna"ACAAATCCCCCTTTTTGGGG"

# Retrieve alignment.
aln = Alignment(cigar) # aln = BAM.alignment(record)

# Construct pairwise alignment.
paln = PairwiseAlignment(AlignedSequence(seq, aln.anchors), ref)

# Process alignments.
itr = Base.Generator(paln) do a

	# Unpack a.
	(sym_seq, sym_ref) = a

	# Implement logic...
	if (sym_seq == DNA_Gap)
		return sym_ref
	end

	if (sym_ref != DNA_Gap)
		return sym_seq
	end

	return nothing
end

# Filter out nothings.
itr = Iterators.filter(!isnothing, itr)

# Collect result.
result = LongDNASeq(collect(itr))

# Compare the result with the target sequence.
result == target
1 Like

Thanks