Questions about XAM.jl

When I used Julia’s package called XAM.jl,I was convinced that it was extraordinarily fast.
But when I aligned the reads to the genome and then generated a bam file, how do I restore sequenced reads to undeleted and uninserted reads from the reference genome sequence by the cigar value(like the picture below) in the bam file?Are there any functions in XAM.jl?If I input (cigar,reads,ref),how can i return the the sequence with no deletions and no insertions?

cigar="5M1D5M2I5M1D3M"
reads="ACAAACCCCCCCTTTTTGGG"
ref="AAAAATCCCCCTTTTTGGGG"

I want to return ACAAATCCCCCTTTTTGGGG,like no deletions and no insertions but allowing mismatches.Thank you for helping me!

@jakobnissen I guess

You can use BAM.alignement(read) to get a alignement object from BioAlignements.

I think you can just iterate over it in your case, but you can also use BioAlignements.ref2seq or BioAlignements.seq2ref to query the alignement from the ref or seq perspective.

using BioAlignments, BioSequences

seq = dna"ACAAACCTTTGGG"
ref = dna"ACTAACCCCCCCTTTTTGGG"

scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1);
res = pairalign(GlobalAlignment(), seq, ref, scoremodel)
aln = alignment(res)

aln_seq = [a[2] for a in aln if a[1] != DNA_Gap]

julia> aln
PairwiseAlignment{LongSequence{DNAAlphabet{4}}, LongSequence{DNAAlphabet{4}}}:
  seq:  1 ACAAACC-------TTTGGG 13
          || ||||       ||||||
  ref:  1 ACTAACCCCCCCTTTTTGGG 20

julia> LongDNA{4}(aln_seq)
13nt DNA Sequence:
ACTAACCTTTGGG
1 Like

Thanks.But my problem is that I have the BAM file now, and I want to change the sequence of reads into reads without deletion and insertions by reference and cigar instead of using BioAlignment package to align the reads

Use BAM.alignement to get the alignement from the read, I just used pairalign for the example.

I am so sorry ,I can’t understand your meaning.When I used the BAM.alignement ,it can only return cigar not the sequence .

Simply ,if I input the cigar,reads and seq,how can i return the sequence
ACAAAT CCCCCTTTTTG GGG(in this returned sequence I inserted T and G and deleted double C by the ref and cigar)? Thank you.

BAM.alignment(record) returns a BioAlignments.Alignment struct. The printed output that you see is generated by BioAlignments’ show method. It is true that this struct is essentially a representation of the cigar string. But as @jonathanBieler suggested, the representation can be used with PairwiseAlignment to map to both sequence and reference positions.

# Setup example.
cigar = "5M1D5M2I5M1D3M" # cigar = BAM.cigar(record).
seq = dna"ACAAACCCCCCCTTTTTGGG" # seq = BAM.sequence(record).
ref = dna"AAAAATCCCCCTTTTTGGGG" # This is the reference sequence used when generating the alignment. This should be available, but if not maybe explore https://hgdownload.soe.ucsc.edu/goldenPath/).
target = dna"ACAAATCCCCCTTTTTGGGG"

# Retrieve alignment.
aln = Alignment(cigar) # aln = BAM.alignment(record)

# Construct pairwise alignment.
paln = PairwiseAlignment(AlignedSequence(seq, aln.anchors), ref)

# Process alignments.
itr = Base.Generator(paln) do a

	# Unpack a.
	(sym_seq, sym_ref) = a

	# Implement logic...
	if (sym_seq == DNA_Gap)
		return sym_ref
	end

	if (sym_ref != DNA_Gap)
		return sym_seq
	end

	return nothing
end

# Filter out nothings.
itr = Iterators.filter(!isnothing, itr)

# Collect result.
result = LongDNASeq(collect(itr))

# Compare the result with the target sequence.
result == target
1 Like

Thanks

Hello.I have another question.
I only want to exact the sequence of chr1 not all.

using XAM
bamfile="..."
reader=open(BAM.Reader,bamfile)
for record in reader
    println(BAM.refname(record))
end

This returns all the chromosomes.I just want chr1.But I dont know how to do.
I tried eachoverlap,but the function need start(1) and end(10000).

reader=open(BAM.Reader,bamfile,index=bamfile*".bai")
for record in eachoverlap(reader,"chr1",1:10000)
    println(BAM.refname(record))
end

and i also tried

reader.refseqlens
25-element Vector{Int64}:
 248956422
 133797422
 135086622
 133275309
 114364328
 107043718
 101991189
  90338345
  83257441
  80373285
  58617616
 242193529
  64444167
  46709983
  50818468
 198295559
 190214555
 181538259
 170805979
 159345973
 145138636
 138394717
     16569
 156040895
  57227415
for record in eachoverlap(reader,"chr1",1:248956422)
    println(BAM.refname(record))
end

It can return all of the chr1. I can only make it using eachoverlap(slow?).Are there any ways to make it faster?Thanks very much.