Extract alignments from BAM and associated FASTA reference genome

Hello everybody,
I’m new to Julia/BioJulia but I’m already appreciating some features of both the language and the Bio libraries (really nice work!). I’m currently trying to do simple stuff to get acquainted with the BioJulia ecosystem.

As the title implies I’m currently trying to extract the alignment from BAM files with respect with their reference Genome at certain genomic intervals already extracted from VCF files.
I used separately BAM.Reader and FASTA.Reader in order to parse the respective files. I’m wondering if there’s already a standard way to extract the alignment (maybe exploiting the CIGAR representation from BAM records.) between BAM and FASTA of the reference genome.
Any hints?

Thanks,
Gio

There’s an alignment object that allows to query the read sequence from reference coordinate (ref2seq) or the inverse (ref2seq). Here’s an example:

using BioAlignments, BioSequences

typeof(record) # BioAlignments.BAM.Record
BioAlignments.BAM.cigar(record) # 36M2D107M

aln = BioAlignments.BAM.alignment(record)
seq = BioAlignments.sequence(record)

ref_range = aln.firstref:aln.lastref

refseq = genome[BioAlignments.BAM.refname(record)][ref_range]

pos_and_operation = [pos=>BioAlignments.ref2seq(aln, pos) for pos in aln.firstref:aln.lastref]

Main>pos_and_operation[35:40]
6-element Array{Pair{Int64,Tuple{Int64,Operation}},1}:
 116338988 => (35, OP_MATCH) 
 116338989 => (36, OP_MATCH) 
 116338990 => (36, OP_DELETE)
 116338991 => (36, OP_DELETE)
 116338992 => (37, OP_MATCH) 
 116338993 => (38, OP_MATCH) 

aligned_seq = BioSequences.DNASequence([
     p[2] == OP_MATCH ? seq[p[1]] : DNA_Gap for (refpos, p) in pos_and_operation 
 ]) #this will fail is there's other cigar operations

Main>aligned_seq[30:50]
21nt DNA Sequence:
ACTTGCT--AGTTTTTGTTTT

And that’s how I Ioad my genome (I’m using ProgressMeter.jl):

function load_genome(genome_path, haschr=true)

    genome_index = string(genome_path,".fai")
    f = open(genome_path)
    reader = BioSequences.FASTA.Reader(f; index=genome_index)
       
    genome = Dict{String,BioSequences.BioSequence{BioSequences.DNAAlphabet{4}}}()
       
    @showprogress 1 "Loading genome..." for i in [map(string,1:22); "X"; "Y"]
        genome[i] = haschr ? BioSequences.FASTA.sequence(reader[string("chr",i)]) : BioSequences.FASTA.sequence(reader[string(i)])
    end
    close(f)
    genome
end

And this to find a read with a deletion :

findfirst([any(OP == BioAlignments.OP_DELETE for OP in BioAlignments.BAM.cigar_rle(r)[1]) for r in records])
1 Like

@jonathanBieler thanks for the very helpful reply!

1 Like