Extract alignments from BAM and associated FASTA reference genome

Gio · October 25, 2019, 9:42am

Hello everybody,
I’m new to Julia/BioJulia but I’m already appreciating some features of both the language and the Bio libraries (really nice work!). I’m currently trying to do simple stuff to get acquainted with the BioJulia ecosystem.

As the title implies I’m currently trying to extract the alignment from BAM files with respect with their reference Genome at certain genomic intervals already extracted from VCF files.
I used separately BAM.Reader and FASTA.Reader in order to parse the respective files. I’m wondering if there’s already a standard way to extract the alignment (maybe exploiting the CIGAR representation from BAM records.) between BAM and FASTA of the reference genome.
Any hints?

Thanks,
Gio

jonathanBieler · October 25, 2019, 12:37pm

There’s an alignment object that allows to query the read sequence from reference coordinate (ref2seq) or the inverse (ref2seq). Here’s an example:

using BioAlignments, BioSequences

typeof(record) # BioAlignments.BAM.Record
BioAlignments.BAM.cigar(record) # 36M2D107M

aln = BioAlignments.BAM.alignment(record)
seq = BioAlignments.sequence(record)

ref_range = aln.firstref:aln.lastref

refseq = genome[BioAlignments.BAM.refname(record)][ref_range]

pos_and_operation = [pos=>BioAlignments.ref2seq(aln, pos) for pos in aln.firstref:aln.lastref]

Main>pos_and_operation[35:40]
6-element Array{Pair{Int64,Tuple{Int64,Operation}},1}:
 116338988 => (35, OP_MATCH) 
 116338989 => (36, OP_MATCH) 
 116338990 => (36, OP_DELETE)
 116338991 => (36, OP_DELETE)
 116338992 => (37, OP_MATCH) 
 116338993 => (38, OP_MATCH) 

aligned_seq = BioSequences.DNASequence([
     p[2] == OP_MATCH ? seq[p[1]] : DNA_Gap for (refpos, p) in pos_and_operation 
 ]) #this will fail is there's other cigar operations

Main>aligned_seq[30:50]
21nt DNA Sequence:
ACTTGCT--AGTTTTTGTTTT

And that’s how I Ioad my genome (I’m using ProgressMeter.jl):

function load_genome(genome_path, haschr=true)

    genome_index = string(genome_path,".fai")
    f = open(genome_path)
    reader = BioSequences.FASTA.Reader(f; index=genome_index)
       
    genome = Dict{String,BioSequences.BioSequence{BioSequences.DNAAlphabet{4}}}()
       
    @showprogress 1 "Loading genome..." for i in [map(string,1:22); "X"; "Y"]
        genome[i] = haschr ? BioSequences.FASTA.sequence(reader[string("chr",i)]) : BioSequences.FASTA.sequence(reader[string(i)])
    end
    close(f)
    genome
end

And this to find a read with a deletion :

findfirst([any(OP == BioAlignments.OP_DELETE for OP in BioAlignments.BAM.cigar_rle(r)[1]) for r in records])

Gio · October 28, 2019, 11:49am

@jonathanBieler thanks for the very helpful reply!

Topic		Replies	Views
Questions about XAM.jl Biology, Health, and Medicine question	8	749	October 4, 2022
Extract start and end positions from a PairwiseAlignment Biology, Health, and Medicine question , biology	6	678	March 10, 2022
[blogpost] From FASTQ to CNV calls in Julia Community biology , blog-post	0	178	April 17, 2024
Julia package to calculate coverage and depth from sam/bam files Biology, Health, and Medicine question , package , biology , io	4	1104	November 23, 2021
How to retrieve indices from bioalignments? General Usage biology	1	563	September 21, 2020

Extract alignments from BAM and associated FASTA reference genome

Related topics