Possible to subset XAM BAM reader object by contig?

Hi all. I’m writing a script to plot depth of coverage for each contig in an alignment. So far, I haven’t found a way to subset the BAM IOStream by contig, so the loop I use to iterate through records has to invoke a condition that the record matches a contig of interest at each iteration. Here’s what the code where I invoke the loop, via a function, looks like:

   # Parse sample name from file name
    sample_name = split(basename(file_name), '.')[1]
    create_sample_dir(sample_name)
    
    # Open the BAM file
    reader = BAM.Reader(open(file_name, "r"), index="$file_name.bai")

    # loop through each contig in the alignment
    for (reference, reflength) in zip(reader.refseqnames, reader.refseqlens)

        coverage = depth_of_coverage(reader, reflength, reference)

        ...

And here’s the depth_of_coverage function:

function depth_of_coverage(bam_reader::XAM.BAM.Reader{IOStream}, reflength::Int64, ref_name::String)

    coverage_vec = fill(0, reflength)

        # Process each record (read) in the BAM file
        for record in bam_reader

            # make sure the current read is not unmapped
            !BAM.ismapped(record) && continue

            # Get the reference sequence name for the current read
            record_ref = BAM.refname(record)
            
            # make sure we are working with a read mapped to the current contig of interest
            record_ref != ref_name && continue

            # get start and end positions
            start, stop = BAM.position(record), BAM.rightposition(record)

            # add 1 to coverage dict for all those positions
            coverage_vec[start:stop] .+= 1

        end

    # return a vector of length reflength, where each position records how many reads are mapped
    return coverage_vec

end

I would prefer to subset the BAM by contig prior to looping through records, so I figured I’d pose the question here: Is there a way to subset alignments by reference contig in Julia? And more broadly, is there a pre-existing function I could use to record depth across the contig? I was unable to find one in my last round of searching.

Many thanks! Happy to provide any additional clarifications.

(And also, if any performance optimizations jump out at you in the code I pasted, I would be very interested!)

1 Like

I don’t work mucho with bam times, so don’t have much to offer, but one performance issue with your implementation (I think) is that it requires looping over the entire bam file once per contig. I would guess I/O would be the biggest bottleneck, no?

An alternative would be to loop through the file once and update the coverage dict for the contig it’s on.

That’s the crux of my question about subsetting the bam reader object: would it be possible to subset the BAM before the loop so the loop doesn’t have to go through all records for all contigs. That said, your suggestion about using a dictionary of coverages for each contig is a great one, as it gets around needing to subset the BAM. I’ll try that and see how it goes. Many thanks Kevin!

Is there some memory mapping of BAM files that could allow random access? That’s the only way I could think of

You can even do a dict of dicts so you don’t have to change your logic too much

You might try using the eachoverlap method. Then the line in depth_of_coverage would be something like

for record in eachoverlap(bam_reader, ref_name, 1:reflength)

I can’t show any benchmarks to prove it, but I have noticed significantly more performance using the eachoverlap method even if it involves multiple file reads over holding the entire BAM file in memory and filtering.

eachoverlap should do the job, but sometimes I also use this function, which advance the reader to the specified region (a genomic interval). You still have to check when you exit the region (all that assuming you have a sorted BAM).

function seek_region!(reader::BAM.Reader, region)

    refindex = findfirst(isequal(region.seqname), reader.refseqnames)
    refindex == nothing && throw(ArgumentError("sequence name $(iter.refname) is not found in the header"))
    
    chunks = XAM.BAM.Indexes.overlapchunks(reader.index.index, refindex, region.first:region.last)
    if !isempty(chunks)
        seek(reader, first(chunks).start)
    end
end