Reading/parsing .ab1 sequencing files

solf1re2 · April 26, 2019, 1:30pm

Hi,

We currently get some sequencing done on 96 well plates. The lab guys manually check the chromatograms (ab1) files for clean reads. But I was wondering if there was a julia package able to open and retrieve the information for analysis from such files?

I’ve seen similar packages in R and python.

Cheers,

J

kevbonham · April 26, 2019, 2:01pm

Do you know what the format of the file you’ll get back is? I’d guess fasta or fastq, and in either case BioSequences.jl is probably the package you want.

solf1re2 · April 26, 2019, 3:12pm

So we get 4 files per sequence back:
.ab1 ← chromatogram I am interested in, as I understand it, it also contain the signal strenth of each base.
.fas ← this is a fasta file?
.scf ← no idea what this is?
.seq ← seems like similar content to .fas

I have used the biosequences.jl to read and do a few things with the .fas file, but I dont think it contains the same signal information per position ( just the processed data, this is the base with the strongest result)

I’m sorry if I’m not completely clear, I’m fairly new to the data output and Julia’s bio packages both at once!

Cheers

alejandromerchan · April 26, 2019, 5:25pm

The way I understand it, ab1 is a the raw output of the machine, has the chromatogram and information about the run. fas has an initial sequence call, scf has information about sequence quality and seq has a trimmed sequence based on the scf file.

I don’t know of any tool that allows you to interact with ab1 files in Julia (I might be wrong) and most tools are GUI’s that allow users to check the chromatograms manually and change base calls. I think it can be easy to parse for the calls and basically obtain and equivalent to the fas file, but to actually check the chromatograms you need a tool that will transform the base call, the time and the quality into the usual chromatogram you see in tools like Sequencer.

I’m no expert with BioJulia, but I get the impression most of the tools thus far allow you to work with files that are mostly text based, like FASTA and newer ones.

kevbonham · April 26, 2019, 11:01pm

I think this is right. It might be kinda cool to hook into Plots or Makie and do chromatograms, but I’m not sure that’s worth the development effort.

I’d suggest going of the .seq file, since calling bases from Sanger sequencing is a process I doubt any of us are going to improve upon, and presumably the sequencing center you’re using knows what they’re doing.

And if it’s formatted like fasta, just use the fasta reader in BioSequences

solf1re2 · April 30, 2019, 7:48am

Thanks for the replies.
We do use the fasta for processing, but the scientists still have to manually check the chromatogram for “background noise” in each read.
I was just hoping I could make a first pass in julia and flag reads as either clean (clear signal for one base), or flagging them for needing checking as there is some overlap. I think a colleague has got somewhere with this in R, but I was mainly hoping that I could do the equivalent in Julia.
I guess if it can be done in R, I could maybe use RCall or something?

kevbonham · April 30, 2019, 7:28pm

You could certainly use RCall as a first pass, though anything that can be done in R should be doable in julia too. If you want to send an example of these files for a given sequence, I can take a look and maybe point you in the right direction. A PR for a reader for these files would almost certainly be a welcome addition to BioSequences.jl (and I and other folks here would be happy to provide guidance).

alejandromerchan · April 30, 2019, 8:38pm

Don’t you get that directly from the machine with some phred scores? Or is that what you want, your own implementation of phred scores in Julia? When I did something like this back in the day, we will look at how defined the peaks were compared to background noise, we looked for overlap between peaks that might cause confusion in base calling and that was mostly it. As @kevbonham said, you might need to provide more info about your goals and files.

Ward9250 · May 7, 2019, 2:14pm

If the ab* files have a well-defined format, then I see no reason we couldn’t add some parsers to the ecosystem in the near future if it was something killer that people really wanted. But I work for an institute that provides a good deal of our nation’s sequencing capability, and we rarely have any reason to look at anything preceding the factq files, especially for Illumina.

swiftfish · May 25, 2019, 7:26pm

I was looking for the same a few months ago and came across the reader in BioSequences…

using BioSequences

# Sample .ab1 files at
# https://github.com/BioJulia/BioFmtSpecimens.git
# open a valid ab1 file
stream = open(ABIF.Reader, "BioFmtSpecimens/ABI/3730.ab1")

# all existing data tags
data = ABIF.get_tags(stream)

# Basecaller primary peaks
primary_peaks = stream["PBAS"]["PBAS2"]

# Base order array
bases = [base for base = stream["FWO_"]["FWO_"]]

# Dictionary of base trace values
channels = ["DATA9", "DATA10", "DATA11", "DATA12"]
trace_data = [stream["DATA"][channel] for channel = channels]
traces = Dict(zip(bases, trace_data))

# Basecaller peak quality scores
quality_scores = [Int(value) for value = stream["PCON"]["PCON2"]]

solf1re2 · May 29, 2019, 12:42pm

Thanks. That does seem to be what I was looking for, at any rate I have managed to apply it to the ab1 files I have and I can make it output stuff. Not sure what it is just yet, but its a start!

Topic		Replies	Views
[blogpost] From FASTQ to CNV calls in Julia Community biology , blog-post	0	178	April 17, 2024
[ANN] Nucleotide_Essentials.jl - Support for some basic first steps in analyzing Illumina sequencing data! Package Announcements package , announcement , biology	3	422	April 15, 2022
Interest in RNA-seq specific convenience package based on BioJulia? Biology, Health, and Medicine	16	1399	July 17, 2023
Questions about XAM.jl Biology, Health, and Medicine question	8	749	October 4, 2022
Julia for processing Next-Generation Sequencing (NGS) datasets Biology, Health, and Medicine question , package	5	394	April 23, 2024

Reading/parsing .ab1 sequencing files

Related topics