Reading/parsing .ab1 sequencing files

Hi,

We currently get some sequencing done on 96 well plates. The lab guys manually check the chromatograms (ab1) files for clean reads. But I was wondering if there was a julia package able to open and retrieve the information for analysis from such files?

I’ve seen similar packages in R and python.

Cheers,

J

Do you know what the format of the file you’ll get back is? I’d guess fasta or fastq, and in either case BioSequences.jl is probably the package you want.

So we get 4 files per sequence back:
.ab1 <-- chromatogram I am interested in, as I understand it, it also contain the signal strenth of each base.
.fas <-- this is a fasta file?
.scf <-- no idea what this is?
.seq <-- seems like similar content to .fas

I have used the biosequences.jl to read and do a few things with the .fas file, but I dont think it contains the same signal information per position ( just the processed data, this is the base with the strongest result)

I’m sorry if I’m not completely clear, I’m fairly new to the data output and Julia’s bio packages both at once!

Cheers

The way I understand it, ab1 is a the raw output of the machine, has the chromatogram and information about the run. fas has an initial sequence call, scf has information about sequence quality and seq has a trimmed sequence based on the scf file.

I don’t know of any tool that allows you to interact with ab1 files in Julia (I might be wrong) and most tools are GUI’s that allow users to check the chromatograms manually and change base calls. I think it can be easy to parse for the calls and basically obtain and equivalent to the fas file, but to actually check the chromatograms you need a tool that will transform the base call, the time and the quality into the usual chromatogram you see in tools like Sequencer.

I’m no expert with BioJulia, but I get the impression most of the tools thus far allow you to work with files that are mostly text based, like FASTA and newer ones.

I think this is right. It might be kinda cool to hook into Plots or Makie and do chromatograms, but I’m not sure that’s worth the development effort.

I’d suggest going of the .seq file, since calling bases from Sanger sequencing is a process I doubt any of us are going to improve upon, and presumably the sequencing center you’re using knows what they’re doing.

And if it’s formatted like fasta, just use the fasta reader in BioSequences :slight_smile:

Thanks for the replies.
We do use the fasta for processing, but the scientists still have to manually check the chromatogram for “background noise” in each read.
I was just hoping I could make a first pass in julia and flag reads as either clean (clear signal for one base), or flagging them for needing checking as there is some overlap. I think a colleague has got somewhere with this in R, but I was mainly hoping that I could do the equivalent in Julia.
I guess if it can be done in R, I could maybe use RCall or something?

You could certainly use RCall as a first pass, though anything that can be done in R should be doable in julia too. If you want to send an example of these files for a given sequence, I can take a look and maybe point you in the right direction. A PR for a reader for these files would almost certainly be a welcome addition to BioSequences.jl (and I and other folks here would be happy to provide guidance).

1 Like

Don’t you get that directly from the machine with some phred scores? Or is that what you want, your own implementation of phred scores in Julia? When I did something like this back in the day, we will look at how defined the peaks were compared to background noise, we looked for overlap between peaks that might cause confusion in base calling and that was mostly it. As @kevbonham said, you might need to provide more info about your goals and files.

If the ab* files have a well-defined format, then I see no reason we couldn’t add some parsers to the ecosystem in the near future if it was something killer that people really wanted. But I work for an institute that provides a good deal of our nation’s sequencing capability, and we rarely have any reason to look at anything preceding the factq files, especially for Illumina.

I was looking for the same a few months ago and came across the reader in BioSequences…

using BioSequences

# Sample .ab1 files at
# https://github.com/BioJulia/BioFmtSpecimens.git
# open a valid ab1 file
stream = open(ABIF.Reader, "BioFmtSpecimens/ABI/3730.ab1")

# all existing data tags
data = ABIF.get_tags(stream)

# Basecaller primary peaks
primary_peaks = stream["PBAS"]["PBAS2"]

# Base order array
bases = [base for base = stream["FWO_"]["FWO_"]]

# Dictionary of base trace values
channels = ["DATA9", "DATA10", "DATA11", "DATA12"]
trace_data = [stream["DATA"][channel] for channel = channels]
traces = Dict(zip(bases, trace_data))

# Basecaller peak quality scores
quality_scores = [Int(value) for value = stream["PCON"]["PCON2"]]
1 Like

Thanks. That does seem to be what I was looking for, at any rate I have managed to apply it to the ab1 files I have and I can make it output stuff. Not sure what it is just yet, but its a start!