Modify FASTQ file description with a new string using FASTX

Gopi1616 · September 15, 2020, 11:20pm

Hi, I am a Python programmer and new to Julia. One of the tasks I am working on requires modifying the description lines in FASTQ sequence files.
Doing this in Python was pretty time-consuming, so I wanted to try the same in Julia with the FASTX package.
Can anyone recommend a method, or provide an example, to loop through each description, substitute a modified line, and write the result back to FASTQ efficiently?

Here is the pseudocode of my Python script:

for line in fastq:
    if line.startswith('@'):  # description line

kevbonham · September 16, 2020, 12:35am

Welcome! This should be pretty straightforward, and may be faster than with Python, especially if you don’t actually have to store or process the sequences themselves. Can you share what you’ve tried and what issues you’re running into?

It may also be helpful to take a look at this post, which has some tips on how to format things if you’re including code.

ufechner7 · September 16, 2020, 12:42am

Did you read the documentation?
https://biojulia.net/FASTX.jl/stable/manual/fastq/

What is it you do not understand?

jakobnissen · September 22, 2020, 8:51am

You should check the documentation.

Nonetheless, I couldn’t resist giving an implementation a go. Here’s a simple implementation:

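# Import the FASTQ accessors, reader/writer and record type from FASTX
import FASTX.FASTQ: identifier, description, sequence, quality, Reader, Writer, Record

# Copy every record from `inp` to `out`, replacing its description with `f(rec)`
function modify_descriptions(f, inp::IO, out::IO)
    reader, writer = Reader(inp), Writer(out)
    for rec in reader
        # Rebuild the record: same identifier, sequence and quality, new description
        write(writer, Record(identifier(rec), f(rec), sequence(String, rec), quality(rec)))
    end
end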

You can use it like so:

f(x) = description(x) * "_with_extra_stuff"
inp = open("/my/input.fastq")
out = open("/tmp/test", "w")
modify_descriptions(f, inp, out)
close(inp)
close(out)

It’s not optimized: it runs at around 65 MB/s on my computer, which is probably good enough for most use cases. A faster version would need the following changes:

Iterate over the FASTQ reader by overwriting a single FASTQ record until end of file
Modify the description of the record in-place, ideally without any heap allocations
Use a fork of FASTX with commit 18a160b merged

With these changes, it would probably run at > 500 MB/s on uncompressed files, or at whatever speed your computer can gzip-compress the output.
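
Separately from the FASTX version above, and for comparison with the line-by-line pseudocode in the question, here is a minimal sketch in plain Julia that does not use FASTX at all. It assumes strict four-line records (no wrapped sequence lines) and does no validation beyond checking the leading '@'. Note that testing only for a leading '@' is not reliable on its own, because a quality line can also start with '@', so the sketch counts lines instead. The name `rewrite_headers` is a hypothetical helper, not part of any package.

```julia
# Plain-Julia sketch (no FASTX): rewrite the header of every 4-line FASTQ record.
# Assumption: strict 4-line records. `rewrite_headers` is a hypothetical name.
function rewrite_headers(f, inp::IO, out::IO)
    for (i, line) in enumerate(eachline(inp))
        if i % 4 == 1
            # Header line: must start with "@"; pass the text after it to `f`.
            startswith(line, "@") || error("line $i: expected '@' header, got: $line")
            println(out, '@', f(SubString(line, 2)))
        else
            # Sequence, '+' separator and quality lines are copied unchanged.
            println(out, line)
        end
    end
    return out
end

# Usage on in-memory data; the second record's quality line starts with '@'.
inp = IOBuffer("@read1 sample=A\nACGT\n+\nIIII\n@read2 sample=B\nTTGA\n+\n@III\n")
out = IOBuffer()
rewrite_headers(h -> h * "_with_extra_stuff", inp, out)
result = String(take!(out))
print(result)
```

Unlike the FASTX version, `f` here receives the whole header text (identifier and description together), so splitting off the identifier is left to the caller. For real files you would pass streams from `open` instead of `IOBuffer`s.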
