Hi, I am a python programmer and new to Julia. One of the tasks I am working on needs modification of description lines in Fastq sequence files.
Doing this in python was pretty time-consuming. So wanted to try the same in Julia with the FASTX package.
Can anyone recommend a method or provide an example to loop through each description and substitute with a modified line and writing it back to Fastq in an efficient way?
Here is the pseudocode of my python script
for every_line in fastq:
if line startswith(’@) # description line
Welcome! This should be pretty straightforward, and may be faster than with python, especially if you don’t actually have to store/process the sequences themselves. Can you share what you’ve tried and what issues you’re running into?
It may also be helpful to take a look a look at this post which has some tips on how to format stuff if you’re including code.
Did you read the documentation?
What is it you do not understand?
You should check the documentation.
Nonetheless, I couldn’t resist giving an implementation a go. Here’s a simple implementation:
import FASTX.FASTQ: identifier, description, sequence, quality, Reader, Writer, Record
function modify_descriptions(f, inp::IO, out::IO)
reader, writer = Reader(inp), Writer(out)
for rec in reader
write(writer, Record(identifier(rec), f(rec), sequence(String, rec), quality(rec)))
You can use it like so:
f(x) = description(x) * "_with_extra_stuff"
inp = open("/my/input.fastq")
out = open("/tmp/test", "w")
modify_descriptions(f, inp, out)
It’s not optimized. It does around 65 MB/s on my computer - good enough for most use cases? A fast version would need to have the following changes:
- Iterate over the FASTQ reader by overwriting a single FASTQ record until end of file
- Modify the description of the record in-place, ideally without any heap allocations
- Use a fork of FASTX with commit 18a160b merged
With these changes, it would probably be > 500 MB/s uncompressed, or at whatever speed your computer can gzip compress with.