Julia for processing Next-Generation Sequencing (NGS) datasets

Bonjour-Lemonde · April 23, 2024, 5:35am

Hi all! I would like to ask how to use Julia for processing Next-Generation Sequencing (NGS) datasets, especially for merging paired-end sequencing reads. As far as I can tell, there is currently no Julia package comparable to PANDAseq for this. The quality information in Illumina reads is important for scoring and evaluating the alignment of paired-end sequences, but no Julia package seems to take it into account. So I would like to know whether there are Julia packages that can handle this assembly problem, and I would also welcome suggestions on using Python from Julia. Thanks!

jakobnissen · April 23, 2024, 6:51am

Welcome, @Bonjour-Lemonde !

That’s right, there is no Julia package for merging paired-end NGS reads. Julia currently has a number of low-level bioinformatics packages, such as FASTX.jl for parsing FASTQ files and BioAlignments.jl for Smith-Waterman (S/W) alignment.
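To make concrete what such a low-level parser hands you, here is a minimal sketch of reading FASTQ records and decoding their quality strings. It is plain Python using only the standard library, not FASTX.jl's API. The names `FastqRecord`, `read_fastq` and `phred_to_error_prob` are made up for illustration, and the sketch assumes four-line records with Phred+33 quality encoding.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: List[int]  # Phred scores, one per base


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Assumes Phred+33 encoding by default (an assumption of this sketch).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (a blank line is also treated as the end)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality_line = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality_line):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        quality = [ord(ch) - offset for ch in quality_line]
        yield FastqRecord(header[1:], sequence, quality)


def phred_to_error_prob(q: int) -> float:
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Tiny in-memory example instead of a real file.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for record in read_fastq(example):
    print(record.name, record.sequence, record.quality)
    print([phred_to_error_prob(q) for q in record.quality])
```

The per-base error probabilities are what a quality-aware merger or aligner would work with.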

However, basic NGS tasks such as read trimming, merging, and assembly are usually best done with existing command-line tools, which tend to be written in C or C++. For Illumina reads, I would recommend fastp for trimming and merging, and SPAdes for assembly.
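As a rough sketch of what that looks like in practice, the snippet below assembles a fastp command line that trims and merges one pair of files. It is written in Python with `subprocess`; the file names are placeholders and the helper `fastp_merge_command` is hypothetical. The flags shown (`--in1`, `--in2`, `--merge`, `--merged_out`, `--out1`, `--out2`) should be checked against `fastp --help` for your installed version.

```python
import os
import shutil
import subprocess
from typing import List


def fastp_merge_command(r1: str, r2: str, merged: str,
                        unmerged1: str, unmerged2: str) -> List[str]:
    """Build the argument list for a trim-and-merge fastp run.

    In merge mode, --out1/--out2 receive the pairs that could not be merged.
    """
    return [
        "fastp",
        "--in1", r1,
        "--in2", r2,
        "--merge",
        "--merged_out", merged,
        "--out1", unmerged1,
        "--out2", unmerged2,
    ]


# Placeholder file names, for illustration only.
cmd = fastp_merge_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "sample_merged.fastq.gz",
    "sample_unmerged_R1.fastq.gz", "sample_unmerged_R2.fastq.gz",
)
print(" ".join(cmd))

# Only run when fastp is installed and the placeholder inputs actually exist.
inputs_exist = os.path.exists(cmd[2]) and os.path.exists(cmd[4])
if shutil.which("fastp") and inputs_exist:
    subprocess.run(cmd, check=True)
```

From Julia, the same command can be launched with `run` and a backtick command literal.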

Julia is suitable when you need to do a truly custom analysis, e.g. when developing new techniques in the field. For most standard analyses, I would use existing tools.

kevbonham · April 23, 2024, 10:58am

Agree with this completely. There’s no reason in principle that Julia couldn’t be used to write such tools, but:

1. Given limited resources, no one has considered it worth duplicating the effort.
2. Julia is not (yet?) a great choice for developing command-line tools, which most biologists expect.

Depending on your application, there are some Julia packages for downstream analysis (e.g. SingleCellProjections.jl if you’re doing scRNA-seq), and lots of stuff in the stats/ML space.

jonathanBieler · April 23, 2024, 1:08pm

I’m using Julia regularly for custom read trimming and processing. I think something like PANDAseq should be relatively straightforward to implement on top of existing packages, although that’s maybe more of a package-developer project than an end-user one.
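To give a feel for the core of such a merger, here is a deliberately simplified sketch: reverse-complement the second read, pick the ungapped overlap with the lowest mismatch fraction, and resolve each overlapping position using the Phred scores. This is not PANDAseq's actual probabilistic model, just a common heuristic. The names and thresholds (`merge_pair`, `min_overlap`, `max_mismatch_frac`, `MAX_PHRED`) are assumptions for illustration, and it assumes the fragment is at least as long as each read (no adapter read-through). It is plain Python so it runs without any packages.

```python
from typing import List, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
MAX_PHRED = 41  # cap on merged quality, chosen for illustration


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(seq1: str, qual1: List[int], seq2: str, qual2: List[int],
               min_overlap: int = 10,
               max_mismatch_frac: float = 0.1) -> Optional[Tuple[str, List[int]]]:
    """Merge a read pair; return (sequence, qualities) or None if no overlap fits."""
    rc2 = reverse_complement(seq2)
    rq2 = qual2[::-1]  # qualities must be reversed along with the bases

    # Scan ungapped overlaps between the end of read 1 and the start of rc2.
    best_k, best_frac = 0, 1.0
    for k in range(min(len(seq1), len(rc2)), max(min_overlap, 1) - 1, -1):
        mismatches = sum(a != b for a, b in zip(seq1[-k:], rc2[:k]))
        frac = mismatches / k
        if frac < best_frac:  # strict '<' keeps the longest overlap on ties
            best_k, best_frac = k, frac
    if best_k == 0 or best_frac > max_mismatch_frac:
        return None

    start = len(seq1) - best_k
    merged_seq = list(seq1[:start])
    merged_qual = list(qual1[:start])
    for i in range(best_k):
        b1, p1 = seq1[start + i], qual1[start + i]
        b2, p2 = rc2[i], rq2[i]
        if b1 == b2:
            # Agreement: both reads support the base, so confidence increases.
            merged_seq.append(b1)
            merged_qual.append(min(p1 + p2, MAX_PHRED))
        elif p1 >= p2:
            # Disagreement: keep the better-supported base, lower its quality.
            merged_seq.append(b1)
            merged_qual.append(p1 - p2)
        else:
            merged_seq.append(b2)
            merged_qual.append(p2 - p1)
    merged_seq.extend(rc2[best_k:])
    merged_qual.extend(rq2[best_k:])
    return "".join(merged_seq), merged_qual


# Synthetic example: two 28 bp reads taken from the ends of a 40 bp fragment.
fragment = "ACGTACGGTCATTGACCGTAGGCTAACTGAGCTTAGGCAT"
read1 = fragment[:28]
read2 = reverse_complement(fragment[-28:])
result = merge_pair(read1, [30] * 28, read2, [30] * 28)
if result is not None:
    merged, merged_quality = result
    print(merged == fragment, len(merged))
```

A real tool would also handle indels, adapters and reads that extend past the fragment end, which is where most of the actual work lies.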

kevbonham · April 23, 2024, 3:04pm

Oh neat - any chance you’d be willing to write up a tutorial or cookbook recipe for BioTutorials?

jonathanBieler · April 23, 2024, 5:24pm

I could, although the difficulty is to find a realistic use case that isn’t too boring (otherwise it’s just this) and public data that goes with it.
