Interest in RNA-seq specific convenience package based on BioJulia?

maltesie · April 1, 2023, 2:16pm

Hi all,

I’m working in Bioinformatics research and started using Julia 2 years ago and now use it as my main language for everything I do. I organize the reusable part of my code in form of a Julia package which I try to keep rather general, but in its current form its still basically my personal toolbox. My question would be if there is any interest in this in the community. Then I would try to make a more generally usable and fully documented and tested package out of it.

So, what is it? I call it RNASeqTools (github) and it builds a convenience layer on top of BioJulia specifically for RNA-seq data. It provides data structures and constructors and some commonly used (by me) functions to conveniently read and then manipulate and/or combine:

read files (.fastq, .fastq.gz, .fasta, fasta.gz)
genome files (.fa)
annotation files (.gff)
alignment files (.bam)
coverage files (.bw)
counts files (.csv)
folders of files of the above types

Some functionality provided by the package:

convenient iterators for FASTX and XAM records
efficient annotation of alignments with IntervalTrees
handling of chimeric reads alignments
computation of coverage from alignments
simple signal detection in coverage
computation of feature counts from coverage or alignments
simple differential gene expression with feature counts using GLMs
motif generation and search in genomes with BioSequences
sample demultiplexing
some wrappers for external tools (bwa-mem2, minimap2, samtools, fastp)

I know this is pretty bloated but I could see parts of all of this being usefull to others (maybe split into parts?), so I would be happy about any suggestions on what a useful package could look like and would contain.

Thanks already

kevbonham · April 1, 2023, 2:54pm

I have a few thoughts. First, great effort, and thanks for starting this conversation! Always great to have more people joining the Julia biologist ranks

Second, there have been a number of in people that have expressed interest in functionality for RNAseq. I’m certainly not familiar with all of those efforts, nor how far any of them have gotten, but it would be great if the various peoplr were in conversation so that effort is not duplicated. This is not to say that people can’t work on their own separate projects if they have different ideas, but I think better to be in dialogue then work in a silo.

Last, and this is based solely on what you’ve written here, not looking at your code, but your bullet points seem to me to be a mix of things that

Might be good in a package or series of packages
Might be better as PRs to existing BioJulia packages
(and this I’m really keen on) might be great to document in blog posts or tutorials.

Re: #1 and #2 - I have some experience being in a similar situation. As I started working on the microbiome in Julia, I started out just aggregating everything I needed into a big bespoke package. As I started gaining more confidence in how things worked though, I started making PRs to other packages, who were often very happy to have the contribution. This has had quite a few benefits, not the least of which is I learned a ton from people who are way better coders than me

Re: #3 I think one of the bigger things lacking in the BioJulia ecosystem right now is lack of tutorials showing how to do common things. If you’re willing to write up some of your workflows, especially given how many different BioJulia packages you seem to be making use of, I think they could be super helpful. We could dust off Tutorials are quite outdated · Issue #3 · BioJulia/BioTutorials · GitHub … For real this time

maltesie · April 1, 2023, 3:19pm

Thanks for the quick reply

I would love to get into contact with people working on similar things. I did not get in touch with the community or BioJulia package maintainers so far because I lacked confidence in my understanding of Julia and the whole BioJulia ecosystem, but slowly feel like I get the hang of it.

As I understood your answer, it would be good to split what I have and check with the existing projects if I could put fitting parts into PR (actually never did this with a public project ). And then see if I can put the rest into smaller packages with more specific applications or something like this?

I could see myself writing a tutorial, although I never did something like this. I will check the issue you referenced. Thanks again.

kevbonham · April 1, 2023, 7:32pm

Great time to learn!

Maybe. It could be that there’s some core of functionally that makes sense as one package, and then the rest would be PRs.

I took a glance at your code, and I could see room for improvement, but there’s clearly a lot to work with.

You definitely wouldn’t be on your own. There actually already is an RNA seq tutorial, but it’s like 3-4 years old and very outdated. You could just update that using what you usually do. That would also be a nice way for others to see the way you code and offer suggestions or improvements.

jonathanBieler · April 1, 2023, 8:12pm

There’s some stuff that clearly should be in base packages (if the functionality isn’t present already), e.g bam_chromosome_names(reader::BAM.Reader). It’s adding a method to BAM.Reader and it’s a general interest (not tied to your particular project).

One issue with Bio packages is that they provide low level interface to read/write files that aren’t super convenient to use for basic tasks. I think your code suffers a bit from this (lots of readers and writers) and that should be tackled with some intermediary file handling & pipeline tools. One idea would be to add Bio types to FileIO. Some discussion here : High level file processing · Issue #76 · BioJulia/FASTX.jl · GitHub

I wonder if you could rewrite this piece of code :

github.com

maltesie/RNASeqTools/blob/master/src/preprocess.jl#L194


      
          function trim_fastp(input_files::PairedSingleTypeFiles;
              fastp_bin=nothing, prefix="trimmed_", adapter=nothing, trim=nothing, trim_loc=:read1, min_length=20,
              max_length=nothing, cut_front=true, cut_tail=true, trim_poly_g=nothing, trim_poly_x=10, filter_complexity=nothing,
              average_window_quality=20, deduplicate=false, skip_quality_filtering=false, overwrite_existing=false)
          
              files = Vector{Tuple{String, Union{String, Nothing}}}(input_files.list)
              trim_fastp(files; fastp_bin=fastp_bin, prefix=prefix, adapter=adapter, trim=trim, trim_loc=trim_loc,
                          min_length=min_length, max_length=max_length, cut_front=cut_front, cut_tail=cut_tail,
                          trim_poly_g=trim_poly_g, trim_poly_x=trim_poly_x, filter_complexity=filter_complexity,
                          average_window_quality=average_window_quality, deduplicate=deduplicate,
                          skip_quality_filtering=skip_quality_filtering, overwrite_existing=overwrite_existing)
              return PairedSingleTypeFiles([(joinpath(dirname(file1),prefix*basename(file1)), joinpath(dirname(file2),prefix*basename(file2)))
                  for (file1, file2) in input_files if !startswith(basename(file1), prefix) | !startswith(basename(file2), prefix)], input_files.type, input_files.suffix1, input_files.suffix2)
          end
          
          function split_each_read(file::String, split_after::Int; suffix1="_1", suffix2="_2", overwrite_existing=false)
              is_fastq = any([endswith(file, ending) for ending in FASTQ_TYPES])
              is_zipped = endswith(file, ".gz")
          
              f = is_zipped ? GzipDecompressorStream(open(file, "r")) : open(file, "r")
              reader = is_fastq ? FASTQ.Reader(f) : FASTA.Reader(f)

Using my BioRecordProcessing package (which is very limited but helps for that type of tasks) :

I don’t know much about RNA-seq data but AFAIK DESeq/2 is the gold standard when it comes to differential gene expression. Maybe a differential gene expression package that could use different methods (yours, DESeq, …) could be useful.

songtaogui · April 2, 2023, 6:25am

I am not an expert in RNA-Seq related analyses. From my point of view, RNA-Seq analysis was currently stick with R, mainly because several commonly used R packages, such as sva, DESeq, WGCNA … more bioinformers would turn to julia if we could implement the core codes of these packages in julia.

maltesie · April 3, 2023, 9:23am

Yep

Cool, thanks. I’m not sure how to organize this so its easy for people to give me feedback. I guess I leave it in the repo and people can open issues? (I guess I was working in a silo the last years , will try to change that now.)

That sounds like a good start. I will try to make a plan this week and write a blog post here about it.

I think I did not fully understand what you mean. I thought this is what I was trying to solve with my package. To have an abstraction layer that makes the file interfacing easier. I will check the issue you referenced, maybe then I get it

Yes, definitely. I checked your code and find it way more elegant than my folder handling. The preprocess.jl file in my opinion is also very specific to my projects, i.e. the wrappers to existing tools should probably come in the form of a jll or something like this to be more general.

That sounds like a good idea. Differential gene expression would definitely attract more people. What I have implemented is basically DESeq2 light. It can do the normalization and the evaluation using a GLM but not the fancy stuff like dispersion trend estimation or bayesian shrinkage. I also think it would be cool to have more than just a copy of DESeq2 in Julia but maybe combine what different tools (edgeR, other normalization methods like cqn, ruv) are good at.

Thanks for all those great answers

maltesie · April 11, 2023, 4:44pm

I’m very sorry for the delay, I need some time to read into BioJulia’s contribution guidelines, github branching models and Julia code style guidelines

I have a few questions regarding the RNA-seq tutorial:

Where would be an appropriate place to discuss the content of such a tutorial? (I checked the BioJulia discord server and it seemed rather quiet there)
Should the tutorial rely on the exported functions from my package or should I introduce everything that is not directly from BioJulia packages?
Should my tutorial replace the old one or should I put it in a separate folder? I work with prokaryotic systems and have some assumptions in my code that probably do not let it scale to eukaryotic genomes well. Maybe it makes sense to have two? But most things would still be the same, I guess.
I normally read all the content from a file into memory. This makes sense in many cases for me but maybe this is not a good or general way and it would be better to have a “per record” approach in the tutorial?

kevbonham · April 11, 2023, 5:16pm

Yeah, I think slack #biology or directly on github makes the most sense.

Definitely ok to use your package. It will be a good way to get discussions going about your functions too.

The old one is quite broken. So fine to replace it, but feel free to start from scratch if you prefer. That said, I bet we can do something about scaling to Euks

Probably, eventually, but I would not overthink it. Writing something that works is better than having nothing or only a broken thing. We can iterate, improve, and work on efficiency / performance as we go.

ForestRaven · April 12, 2023, 8:24am

Hi @maltesie ,

Thanks for your interest in bioinformatics. While I’m not using RNA-seq myself, contributing to other packages would be great (XAM and the likes) for easier I/O.

I’m quite a beginner to julia but I agree with @songtaogui bioinformatics seems to lack momentum at the moment, compared to R. Maybe writing tutorials could change that.

jonathanBieler · April 12, 2023, 12:29pm

On my side I started to work on making BioRecordsProcessing.jl more generic, the idea is to have a pipeline object that takes a source (single/paired file, folder or iterable in memory), a processing function and a sink that writes or collects the output. I don’t want to make it a full-fleshed pipeline system, but I think this would cover a lot of common operations.

using BioRecordsProcessing, FASTX, BioSequences
filename = "/test/data/illumina_full_range_as_illumina.fastq"

p = Pipeline(
    Reader(
        FASTX.FASTQ,
        File(filename)
    ),
    record -> begin
        return sequence(LongDNA{2}, record)
    end,
    Collect(LongDNA{2}),
)

run(p; max_records = 100)

To write to disk you would just change the Collect to a Writer with the appropriate record type, and for processing a folder you would just have to change the File to a Folder with a glob pattern.

maltesie · April 18, 2023, 9:52pm

That looks cool I have a bit of time over the weekend and will see where I can rewrite my code using your package.

jonathanBieler · April 19, 2023, 7:26am

I pushed my changes yesterday, it’s not released yet so you have to ]add BioRecordsProcessing#main. I also updated the docs : Manual · BioRecordsProcessing.jl

I got good test coverage but there might be some bugs still. In terms of features I would maybe add options to group reads in BAM and intersect them (or other indexed files) with a genomic interval.

tecosaur · April 19, 2023, 8:21am

I needed some basic dispersion estimation in a project of mine recently, and I used the formula in Marginal likelihood estimation of negative binomial parameters with applications to RNA-seq data | Biostatistics | Oxford Academic with Optim:

"""Marginal likelihood for α of a NegativeBinomial."""
function ℓnbinα(xs)
    n = length(xs)
    x̄ = mean(xs)
    α -> logbeta(n*x̄, n/α) - log(α+1) -
        sum(logbeta(x + 1, 1/α) + log(x + 1/α) for x in xs)
end

Optim.maximize(ℓnbinα(expression_values), 1e-6, 1e2).res.minimizer

maltesie · April 21, 2023, 12:15pm

Cool, thanks a lot. I will have a look and try to add it to my code.

Emiller · July 17, 2023, 2:43pm

I might have missed this being discussed somewhere already in the thread.

What about wrapping some R packages?

kevbonham · July 17, 2023, 4:11pm

This could be done, though you can always use the R packages directly using RCall.

Topic		Replies	Views
lhe/Biofast benchmark \| FASTQ parsing [Julia,Nim,Crystal,Python,...] Biology, Health, and Medicine performance , benchmark , community , biology	54	8587	May 25, 2020
[ANN] Nucleotide_Essentials.jl - Support for some basic first steps in analyzing Illumina sequencing data! Package Announcements package , announcement , biology	3	422	April 15, 2022
Building some Data analysis Tutorials Statistics tutorials	35	3926	July 24, 2020
Learning Julia for scientists who are beginning programmers Teaching & Outreach	43	3568	April 23, 2024
Things that are easier in Julia than Python/R etc Community python , r	60	6995	October 17, 2021

Interest in RNA-seq specific convenience package based on BioJulia?

Related topics