Interest in RNA-seq specific convenience package based on BioJulia?

Hi all,

I’m working in Bioinformatics research and started using Julia 2 years ago and now use it as my main language for everything I do. I organize the reusable part of my code in form of a Julia package which I try to keep rather general, but in its current form its still basically my personal toolbox. My question would be if there is any interest in this in the community. Then I would try to make a more generally usable and fully documented and tested package out of it.

So, what is it? I call it RNASeqTools (github) and it builds a convenience layer on top of BioJulia specifically for RNA-seq data. It provides data structures and constructors and some commonly used (by me) functions to conveniently read and then manipulate and/or combine:

  • read files (.fastq, .fastq.gz, .fasta, fasta.gz)
  • genome files (.fa)
  • annotation files (.gff)
  • alignment files (.bam)
  • coverage files (.bw)
  • counts files (.csv)
  • folders of files of the above types

Some functionality provided by the package:

  • convenient iterators for FASTX and XAM records
  • efficient annotation of alignments with IntervalTrees
  • handling of chimeric reads alignments
  • computation of coverage from alignments
  • simple signal detection in coverage
  • computation of feature counts from coverage or alignments
  • simple differential gene expression with feature counts using GLMs
  • motif generation and search in genomes with BioSequences
  • sample demultiplexing
  • some wrappers for external tools (bwa-mem2, minimap2, samtools, fastp)

I know this is pretty bloated but I could see parts of all of this being usefull to others (maybe split into parts?), so I would be happy about any suggestions on what a useful package could look like and would contain.

Thanks already :slight_smile:

7 Likes

I have a few thoughts. First, great effort, and thanks for starting this conversation! Always great to have more people joining the Julia biologist ranks :smiling_face:

Second, there have been a number of in people that have expressed interest in functionality for RNAseq. I’m certainly not familiar with all of those efforts, nor how far any of them have gotten, but it would be great if the various peoplr were in conversation so that effort is not duplicated. This is not to say that people can’t work on their own separate projects if they have different ideas, but I think better to be in dialogue then work in a silo.

Last, and this is based solely on what you’ve written here, not looking at your code, but your bullet points seem to me to be a mix of things that

  1. Might be good in a package or series of packages
  2. Might be better as PRs to existing BioJulia packages
  3. (and this I’m really keen on) might be great to document in blog posts or tutorials.

Re: #1 and #2 - I have some experience being in a similar situation. As I started working on the microbiome in Julia, I started out just aggregating everything I needed into a big bespoke package. As I started gaining more confidence in how things worked though, I started making PRs to other packages, who were often very happy to have the contribution. This has had quite a few benefits, not the least of which is I learned a ton from people who are way better coders than me :wink:

Re: #3 I think one of the bigger things lacking in the BioJulia ecosystem right now is lack of tutorials showing how to do common things. If you’re willing to write up some of your workflows, especially given how many different BioJulia packages you seem to be making use of, I think they could be super helpful. We could dust off Tutorials are quite outdated · Issue #3 · BioJulia/BioTutorials · GitHub … For real this time :sweat_smile:

3 Likes

Thanks for the quick reply :slight_smile:

I would love to get into contact with people working on similar things. I did not get in touch with the community or BioJulia package maintainers so far because I lacked confidence in my understanding of Julia and the whole BioJulia ecosystem, but slowly feel like I get the hang of it.

As I understood your answer, it would be good to split what I have and check with the existing projects if I could put fitting parts into PR (actually never did this with a public project :sweat_smile:). And then see if I can put the rest into smaller packages with more specific applications or something like this?

I could see myself writing a tutorial, although I never did something like this. I will check the issue you referenced. Thanks again.

Great time to learn!

Maybe. It could be that there’s some core of functionally that makes sense as one package, and then the rest would be PRs.

I took a glance at your code, and I could see room for improvement, but there’s clearly a lot to work with.

You definitely wouldn’t be on your own. There actually already is an RNA seq tutorial, but it’s like 3-4 years old and very outdated. You could just update that using what you usually do. That would also be a nice way for others to see the way you code and offer suggestions or improvements.

There’s some stuff that clearly should be in base packages (if the functionality isn’t present already), e.g bam_chromosome_names(reader::BAM.Reader). It’s adding a method to BAM.Reader and it’s a general interest (not tied to your particular project).

One issue with Bio packages is that they provide low level interface to read/write files that aren’t super convenient to use for basic tasks. I think your code suffers a bit from this (lots of readers and writers) and that should be tackled with some intermediary file handling & pipeline tools. One idea would be to add Bio types to FileIO. Some discussion here : High level file processing · Issue #76 · BioJulia/FASTX.jl · GitHub

I wonder if you could rewrite this piece of code :

Using my BioRecordProcessing package (which is very limited but helps for that type of tasks) :

I don’t know much about RNA-seq data but AFAIK DESeq/2 is the gold standard when it comes to differential gene expression. Maybe a differential gene expression package that could use different methods (yours, DESeq, …) could be useful.

I am not an expert in RNA-Seq related analyses. From my point of view, RNA-Seq analysis was currently stick with R, mainly because several commonly used R packages, such as sva, DESeq, WGCNA … more bioinformers would turn to julia if we could implement the core codes of these packages in julia.

Yep :slight_smile:

Cool, thanks. I’m not sure how to organize this so its easy for people to give me feedback. I guess I leave it in the repo and people can open issues? (I guess I was working in a silo the last years :sweat_smile:, will try to change that now.)

That sounds like a good start. I will try to make a plan this week and write a blog post here about it.

I think I did not fully understand what you mean. I thought this is what I was trying to solve with my package. To have an abstraction layer that makes the file interfacing easier. I will check the issue you referenced, maybe then I get it :slight_smile:

Yes, definitely. I checked your code and find it way more elegant than my folder handling. The preprocess.jl file in my opinion is also very specific to my projects, i.e. the wrappers to existing tools should probably come in the form of a jll or something like this to be more general.

That sounds like a good idea. Differential gene expression would definitely attract more people. What I have implemented is basically DESeq2 light. It can do the normalization and the evaluation using a GLM but not the fancy stuff like dispersion trend estimation or bayesian shrinkage. I also think it would be cool to have more than just a copy of DESeq2 in Julia but maybe combine what different tools (edgeR, other normalization methods like cqn, ruv) are good at.

Thanks for all those great answers :slight_smile:

I’m very sorry for the delay, I need some time to read into BioJulia’s contribution guidelines, github branching models and Julia code style guidelines :slight_smile:

I have a few questions regarding the RNA-seq tutorial:

  1. Where would be an appropriate place to discuss the content of such a tutorial? (I checked the BioJulia discord server and it seemed rather quiet there)

  2. Should the tutorial rely on the exported functions from my package or should I introduce everything that is not directly from BioJulia packages?

  3. Should my tutorial replace the old one or should I put it in a separate folder? I work with prokaryotic systems and have some assumptions in my code that probably do not let it scale to eukaryotic genomes well. Maybe it makes sense to have two? But most things would still be the same, I guess.

  4. I normally read all the content from a file into memory. This makes sense in many cases for me but maybe this is not a good or general way and it would be better to have a “per record” approach in the tutorial?

1 Like

Yeah, I think slack #biology or directly on github makes the most sense.

Definitely ok to use your package. It will be a good way to get discussions going about your functions too.

The old one is quite broken. So fine to replace it, but feel free to start from scratch if you prefer. That said, I bet we can do something about scaling to Euks :wink:

Probably, eventually, but I would not overthink it. Writing something that works is better than having nothing or only a broken thing. We can iterate, improve, and work on efficiency / performance as we go.

1 Like

Hi @maltesie ,

Thanks for your interest in bioinformatics. While I’m not using RNA-seq myself, contributing to other packages would be great (XAM and the likes) for easier I/O.

I’m quite a beginner to julia but I agree with @songtaogui bioinformatics seems to lack momentum at the moment, compared to R. Maybe writing tutorials could change that.

On my side I started to work on making BioRecordsProcessing.jl more generic, the idea is to have a pipeline object that takes a source (single/paired file, folder or iterable in memory), a processing function and a sink that writes or collects the output. I don’t want to make it a full-fleshed pipeline system, but I think this would cover a lot of common operations.

using BioRecordsProcessing, FASTX, BioSequences
filename = "/test/data/illumina_full_range_as_illumina.fastq"

p = Pipeline(
    Reader(
        FASTX.FASTQ,
        File(filename)
    ),
    record -> begin
        return sequence(LongDNA{2}, record)
    end,
    Collect(LongDNA{2}),
)

run(p; max_records = 100)

To write to disk you would just change the Collect to a Writer with the appropriate record type, and for processing a folder you would just have to change the File to a Folder with a glob pattern.

1 Like

That looks cool :slight_smile: I have a bit of time over the weekend and will see where I can rewrite my code using your package.

I pushed my changes yesterday, it’s not released yet so you have to ]add BioRecordsProcessing#main. I also updated the docs : Manual · BioRecordsProcessing.jl

I got good test coverage but there might be some bugs still. In terms of features I would maybe add options to group reads in BAM and intersect them (or other indexed files) with a genomic interval.

I needed some basic dispersion estimation in a project of mine recently, and I used the formula in Marginal likelihood estimation of negative binomial parameters with applications to RNA-seq data | Biostatistics | Oxford Academic with Optim:

"""Marginal likelihood for α of a NegativeBinomial."""
function ℓnbinα(xs)
    n = length(xs)
    x̄ = mean(xs)
    α -> logbeta(n*x̄, n/α) - log(α+1) -
        sum(logbeta(x + 1, 1/α) + log(x + 1/α) for x in xs)
end

Optim.maximize(ℓnbinα(expression_values), 1e-6, 1e2).res.minimizer

Cool, thanks a lot. I will have a look and try to add it to my code.

I might have missed this being discussed somewhere already in the thread.

What about wrapping some R packages?

This could be done, though you can always use the R packages directly using RCall.