Falling Behind - Julia for genomics?

I just saw on bioRxiv that a group has developed a Pythonic pipeline for parallel processing of 3D epigenetic profiles. Other than the rather forced attempt at alliteration and the distinct lack of creativity in naming the project, this seems like something better suited to a language that has parallel processing baked in. Has anyone thought of attempting to put together a Julia-based pipeline for genomics using native parallel processing and multithreading? I have, but as I like to say, I know enough to get me into trouble but not enough to get me out. It seems to me that Julia is very well positioned to push out some of the slower pipelines, some written in Python and some written in Java.

Java can be fast, and many Python pipelines wrap tools written in C (or, these days, Rust), which do the computational heavy lifting. Julia has had a number of limitations in this space, mostly (in my opinion) around the creation of command-line utilities and stand-alone binaries.

I agree that Julia could be an amazing tool for bioinformatics, and would love for it to be used more widely, but I’ve come to think the “falling behind” narrative, and the worry about competition with other languages, is misguided. We should just keep making cool stuff (see BioMakie.jl and SingleCellProjections.jl as standout examples), and people will come.

10 Likes

Why Numba and Cython are not substitutes for Julia - Stochastic Lifestyle
This blog post gives an example of how optimizing different libraries together in the same language can result in much better performance than a glue language calling separately compiled libraries written in other languages, but that isn’t often necessary. Julia itself is used to wrap a lot of C libraries instead of reimplementing them. An API in a glue language with hot code compiled from other languages can be so fast that reimplementations often end up slower, usually because the better algorithms are difficult to write, but occasionally because of a legitimate language or compiler limitation. Just because it’s easy to make a toy example of pure Python code being slow doesn’t imply that implementing a whole library in another language will by itself get better performance; the toy example is just too far removed from practice.

1 Like

Nice to see that linked here.

Yeah, I think that’s really the key for Julia in this domain. Lots of tools in the bioinformatics pipeline are effectively random binaries tied together with scripts on top. At least when I was last doing bioinformatics, I remember the whole Tophat thing with different interfaces like Galaxy or just plain R, where people were stringing together a ton of binaries which were developed only loosely to work together. The biggest issue for performance has nothing to do with the performance of any one algorithm, but with the fact that each independent binary spends the majority of its time writing its results to a file that the next one then has to read back in. It would take some time, but if someone really wants to make a major improvement here, a complete version of the workflow that handles everything in memory would be a massive step up from what I’ve seen.
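To make the in-memory idea concrete, here is a minimal sketch in Python of stages that hand records to each other directly instead of going through intermediate files. The stage names (`parse_fasta`, `filter_min_length`, `gc_content`, `run_pipeline`) and the choice of FASTA input are hypothetical, picked only for illustration; this is not how Tophat or any existing pipeline works.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (name, sequence); a hypothetical record type


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_min_length(records: Iterable[Record], min_len: int) -> Iterator[Record]:
    """Pass through only the records whose sequence has at least min_len bases."""
    for name, seq in records:
        if len(seq) >= min_len:
            yield name, seq


def gc_content(records: Iterable[Record]) -> Iterator[Tuple[str, float]]:
    """Yield (name, GC fraction) for each record."""
    for name, seq in records:
        gc = sum(1 for base in seq.upper() if base in "GC")
        yield name, (gc / len(seq) if seq else 0.0)


def run_pipeline(lines: Iterable[str], min_len: int = 4) -> List[Tuple[str, float]]:
    """Chain the stages as generators; nothing is written to disk in between."""
    return list(gc_content(filter_min_length(parse_fasta(lines), min_len)))
```

Because each stage is a generator, a record flows through all three steps before the next one is read, so the hand-off between stages costs no file I/O. Passing an open file handle as `lines` would stream a real FASTA file the same way.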

I’ve been out of bio for about 4 years now, so that may be old information, but it seemed so baked in that I’m sure the same workflow is basically set in stone…

I definitely agree with this. There’s good stuff all around in other places. That’s okay. Make sure the pipelines of yesterday exist, but choose a problem that isn’t well-solved and just work on that. Build things that are cool and people will use them, and if you have wrappers to the other things then for the most part people won’t care.

4 Likes

This could be interesting, though in a lot of cases it’s not only about speed but also about the ability to batch jobs on a cluster or AWS. Long-running programs with variable memory requirements are the hardest to deal with, because they often get interrupted if you’re using spot instances, or you have to choose between starting with high memory (which costs more on AWS or takes longer to queue on HPC) and using a restart strategy that starts with low memory and then increases it on failure / resubmission.

In some cases, these considerations mean that if you can offload stuff to disk, you might end up with a faster pipeline in terms of wall-clock time even if compute time is slower.
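The restart strategy described above (start with low memory, increase on failure) might look roughly like the following Python sketch. Everything here is an assumption made for illustration: `run_with_escalating_memory` is a hypothetical helper, the `job` callable stands in for whatever actually submits work to a scheduler or cloud API, and `MemoryError` stands in for however that system reports an out-of-memory failure.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def run_with_escalating_memory(
    job: Callable[[int], T],
    start_gb: int = 4,
    max_gb: int = 64,
    factor: int = 2,
) -> Tuple[T, List[int]]:
    """Run job(mem_gb), multiplying the memory request by factor after each
    out-of-memory failure, until it succeeds or the request exceeds max_gb.

    Returns the job's result and the list of memory sizes that were tried.
    """
    if start_gb <= 0 or factor < 2:
        raise ValueError("start_gb must be positive and factor at least 2")

    attempts: List[int] = []
    mem_gb = start_gb
    while mem_gb <= max_gb:
        attempts.append(mem_gb)
        try:
            # Hypothetical: a real job would be submitted to a scheduler here.
            return job(mem_gb), attempts
        except MemoryError:
            mem_gb *= factor  # resubmit with a larger request
    raise MemoryError(f"job did not fit within {max_gb} GB; tried {attempts}")
```

Each failed attempt throws away whatever work it had done, which is where offloading intermediate results to disk can pay off: a resubmitted job that resumes from a checkpoint only repeats the step that ran out of memory.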

1 Like

I appreciate this. I am a biologist with ideas and no appreciable skills at the level needed to pull something like this off. I’m just one guy out here in the wilderness throwing out ideas and hopefully inspiring someone with vastly superior skills to take the lead, and I would be happy to contribute my meager skills.

Ideas are valuable, skills are valuable. It’s hard when you have one but not the other, but there’s no need to be self-deprecating. I think the primary lack in BioJulia development at the moment is not ideas or skills, it’s simply developer time. Even people with ideas and skills often can’t get to everything they’d like to.

So I’d encourage you to try to build something without worrying about your skill level. This can be with a PR to an existing package, or a new package. Ask here or in issues for help with design or if you run into problems. This has benefits for you (actually doing stuff will improve those skills) and for the ecosystem (because hands-on-keyboard time is a very limiting resource).

8 Likes

First off, I’d like to just say that I am not being self-deprecating, I am being realistic. It tends to tamp down my ambitions; not eliminate them, but tamp them down.

I have developed a few things that I use, such as a tool for calculating temporal reproductive isolation for populations and, among other things, one that takes a list of organisms and queries the NCBI database to find out whether there is a genome for each organism, along with other metadata. I also wrote a crude AMOVA script that needs refinement, but as you said, I just don’t have the time to finish it.

I ask things all the time

1 Like