[GSoC] Parallelism in Bio.jl

bicycle1885 · March 28, 2017, 5:21am

Hi all,

I am a Ph.D. student at the University of Tokyo, Japan, and a core developer of the BioJulia project. I’d like to participate in the Summer of Code again this year.

The project idea I’d like to propose is introducing parallelism in BioJulia. Today’s computational biology faces the problem of growing data, and hence BioJulia developers have paid careful attention to design and algorithms in order to squeeze computational power out of a single CPU. However, we haven’t paid much attention to parallelism in the project because enriching the functionality has had higher priority. I’ve implemented lots of new tools I need in Bio.jl this year and last, and my lab mates are starting to use it in their research. I think it’s time to make it faster with the power of multiple cores.

Ease of use will be the highest priority. What biologists want is to finish their jobs faster, not to write fast but complicated code. So I think the best way to go is an approach like dask and Dagger.jl, which parallelize the computation of delayed tasks (or thunks) using a task scheduler. I’m going to focus on single-node parallelism, since distributed computing on a computer cluster would be too much for a summer project.
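To make the delayed-task idea concrete, here is a minimal sketch in Python using only the standard library. It is not the dask or Dagger.jl API: the names `Thunk`, `delayed`, and `compute` are hypothetical, and the "scheduler" is a simple wave-by-wave loop over a thread pool, assumed here purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Thunk:
    """A delayed call: stores a function and its arguments without running it."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args


def delayed(func):
    """Wrap func so that calling it builds a Thunk instead of executing it."""
    def wrapper(*args):
        return Thunk(func, *args)
    return wrapper


def compute(root, max_workers=4):
    """Evaluate a thunk graph; thunks whose inputs are ready run concurrently."""
    # Collect every thunk reachable from root (post-order depth-first walk).
    order, seen = [], set()

    def visit(thunk):
        if id(thunk) in seen:
            return
        seen.add(id(thunk))
        for arg in thunk.args:
            if isinstance(arg, Thunk):
                visit(arg)
        order.append(thunk)

    visit(root)

    results = {}  # id(thunk) -> computed value
    pending = list(order)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # A thunk is ready once all of its thunk arguments have results.
            ready = [t for t in pending
                     if all(id(a) in results
                            for a in t.args if isinstance(a, Thunk))]
            futures = {}
            for t in ready:
                args = [results[id(a)] if isinstance(a, Thunk) else a
                        for a in t.args]
                futures[id(t)] = pool.submit(t.func, *args)
            # Wait for the whole wave before looking for newly ready thunks.
            for key, future in futures.items():
                results[key] = future.result()
            pending = [t for t in pending if id(t) not in futures]
    return results[id(root)]


@delayed
def square(x):
    return x * x


@delayed
def add(a, b):
    return a + b


total = add(square(3), square(4))  # builds a graph; nothing has run yet
result = compute(total)            # the two square() thunks run in one wave
```

The user-facing code stays sequential in appearance (`add(square(3), square(4))`), and the decision about what runs in parallel is left to the scheduler, which is the ease-of-use property described above. A real scheduler would dispatch each thunk as soon as its inputs finish rather than waiting for a whole wave.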

The mentor candidates I have in mind are @shashi and @ChrisRackauckas because I believe they have sophisticated knowledge of parallel computing in Julia. I would appreciate any comments.

Thank you.

ChrisRackauckas · March 28, 2017, 9:44am

I won’t be able to do this. I think I will have too many DiffEq projects to be able to properly give attention to this.

bicycle1885 · March 28, 2017, 1:03pm

Okay, I’m sorry to hear that, but if you say it’s impossible, then it’s impossible.

sdanisch · March 28, 2017, 8:25pm

For single node parallelism, you might also want to look into https://github.com/JuliaGPU/GPUArrays.jl/, which is almost ready for release!

shashi · March 29, 2017, 11:05am

Hi @bicycle1885,

Good to see your interest, as always! Do submit your proposal!

As Simon mentioned, single-node parallelism would definitely benefit from threads and GPUs. However, threading is yet to come to a mature state in Julia. The good news is, it’s possible to describe parallelism using Dagger.jl and then implement a few tweaks to the scheduler to use threading instead of processes once we have support for a decent threading API in Julia. That said, we should be able to get speedups on embarrassingly parallel tasks and some other tasks using Dagger as it is.

I’d like to know more about the kinds of operations that might benefit from parallelism in Bio-related projects.

If you see GPUs as the most useful form of parallelism for Bio packages, I’d suggest just using GPUArrays.jl. If out-of-core computations are required, we should also be able to use Dagger in combination with GPUArrays to achieve some kind of hybrid parallelism with a small overhead of the scheduler.

@sdanisch nice to see that GPUArrays is maturing, it’s an exciting piece of software!

bicycle1885 · April 2, 2017, 9:37am

Thank you for your comments, @sdanisch and @shashi, and I’m sorry for the late reply.

I’m interested in GPUArrays.jl, but using GPUs is not in the scope of my project because it would require tailor-made algorithms for each task. What I’d like to parallelize are more general jobs (e.g. finding the sequences that contain a particular pattern among 10,000 sequences).

There are many kinds of tasks that can be parallelized in a program. For example, we often read hundreds of GiB of files and do stream processing. Dagger seems to have a parallel CSV reader, and I hope to parallelize reading file formats that are specific to biology as well (BGZFStreams.jl experimentally supports parallel decompression, which is an example of parallel reading used in BioJulia). Another example is k-mer counting, which creates a histogram of short sequence patterns in a sequence. There are several software tools that support parallel counting (e.g. Jellyfish), but Bio.jl doesn’t support it. It would be great if it were possible to count k-mers for multiple sequences in parallel and then reduce the results into a single histogram using Dagger.
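The count-then-reduce pattern for k-mers can be sketched as follows. This is a standard-library Python illustration, not Bio.jl or Dagger code; `count_kmers` and `parallel_kmer_histogram` are hypothetical names, and the thread pool is an assumption chosen to keep the example short.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def count_kmers(seq, k):
    """Count every overlapping substring of length k in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def parallel_kmer_histogram(seqs, k, max_workers=4):
    """Map: count k-mers per sequence in a pool. Reduce: merge the counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each sequence is counted independently, so the map step is
        # embarrassingly parallel.
        for partial_counts in pool.map(partial(count_kmers, k=k), seqs):
            # Merging Counters adds the counts key by key.
            total.update(partial_counts)
    return total


hist = parallel_kmer_histogram(["ACGTAC", "ACG"], k=2)
```

In CPython, threads will not speed up this CPU-bound counting because of the global interpreter lock; swapping in `concurrent.futures.ProcessPoolExecutor` (with the call placed under an `if __name__ == "__main__":` guard) is the usual way to use multiple cores. The structure is the point here: independent per-sequence counts followed by an associative merge, which is what makes the job easy to hand to a task scheduler.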

shashi · April 2, 2017, 6:36pm

These applications are interesting! I’m interested in mentoring this project. Do submit your proposal through the GSoC website as soon as possible!

bicycle1885 · April 3, 2017, 2:15am

Thank you! I’ve submitted a draft, but I’ll keep brushing it up until the deadline.
