How to run many copies of a random function in parallel on a GPU?

Consider this “embarrassingly parallel” and fairly general problem:

  1. You have a random function f (such as a random number generator, a sampling procedure, or a run of a Markov chain or jump process) that is implemented in some Julia package X, and
  2. You want to run f many times (to draw many samples, to plot a histogram of outputs, or to observe the different trajectories of the Markov chain or jump process).

So question:

Is there a way to “lift” code written for the CPU (say, a function f implemented in Julia package X) and “wrap” it up inside some @macro so that many copies of it will run independently on the GPU?

The above just tries to exploit the fact that different runs of f are independent: each block of the GPU can run its own copy of f, and no block needs to talk to any other block.

Clearly there are more sophisticated approaches, which involve:

  1. Sharing computation between blocks to save work wherever possible.
  2. Optimizing the implementation of f so that it doesn’t make unnecessary memory accesses, etc.

Point 2 above is actually very important! If the implementation of f (say, in Julia package X) is poor, that might cancel out any gains one might get from using a GPU.

But for now, let us ignore all of the above sophistication and do the dumb thing first:

Is there a way to “lift” code written for the CPU (say, a function f implemented in Julia package X) and “wrap” it up inside some @macro so that many copies of it will run independently on the GPU?

Your question is too generic to be answerable. Did you look at the CUDA.jl documentation? The @cuda macro is basically what you describe: you pass it a kernel function in which you do scalar work, differentiating between invocations by using indexing intrinsics (threadIdx() etc.).

That said, GPUs aren’t magical parallel machines; you can’t throw arbitrary code at them and expect it to run in a massively parallel fashion. You can get some nice speed-ups if your code is a simple element-wise operation (e.g. a broadcast), but beyond that you’ll need to think about how the hardware works in order to make the individual threads cooperate, or to optimize memory operations (e.g. caching in thread- or processor-local memory).
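To make that concrete, here is a minimal sketch of such a kernel. The function name and launch configuration are illustrative, not from any particular package; each thread computes its own global index from the intrinsics and updates one element:

    using CUDA

    # each invocation runs on one GPU thread; the thread computes a global
    # index for itself and handles exactly one element of the arrays
    function axpy_kernel!(y, x, a)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            @inbounds y[i] = a * x[i] + y[i]
        end
        return nothing  # kernels must return nothing
    end

    x = CUDA.fill(1.0f0, 1024)
    y = CUDA.fill(2.0f0, 1024)
    @cuda threads=256 blocks=4 axpy_kernel!(y, x, 3.0f0)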


Thanks! I was not aware of the @cuda macro; now I know what to dig deeper into. From a quick read of the documentation, it says:

The @cuda macro should prefix a call, with func a callable function or object that should return nothing.

What is a “callable function”? Also, you used the term “kernel function”. Can it be any function, or does it have to be written in some special way?

And where can I find some examples of using the @cuda macro?

Sorry if I sound too ignorant; I mean no disrespect to all the hard work and love that must have gone into making CUDA.jl. For my purposes, if I can get away without understanding how a GPU works, I would pick that route. Again, no disrespect intended; for example, I don’t understand how the Julia compiler works (LLVM and those details) either.

About my question being too generic: would it help if I shared a more specific version? Could you please take a look and suggest a plan of attack?

Here is a quick sketch of what I am trying to do:

  1. I am running a stochastic simulation of a reaction network using the package Catalyst.jl, following an example from the Catalyst.jl documentation that does such a simulation (a minimal sketch in that spirit is below this list).
  2. Everything up to the line sol = solve(jprob, SSAStepper(), saveat=10.) just defines the model and sets up the problem; the actual solving starts with that call.
  3. So I was hoping to spawn many independent instances of solve(jprob, SSAStepper(), saveat=10.) on a GPU.
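To be concrete, here is a minimal sketch of the kind of setup I mean. The birth-death network, initial condition, and rates below are illustrative stand-ins, not the exact example from the docs:

    using Catalyst, JumpProcesses

    # illustrative birth-death network standing in for the docs example
    rn = @reaction_network begin
        b, X --> 2X
        d, X --> 0
    end

    u0    = [:X => 10]
    tspan = (0.0, 100.0)
    ps    = [:b => 1.0, :d => 1.1]

    # everything up to `solve` just defines the model and the problem
    dprob = DiscreteProblem(rn, u0, tspan, ps)
    jprob = JumpProblem(rn, dprob, Direct())

    # the call I want to run many independent copies of
    sol = solve(jprob, SSAStepper(), saveat=10.)

(On the CPU, I gather the SciML way to express “many independent runs” is an EnsembleProblem, something like solve(EnsembleProblem(jprob), SSAStepper(), EnsembleThreads(), trajectories=1000); what I am after is a GPU analogue of that.)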

I created a GitHub issue on the Catalyst package’s repository for this. From the discussion there, here is a useful comment:

[…] our SSAs are implemented in JumpProcesses.jl, so we could work on making those work on GPUs at least in the pure mass action case. If you want to take a look at those and make PRs (or even suggestions via issues) we can see about what can be done to get some of them working on GPUs in serial.

I want to work on resolving that GitHub issue. Would you be able to assist/advise me in the process?

p.s.: Do you think I should create a new topic for this discussion, or can we continue talking here?

You won’t be able to run arbitrary code like that on the GPU; GPU kernels need to be relatively simple. For example, have a look at this introductory tutorial from the CUDA.jl docs: Introduction · CUDA.jl

If you want more complicated operations to work without the experience to write your own kernels, you’ll be better off relying on the array abstractions we provide. These are vectorized operations (think map, reduce, scan, sort) that have been implemented in a parallel manner and use the GPU efficiently. If it’s possible to rephrase your problem in terms of such operations, that will be the easier way to use the GPU.
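For example, here is a sketch of drawing many samples and post-processing them without writing any kernel by hand (the transformation is arbitrary, just for illustration):

    using CUDA

    x = CUDA.rand(Float32, 10^6)      # 10^6 uniform samples, generated on the GPU
    y = map(v -> v * v + 1f0, x)      # element-wise map compiles to a GPU kernel
    m = reduce(+, y) / length(y)      # parallel reduction, also on the GPU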

If that doesn’t work, you’ll need to look into writing your own kernel, but there are many caveats: you cannot use dynamic dispatch, GC allocations, non-isbits types, etc.
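For the “many independent random runs” pattern you describe, a hand-written kernel could look like the sketch below: one Monte Carlo experiment per thread, here estimating pi. It assumes device-side rand(), which recent CUDA.jl versions provide, and it respects the caveats above (isbits data only, no allocations):

    using CUDA

    # one independent random experiment per thread: sample a point in the
    # unit square and record whether it falls inside the quarter circle
    function montecarlo_kernel!(out)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(out)
            x, y = rand(Float32), rand(Float32)  # device-side RNG
            @inbounds out[i] = x * x + y * y <= 1f0 ? 1f0 : 0f0
        end
        return nothing
    end

    out = CUDA.zeros(Float32, 2^20)
    @cuda threads=256 blocks=cld(length(out), 256) montecarlo_kernel!(out)
    pi_est = 4 * sum(out) / length(out)  # ≈ π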


Great, thanks! This is the perfect high-level advice I needed right now. Alright, so I do have to invest in understanding how to write my own kernels. Thanks!