Understanding the transition from memory bound to compute bound in ParallelStencil.jl

Hi, I am interested in using ParallelStencil.jl for finite-difference simulations. I started by experimenting with the acoustic waves app and comparing CPU and GPU run times. For the stock app, my GPU is about 10X faster than the CPU, which makes sense to me if the app is memory bound, given the difference in memory bandwidth between the two devices.

Next I built my own custom wave app that uses some advanced numerical methods to allow for irregularly shaped domains, which uses many more floating point operations per grid cell. I found that for this app, the GPU was now only about 5X faster than the CPU.

Lastly, I built a new app meant to run a much more complicated system; for the purpose of this post, we can consider it 10 wave equations at once. This uses much more memory and many more floating point operations. I found that the GPU is now only as fast as the CPU.

I am wondering if this makes sense to the community in general. As I increase the number of floating point operations per grid cell, should I expect the CPU to win eventually? Or is it possible that my app is not optimal and could perform better on the GPU if I put in the effort?


Hi, thanks for reaching out! It’s hard to draw conclusions without numbers, but in general I would say that if your GPU performance drops close to that of your CPU, the GPU implementation is most likely no longer optimal.

Taking a server “Tesla” GPU and a standard multi-core server CPU, one can expect the GPU to have roughly an order of magnitude more memory bandwidth and to sustain more than an order of magnitude more arithmetic throughput (FLOP/s). So shifting from memory bound to compute bound should not change the overall picture, and may even increase the speed-up on the GPU.
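To make this argument concrete, here is a minimal roofline-style sketch. The hardware figures are hypothetical round numbers chosen only to reflect the "order of magnitude" statement above, not measurements of any particular device, and the function names are illustrative.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: performance is capped either by compute or by memory traffic.

    intensity is the arithmetic intensity of the kernel in flop per byte moved.
    """
    return min(peak_gflops, intensity * bandwidth_gbs)


# Hypothetical hardware figures (assumptions for illustration only).
CPU = {"peak_gflops": 1000.0, "bandwidth_gbs": 100.0}
GPU = {"peak_gflops": 15000.0, "bandwidth_gbs": 1000.0}


def roofline_speedup(intensity):
    """Ideal GPU/CPU speed-up at a given arithmetic intensity."""
    cpu = attainable_gflops(intensity, **CPU)
    gpu = attainable_gflops(intensity, **GPU)
    return gpu / cpu


for intensity in (0.25, 1.0, 10.0, 100.0):
    speedup = roofline_speedup(intensity)
    print(f"{intensity:6.2f} flop/byte -> ideal GPU/CPU speed-up {speedup:.1f}x")
```

Under these assumed numbers the ideal speed-up never falls below the bandwidth ratio, whichever side of the roofline the kernel sits on. A measured speed-up close to 1X therefore points to something the roofline does not capture, such as the items listed below.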

Possibly, the “more complex” code also puts much more pressure on the memory bus, given that you may actually need to read in many more numbers to sustain more arithmetic operations (unless some recursive operation is performed on the same data).

Potential reasons for a slowdown on GPUs include:

  • heavy use of math operations (floating-point exponentials, logarithms, etc.)
  • excessive branching (heavy use of if ... else conditions)
  • sub-optimal kernel launch parameters which could e.g. lead to register spill to global memory, and other side effects
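As a rough illustration of the third item, the sketch below estimates how many registers each thread can use before a single block no longer fits on one multiprocessor. It assumes a register file of 65,536 32-bit registers per multiprocessor and a cap of 255 registers per thread; these figures match several recent NVIDIA architectures but should be checked for your device. The function name is hypothetical, and real allocation happens per warp with some granularity, so treat the result as an upper bound.

```python
def registers_per_thread_budget(threads_per_block,
                                registers_per_sm=65536,
                                max_per_thread=255):
    """Upper bound on registers per thread so that one block fits on a multiprocessor.

    The default values are assumptions; check the specifications of your GPU.
    """
    return min(max_per_thread, registers_per_sm // threads_per_block)


for threads in (128, 256, 512, 1024):
    budget = registers_per_thread_budget(threads)
    print(f"{threads:5d} threads per block -> at most {budget} registers per thread")
```

A kernel that needs more registers than this budget has to keep the excess values in much slower memory (the "spill"), which is why a smaller block size can sometimes help a register-heavy kernel.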

One way to nail things down would be to start from the fast, optimal MWE code and carefully investigate the reasons for the performance drop on the GPU after each addition of complexity.
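One way to track this step by step is to compute an effective memory throughput after each change and compare it with the peak bandwidth of the device. The sketch below follows the idea of the effective throughput metric described in the ParallelStencil.jl documentation, as I understand it: count each updated array twice (one read, one write) and each read-only array once. The function name, argument names, and the example numbers are assumptions for illustration.

```python
def effective_throughput_gbs(n_cells, n_read_write, n_read_only,
                             bytes_per_number, seconds_per_iteration):
    """Effective memory throughput in GB/s for one iteration.

    n_read_write: number of arrays that are both read and updated each iteration
    n_read_only:  number of arrays that are only read
    """
    # Minimum number of bytes that must move per iteration.
    effective_bytes = (2 * n_read_write + n_read_only) * n_cells * bytes_per_number
    return effective_bytes / seconds_per_iteration / 1e9


# Hypothetical measurement: 512^3 cells, 4 updated arrays, 1 read-only array,
# Float64 data, 12 ms per iteration.
t_eff = effective_throughput_gbs(512**3, 4, 1, 8, 0.012)
print(f"T_eff = {t_eff:.0f} GB/s")
```

If this number stays near the peak bandwidth of the GPU for the MWE and then drops sharply after a particular addition, that addition is the place to look first.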

You may have a look at the following resources and see if some useful bits could be applied:


Thanks for your reply. What this means to me is that there is potentially some improvement to be had. I will try to put some effort into that in time, and I’ll take your advice on starting from an MWE code to see where the bottlenecks are. On the other hand, there might not be much to gain, given the things you pointed out:

  • I would say there is indeed heavy use of math operations, such as matrix inversion, multiplication, and root solving, at every grid point.
  • There is also quite a bit of branching going on (~30 or so conditional statements per grid cell)
  • no idea what the third one means to be honest, I’m not well versed in GPU lingo

I will take a look at those lectures, thanks.

Please look at this AMD Lab Notes post, which discusses register spilling.
