How to get started with GPU programming? OpenCL or CUDA?


#1

Hello,
Thank you for taking the time to read my post; I really appreciate it! I am currently a Windows-based C# developer, but I have been learning Julia for some data research and personal projects. I want to learn how to modify my applications to use a GPU, but I am stuck right at the beginning.

My current setup is Windows 10 plus an AMD R9 290, which makes me lean towards OpenCL.jl. However, there seems to be a lot more support for CUDA in general, and I really like how the CUDAnative.jl package sounds. Although I program in C#, not having to write my kernels in C/C++ would be a bonus to me. I’m hesitant to switch to an NVIDIA GPU: my end application is financial models, so I am assuming that I will want to use double precision, and AMD consumer cards seem to excel here.

So I guess my main questions are:

  1. If you do a lot of GPU programming, is there no way around writing kernels in C/C++, so CUDA vs OpenCL is just a different API?
  2. If OpenCL uses LLVM, can’t Julia compile to it directly? Could an OpenCLnative.jl be made? (sorry if this is a novice/stupid question)
  3. AMD ROCm (http://developer.amd.com/tools-and-sdks/radeon-open-compute-platform/): what is this? It seems to be a subset of C++ that compiles (via LLVM) your code differently depending on your hardware? Is this useful, or is it a compilation path a Julia package could take? It is open source, so could Julia wrap it maybe?
  4. Even if it isn’t Julia based, is there a good tutorial or course you would recommend to get started? I don’t mind if it is in Linux.
  5. Half, single, and double precision: how important are these speeds, and are they application/field dependent? Does double precision use more GPU memory, so that memory could become the bottleneck?

Thank you for reading,
Thomas


#2

Depending on what is most important to you, the choice between NVIDIA (CUDA) and OpenCL is simple:

Scenario 1: "I want my code to run anywhere, not only on hardware provided by a specific vendor."
Answer: OpenCL

Scenario 2: "I want to hack a single graphics card as much as I can, and I don’t care if the software doesn’t work on other hardware."
Answer: NVIDIA

Regardless of the scenario you fit in, the good news is that people are working on packages and higher-level abstractions in Julia that aim to leverage GPUs in a portable manner. Hopefully, they will see your question and give you a more detailed explanation of what is going on in JuliaGPU. I can tell you they are doing a great job!

Following.


#3

Do you have special hardware for this? NVIDIA Tesla GPUs? Most GPUs are crippled for double precision, so you should only use double precision if you have the correct hardware. In most cases, consumer GPUs have 32x slower double precision than single precision. GPU memory does matter a lot, but this throughput difference is more likely to be the problem.
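On the memory side of the question, here is a rough sketch in plain Julia (CPU only, nothing GPU-specific is assumed) of how the storage for the same number of elements scales with precision. `array_bytes` is a hypothetical helper written just for this illustration.

```julia
# Hypothetical helper: bytes needed to store n elements of type T.
array_bytes(::Type{T}, n::Integer) where {T} = sizeof(T) * n

n = 1_000_000
for T in (Float16, Float32, Float64)
    # Float64 needs twice the bytes of Float32, which needs twice those of Float16.
    println(rpad(string(T), 8), array_bytes(T, n) / 2^20, " MiB")
end
```

So double precision does halve the number of values that fit in a given amount of GPU memory, independently of the throughput difference described above.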

What kind of financial models? Optimization problems? Machine learning? Those are very robust to using single precision. SDE models? Those can use single precision, but in many cases do better with double precision.

Not necessarily. Many problems can be handled exclusively by GPUArrays.jl
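To give an idea of what "handled by arrays" can look like, here is a minimal sketch written only with broadcasting and a reduction, run here on ordinary CPU arrays. The function name, the parameters, and the toy payoff model are all made up for illustration. Whether every call in it is supported by a particular GPU array backend is an assumption you would need to check.

```julia
using Statistics  # for mean

# Mean payoff of a call option under a toy lognormal model.
# z holds standard normal draws; the element type of z decides the precision.
function mean_call_payoff(z::AbstractVector{T}; s0 = 100, strike = 90,
                          rate = 0, vol = 0.2, horizon = 1) where {T<:AbstractFloat}
    s0_T, strike_T = T(s0), T(strike)
    drift = T((rate - vol^2 / 2) * horizon)
    scale = T(vol * sqrt(horizon))
    # Element-wise work only: no explicit loop and no hand-written kernel.
    terminal = s0_T .* exp.(drift .+ scale .* z)
    return mean(max.(terminal .- strike_T, zero(T)))
end

z32 = Float32[-1.0, 0.0, 1.0]   # fixed stand-in for random draws
z64 = Float64.(z32)
p32 = mean_call_payoff(z32)     # computed in single precision
p64 = mean_call_payoff(z64)     # computed in double precision
println(p32, "  ", p64)
```

Because the function is generic over the array and element type, the same code lets you compare single and double precision before deciding which one your model needs.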

“In theory it can happen” is a long way from “works great!”

See the CUDArt.jl README.


#4

Do know that CUDArt isn’t that well maintained (with quite a few known bugs and incompatibilities), and other packages like CUDAdrv and CuArrays are where new features are being developed. But CUDArt’s documentation is still better, and it has some more features.

As @ChrisRackauckas mentions, you might be able to express your computations with array packages like GPUArrays or CuArrays. Alternatively, if you need raw performance or flexibility, you need to write your own kernels, which you currently can only do natively from Julia with CUDAnative.jl.

OpenCL tooling is IMO too fragmented for us to target it natively. Check out Transpiler.jl for a source-to-source alternative, i.e., one that compiles Julia to OpenCL code.


#5

Thank you all for your responses so far :grinning:

This seems like a good reason to continue to explore OpenCL.

I got lucky having an R9 290, with 4800 GFLOPs single precision and 600 GFLOPs double precision (a 1/8 ratio). That is one reason why I did not want to switch to NVIDIA hardware, as I would need to get a Titan or Tesla for similar dp performance.

Model-wise, at this point I’m not really sure what I would need to run. But for now, Monte Carlo/optimization for portfolio distribution and risk analysis probably seems like the low-hanging fruit.

So I could get started by using GPUArrays.jl, as it will take care of the backend stuff: CUDAnative.jl for NVIDIA, and Transpiler.jl for AMD. Then if I need further performance I could look into writing my own kernels.

Thanks for helping to clarify things for a beginner!
-Thomas


#6

AMD does indeed offer very good double precision performance on most of its GPUs. But for the R9 that’s still a ratio of 1/8. That is much better than what NVIDIA gives you, apart from a couple of selected cards (even a professional card for ~$7000, the M6000, gives you a 1/34 perf ratio), but double precision is obviously still quite a bit slower than single (source).

Sadly, @maleadt is right about OpenCLnative, but maybe we can target ROCm, which also has an LLVM-based IR. We still need to evaluate the maturity of that tool chain - if it’s in a good state, @vchuravy might soonish start working on it.

Meanwhile, Transpiler.jl is a good stopgap solution.
When you program with GPUArrays, you can just write normal Julia functions and feed them to gpu_call; depending on which backend you choose, it will use Transpiler.jl or CUDAnative.
Let me write down some documentation for gpu_call here and add it to the repository later:

# Signature: global_size corresponds to CUDA blocks, local_size to CUDA threads
gpu_call(kernel::Function, DispatchDummy::GPUArray, args::Tuple, global_size = length(DispatchDummy), local_size = nothing)
with kernel looking like this:

function kernel(state, arg1, arg2, arg3) # args get splatted into the kernel call
    # state always gets passed as the first argument and is needed to offer the same
    # functionality across backends, even though they have very different ways of getting e.g. the thread index
    # arg1 can be any gpu array - this is needed to dispatch to the correct intrinsics.
    # if you call gpu_call without any further modifications to global/local size, this should give you a linear index into
    # DispatchDummy
    idx = linear_index(state, arg1::GPUArray) 
    arg1[idx] = arg2[idx] + arg3[idx]
    return #kernel must return void
end
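To make the calling convention above easier to picture, here is a CPU-only sketch of what such a call does conceptually: the kernel is invoked once per linear index, with a state value as its first argument. `cpu_call` and `cpu_linear_index` are hypothetical stand-ins invented for this illustration and are not the GPUArrays API. On a real backend the iterations run in parallel and the state is supplied by the backend.

```julia
# Hypothetical stand-in: here the "state" is simply the current linear index.
cpu_linear_index(state::Int, A::AbstractArray) = state

# Hypothetical stand-in for a gpu_call-like launcher, run serially on the CPU.
function cpu_call(kernel::Function, A::AbstractArray, args::Tuple,
                  global_size = length(A))
    for i in 1:global_size
        kernel(i, args...)   # args get splatted into the kernel call
    end
    return nothing
end

# Same shape as the kernel above: one element of work per invocation.
function add_kernel(state, out, a, b)
    idx = cpu_linear_index(state, out)
    out[idx] = a[idx] + b[idx]
    return
end

a = Float32[1, 2, 3, 4]
b = Float32[10, 20, 30, 40]
out = similar(a)
cpu_call(add_kernel, out, (out, a, b))
println(out)
```

The kernel body never loops over the array itself; it only handles the element at its own index, which is the part that carries over to a GPU.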

A few words about the Julia code that will work:

  • working with immutable isbits types (types not containing pointers) should be completely supported
  • non-allocating code (nothing like x = [1, 2, 3]; note that tuples are isbits, so x = (1, 2, 3) works)
  • Transpiler/OpenCL has problems with putting GPU arrays into a struct, so no views and actually no multidimensional indexing -.- since for that the size is needed, which would have to be part of the array struct. I’m working on a fix for that!
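A quick way to check the isbits requirement from the list above on the CPU, assuming a Julia 1.x session where `isbitstype` and `isbits` are available in Base:

```julia
# Plain-data types: fixed size, no pointers to heap memory.
@show isbitstype(Float32)
@show isbitstype(NTuple{3,Int})
@show isbitstype(Complex{Float32})

# Heap-allocated types are not isbits.
@show isbitstype(Vector{Int})
@show isbitstype(String)

# isbits checks a value rather than a type.
x = (1, 2, 3)
@show isbits(x)
```

This only tells you whether a type is plain data; it does not guarantee that a given backend can compile code using it.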

#7

“Which is one reason why I did not want to switch to NVIDIA hardware, as I would need to get a Titan or Tesla for similar dp performance.”

There are workarounds if you do not have good dp performance. I’m not sure if the double-double trick or this implementation:

also works on GPUs (the concept also works as “single-single”). It should(?), but maybe GPUs have performance issues?

[The PlayStation 3 (with the Cell processor, if I recall) had both sp and dp, just with way better sp. It had a BLAS implementation that, I believe, used this trick to get dp results with sp.]
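For readers who have not seen the trick, here is a minimal CPU sketch of the “single-single” idea: a number is carried as a pair of Float32 values (hi, lo), and error-free transformations keep the rounding error that a plain Float32 addition would throw away. The function names are my own, and this is only an illustration of the concept, not the implementation referred to above, and it says nothing about how well it performs on a GPU.

```julia
# Knuth's TwoSum: s is the rounded sum, e the exact rounding error (a + b == s + e).
function two_sum(a::Float32, b::Float32)
    s = a + b
    bv = s - a
    e = (a - (s - bv)) + (b - bv)
    return s, e
end

# Dekker's FastTwoSum: same idea, valid when |a| >= |b|.
function fast_two_sum(a::Float32, b::Float32)
    s = a + b
    e = b - (s - a)
    return s, e
end

# Add a Float32 value to a (hi, lo) pair and renormalize the pair.
function ds_add(hi::Float32, lo::Float32, x::Float32)
    s, e = two_sum(hi, x)
    e += lo
    return fast_two_sum(s, e)
end

# Plain left-to-right Float32 accumulation, for comparison.
function naive_sum(xs::Vector{Float32})
    acc = 0.0f0
    for x in xs
        acc += x
    end
    return acc
end

# Accumulate using only Float32 arithmetic; widen only to report the result.
function single_single_sum(xs::Vector{Float32})
    hi, lo = 0.0f0, 0.0f0
    for x in xs
        hi, lo = ds_add(hi, lo, x)
    end
    return Float64(hi) + Float64(lo)
end

xs = fill(0.1f0, 1_000_000)
reference = sum(Float64.(xs))            # Float64 reference value
err_naive = abs(naive_sum(xs) - reference)
err_pair = abs(single_single_sum(xs) - reference)
println("naive Float32 error: ", err_naive)
println("single-single error: ", err_pair)
```

The price is several extra floating-point operations per addition, so whether this beats native dp depends on the sp/dp ratio of the card.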


#8

I flagged this conversation while I was travelling, so I’m coming late to the game. I agree with the answers above and, of course, @sdanisch is one of the Julia experts on GPU programming.

I’d just add that, given your above answers to the comments,

  1. Stick with OpenCL for now. CUDA will potentially give marginally better performance (it will on an NVIDIA card), but you have a faster card right now which does not support CUDA.
  2. Julia removes most of the harder parts of doing OpenCL. The OpenCL package does most of the verbose C stuff for you; you just need to write a kernel.
  3. Don’t be afraid of kernel writing; it’s basically the core of your for loop. It’s the sending of stuff (memory) to and from the device that’s difficult, and Julia’s OpenCL package will do this for you.
  4. The Julia abstractions of some core GPU functionality (GPUArrays, etc.) are useful but should largely be treated as a separate approach from writing a kernel [I’m giving starting-out advice here; mixing the paradigms is going to get confusing].
  5. Your comments about LLVM code generation are correct, but this will appear with Vulkan, the update to OpenCL that we’re all hoping for. So I wouldn’t waste too much time trying to get it to work with the current Julia/LLVM system; Vulkan will be a big change and much more worth targeting.

If it can be of any help, please help yourself liberally to code from two workshops that I ran this year: