Can we make a comparison/overview of different GPU computing implementations?

Hello,

I am overwhelmed by the sheer number of GPU computing libraries.
Every day I look into it I find a new implementation. So far I have found:

  • CuArrays
  • CUDAnative
  • ArrayFire
  • Knet.Arrays
  • CLArrays
  • AMDGPUnative
  • Vulkan

and I am sure I have missed some implementations.
I find this extremely confusing. Have I missed a resource which lists and explains the differences?

If there is no such overview, I welcome your ideas on what questions to ask for the comparison.
It is also obvious that certain libraries cannot be compared along some dimensions (e.g. the speed of AMDGPUnative vs. CUDAnative, since they target different hardware, or the interfaces of CUDAnative vs. CuArrays, since they do different things within the same ecosystem).

Any attempt at an overview which explains the role of each package in its ecosystem and in comparison to other implementations is welcome.

3 Likes

Sure, here’s the overview I have in my head (with some extras added to clarify how the ecosystem fits together):

NVIDIA (CUDA):

  • CuArrays
    – GPUArrays implementation
  • CUDAnative
    – Codegen library, generates PTX from Julia code
  • CUDAdrv and CUDAapi
    – Supporting libraries for the above (memory management, C wrappers for CUDA C libs, etc.)
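To make the split concrete, here is a minimal sketch of what each layer is for (assuming the CuArrays/CUDAnative APIs of this era; the kernel name is just for illustration):

```julia
using CuArrays, CUDAnative

xs = CuArray(rand(Float32, 1024))   # CuArrays: the user-facing array type
ys = 2f0 .* xs .+ 1f0               # array-level code, handled by the GPUArrays/CuArrays broadcast machinery

# CUDAnative: write and launch the equivalent kernel yourself, using @cuda and
# the device intrinsics (blockIdx, blockDim, threadIdx).
function scale_kernel(out, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    out[i] = 2f0 * a[i] + 1f0
    return
end

zs = similar(xs)
@cuda threads=256 blocks=4 scale_kernel(zs, xs)   # 256 * 4 threads cover all 1024 elements
```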

AMD (HSA/ROCm):

  • HSAArrays
    – WIP GPUArrays implementation
  • AMDGPUnative
    – Codegen library, generates GCN from Julia code
  • HSARuntime
    – C wrappers for HSA runtime, loads and launches kernels, and provides HSAArray for above packages

OpenCL/Other (covers “all” GPUs and CPUs in some fashion):

  • GPUArrays
    – Provides common functions and a framework for implementing GPU-compatible arrays and operations
  • CLArrays
    – GPUArrays implementation (BROKEN on Transpiler)
  • Transpiler
    – Generates OpenCL C kernels from Julia code (BROKEN on Julia > 0.6)
  • OpenCL
    – C wrappers to the OpenCL runtime, also integrates OpenCL events with Julia’s libuv, and other nice things
  • ArrayFire
    – Common functions (and array types?) for OpenCL (and CUDA?) GPU computing; simple to use, but not always the most performant, and it doesn’t have the same features that the *native and Transpiler packages can provide
  • Vulkan
    – A wrapper around the Vulkan standard for doing graphics “stuff” usually with GPUs, ingests SPIR-V as its IR and lingua franca, not actively maintained in Julia (lots of work…)
  • Knet.Arrays
    – No idea! Someone please fill me in on this one :slightly_smiling_face:

Extra Details:

  • Transpiler needs some time, love, and care if we want CLArrays to work again. Totally possible to fix, just not easy.
  • AMDGPUnative will follow CUDAnative’s lead in almost every case, and the two should have similar codegen performance and behavior in the long term.
  • The user-facing packages most people will deal with are CuArrays and (eventually) HSAArrays (and maybe CLArrays again one day), or CUDAnative, AMDGPUnative, and OpenCL if you want to directly write kernels in Julia/OpenCL C.
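To illustrate the last point, writing a kernel directly in OpenCL C through OpenCL.jl looks roughly like this (a sketch based on the OpenCL.jl README-style API from around this time; the exact calls have changed across versions, so treat them as an assumption):

```julia
using OpenCL

const vadd_src = "
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c) {
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}"

a, b = rand(Float32, 1024), rand(Float32, 1024)

device, ctx, queue = cl.create_compute_context()

a_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=a)
b_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=b)
c_buff = cl.Buffer(Float32, ctx, :w, length(a))

prog = cl.Program(ctx, source=vadd_src) |> cl.build!
k    = cl.Kernel(prog, "vadd")

queue(k, size(a), nothing, a_buff, b_buff, c_buff)   # enqueue the kernel
c = cl.read(queue, c_buff)                           # copy the result back to the host
```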

Hopefully this helps; let me know if something is wrong or needs further clarification :smile:

11 Likes

Source: karray.jl

1 Like

Ok let me summarize for compute:

GPUArrays.jl provides an abstract interface which is implemented by

KnetArrays
CLArrays
HSAArrays
CuArrays
ArrayFire

Some of these implementations are partial or broken. Are there some general heuristics for which library to pick when?

The libraries

AMDGPUnative
CUDAnative

implement the same interface. CUDAnative is more developed since it is older. OpenCL provides a similar level of abstraction but is not related to the VENDORnative family.

Did I get that right?

Here are some other questions:
Is there a performance difference between passing GPU arrays to a VENDORnative kernel and broadcasting a function over a VENDORArrays array?

using CuArrays, CUDAnative

xs, ys, zs = CuArray(rand(1024)), CuArray(rand(1024)), CuArray(zeros(1024))

function kernel_vadd(out, a, b)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    out[i] = a[i] + b[i]
    return
end

@cuda threads=256 blocks=4 kernel_vadd(zs, xs, ys)  # launch covering the 1024 elements

VS

# out, a, b, c are CuArrays
out .= a .+ b ./ c .+ 1
# the line above lowers to a single fused broadcast! call (one kernel launch):
broadcast!(out, a, b, c) do a, b, c
    a + b / c + 1
end

How does VENDORArrays decide how many threads to spawn on the GPU?
When to choose VENDORnative over VENDORArrays?
When to choose ArrayFire over the vendor ecosystem?
How compatible are CUDA and AMDGPU and why can’t they come in the same package?

When you want a cross-platform implementation that works the same on CUDA, OpenCL, or even the CPU, and when you don’t want to deal with low-level details or write your own kernels.

The downside is that if you need something custom and it is not already in the library you are out of luck.
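A rough sketch of what that looks like with ArrayFire.jl (assuming the AFArray API of this era; treat the details as approximate):

```julia
using ArrayFire

a = rand(AFArray{Float32}, 1024, 1024)   # allocated on whichever backend ArrayFire was built with (CUDA/OpenCL/CPU)
b = AFArray(rand(Float32, 1024, 1024))   # or move an existing host array over
c = a * b                                # library-provided matrix multiply; no custom kernels possible here
d = Array(c)                             # copy the result back to the host
```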

1 Like

GPUArrays.jl is actually only used by CuArrays/CLArrays. It’s also not fully abstract, i.e., one of the original design goals by @sdanisch was to be able to instantiate a GPUArray. I’ve been moving away from that design a little, and I hope to clean the package up some more in order to make it more reliable for use by other GPU back-ends (esp. the one for AMD).

Availability of hardware / software. In theory, each of these packages should just implement the Base array interface, and you should be able to write generic array code and move between implementations. Of course, we’re not quite there yet.
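For example, generic code like the following should (ideally) run on the CPU or the GPU depending only on the array type you pass in (a sketch; sumsq and unitize are just illustrative names, and coverage of specific operations varies per back-end):

```julia
using CuArrays

# Generic code: nothing here mentions the GPU.
sumsq(xs) = sum(abs2.(xs))
unitize(xs) = xs ./ sqrt(sumsq(xs))

unitize(rand(Float32, 1024))            # runs on the CPU with a plain Array
unitize(CuArray(rand(Float32, 1024)))   # the same code runs on the GPU with a CuArray
```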

The current broadcast implementation has some overheads, because of how it pads dimensions (run-time checks on keeps and defaults). This works out fine for CPUs, but is somewhat costly on GPUs. We could probably customize broadcast to specialize on the container shapes, and that would be something that would fit great in GPUArrays.jl. (I’m not sure, after reading #32051)

Other than that, at least in the case of CuArrays, you end up with regular CuDeviceArrays on the device, regardless of writing your own kernel vs. broadcasting over a CuArray.
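If I recall correctly, you can see that conversion directly with cudaconvert (name as of CUDAnative 2.x; treat the exact call as an assumption):

```julia
using CuArrays, CUDAnative

xs = CuArray(rand(Float32, 16))

# @cuda converts host-side arguments to their device-side counterparts before
# launching; for a CuArray that counterpart is a CuDeviceArray.
typeof(CUDAnative.cudaconvert(xs))   # => CuDeviceArray{Float32,1,...}
```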

That’s an underdeveloped part of CuArrays. IIRC, it just defaults to 256 threads and pads the blocks for that (we really need to do an occupancy analysis there). For some calls, where we have expensive kernels that might use lots of registers (e.g. reduce), we do something more sophisticated, but still no occupancy analysis.
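In other words, the default heuristic amounts to something like this (a hypothetical helper for illustration, not CuArrays’ actual code):

```julia
# Hypothetical sketch of the default heuristic described above: fix the number
# of threads per block and pad the number of blocks to cover all elements.
function launch_config(n; threads = 256)
    blocks = cld(n, threads)   # ceiling division, so threads * blocks >= n
    return (threads = threads, blocks = blocks)
end

launch_config(1_000_000)   # => (threads = 256, blocks = 3907)
```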

When you need the flexibility of kernel programming. If you can rephrase a problem in terms of array expressions, do so. It’s much easier, you get kernel fusion out of the box, dispatch to CUBLAS and friends whenever possible, etc.
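For example (a sketch with CuArrays; the point is that none of it requires writing a kernel by hand):

```julia
using CuArrays

A = CuArray(rand(Float32, 512, 512))
B = CuArray(rand(Float32, 512, 512))

C = A * B               # dispatches to CUBLAS (gemm) behind the scenes
D = C .* 2f0 .+ 1f0     # the whole broadcast fuses into a single GPU kernel
s = sum(D)              # reductions are provided too, no hand-written kernel needed
```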

In terms of the underlying software stack: not compatible at all. Concerning the implementations, @jpsamaroo has expressed interest in sharing code between CUDAnative/AMDGPUnative (and presumably the array infrastructure that would be built on top of that). I haven’t had the time to look into that, but I’m all for sharing code between back ends whenever possible.

7 Likes

@everyone Thank you for answering my question so well. I hope my questions help other people who want to get into GPU computing too.

3 Likes