How does GPU programming work (Knet example)?

Can anyone give a high-level summary of how Julia and GPU programming work together? In what cases can we write Julia code and have it run on the GPU, and in what cases will it not work? I am confused.

I needed to clamp an image computed with Knet into the 0-1 range.
There are some existing clamp implementations, but for some reason they
did not work with the KnetArray datatype. Since the image data is a KnetArray, I expected to be able to write some new code and have it run on the GPU. I wrote this:

  function clamp!(img)
      res = size(img)
      len = reduce(*, res)
      @inbounds for i = 1:len
          v = img[i]
          if v > 1f0
              img[i] = 1f0
          elseif v < 0f0
              img[i] = 0f0
          end
      end
      return img
  end
It worked, but took about 20 seconds for a 300^2 image,
whereas one of the existing clamp calls takes a fraction of a second
when applied to an Array{Float32}. This is very slow: so slow
that I suspect the data is being pulled to the CPU to do the operation
and then pushed back?
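That suspicion matches how GPU array types generally behave: each scalar img[i] read or write forces a separate host-device round trip (or a tiny kernel launch), so an element-by-element loop is extremely slow. A broadcast formulation does the whole operation in one pass. Here is a minimal sketch (the helper name clamp01! is mine, not Knet API), shown on a plain Array{Float32}; the same whole-array expression is the kind that a GPU array type can dispatch to a single kernel:

```julia
# Clamp to [0, 1] expressed as whole-array broadcasts rather than an
# element loop. On a plain Array this is one fused CPU pass; for GPU
# array types that support broadcasting, the same expression can run
# as a single device kernel instead of per-element transfers.
function clamp01!(img)
    img .= min.(max.(img, 0f0), 1f0)
    return img
end

img = Float32[-0.5 0.3; 1.7 1.0]
clamp01!(img)   # → Float32[0.0 0.3; 1.0 1.0]
```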

I thought that Julia has a compiler that outputs GPU code,
but looking at the source for Knet’s conv4 function,
it actually calls cuDNN.

This situation confuses me. Can anyone give a simple explanation of what does and does not work in terms of having Julia programs run on the GPU? “Simple”, meaning, for someone who does not know how compilers work.

When you hear people talking about running Julia code on GPUs, it’s with CUDAnative and CuArrays. I don’t think that’s compatible with KnetArrays, which are an internal datatype to Knet and utilize hardcoded CUDA kernels written in CUDA C++. You can see from the source what it’s creating.

While this could in theory be interoperable with the “standard” Julia CUDA stack, I am not sure if there’s an easy way to do it.
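For context, the CUDAnative route looks roughly like this: you write the element loop as a kernel in plain Julia and launch it on a CuArray. This is an illustrative sketch only, not Knet API, and it requires a CUDA-capable GPU with the CUDAnative and CuArrays packages installed, so it cannot run on a CPU-only machine:

```julia
using CUDAnative, CuArrays  # requires a CUDA-capable GPU

# One GPU thread clamps one element; this is the hand-written
# counterpart of the CPU for-loop in the original clamp!.
function clamp_kernel!(img)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(img)
        @inbounds img[i] = min(max(img[i], 0f0), 1f0)
    end
    return nothing
end

a = cu(rand(Float32, 300, 300))
@cuda threads=256 blocks=cld(length(a), 256) clamp_kernel!(a)
```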


Thank you.

Two more follow-up questions:

  1. Does Flux use CuArrays?

  2. For someone who knows Knet: is there a “rule” for which Julia programs working with KnetArrays will produce good GPU code and which will not?


As Chris mentioned, Knet uses a combination of handcrafted kernels and CUDA library calls to implement most of Julia’s array interface. As a general rule of thumb, built-in functions that work on the whole array (with or without broadcasting) should work on a KnetArray (e.g. tanh.(a) or a .+ b or norm(a) etc.). Anything that requires a for loop over array elements will need either a custom CUDA kernel or to be expressed in terms of other array functions.
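To make that rule of thumb concrete, here is the distinction with plain Arrays standing in for KnetArrays (an illustration, not Knet-specific code): the whole-array forms are the ones that map onto predefined kernels, while an explicit element loop is the shape that does not.

```julia
a = Float32[1.0, 2.0, 3.0]
b = Float32[0.5, 0.5, 0.5]

# Whole-array operations: forms that a GPU array type can serve with
# a single predefined kernel per expression.
t = tanh.(a)        # element-wise broadcast
s = a .+ b          # broadcast addition
n = sum(abs2, a)    # whole-array reduction

# By contrast, an element loop like
#     for i in eachindex(a); a[i] = f(a[i]); end
# indexes scalars one at a time, which on a GPU array means a
# transfer or kernel launch per element.
```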

Finally, you don’t have to use KnetArray with Knet; you can use CuArrays with a slight performance penalty: none of the model-building, training, optimization, gradient etc. code in Knet is (or should be :)) KnetArray specific.


That last fact is very interesting (that CuArrays can also be used).
I will see if that can speed up the code that I mentioned.