[blog post] Introduction to GPU programming


For anyone looking for a simple guide to programming the GPU, here it is:

An Introduction to GPU Programming in Julia

You can run the code online, without installing anything or needing a GPU! I hope that lowers the usually high entry barrier quite a bit :slight_smile:



Very cool. Might be worth adding a note that CLArrays at least isn't yet working on Julia 1.0, I think. I fired up a REPL to try to follow along, but wasn't able to: I don't have an NVIDIA GPU, and `]add CLArrays` failed to find a version compatible with Julia 1.0.

Ah yeah sorry - I should! Well, you can directly execute & edit the code cells in the article, so I hope that gets you going :slight_smile:

Very cool post. Minor hiccup:

Array with mutable elements is basically a linked list,

should probably read:

Array with mutable elements is basically an array of pointers


Pretty awesome stuff. One thing that isn't clear: is writing a loop fast on the GPU, or does the operation have to be vectorized for it to be fast?

Basically, `sum` will be fast, but is writing a loop to sum every element of the array as fast? I can test this particular case, but are there principles that I can follow in general?
Also, how do I invoke the GPU’s sorting functions to sort a large vector?


I have access to two Tesla K80s… never got a chance to use them. Thanks for this post, I’ll try it out on our cluster.


Any function you write for the GPU will get executed in parallel; there is almost no opting out of that. So even the most basic kernel call on the GPU is in fact already a loop, since it will always get scheduled to execute a number of times in parallel, each with a different invocation index. This sets up a different context for the whole “do I need vectorization” discussion. Of course, vectorization is a very easy way to profit from this execution model, but you might as well write a GPU function that uses loops and is fast - as long as you can run that function a few hundred times in parallel!
So you could indeed write a loop over a large array in parallel, and have smaller sum loops inside the big loop, and it’s fast :wink:
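To illustrate that pattern, here is a sketch only (assuming the `gpu_call`/`linear_index` API from GPUArrays that appears later in this thread; `rowsum!` and `kernel` are illustrative names): each parallel invocation handles one row and runs its own small sequential sum loop inside the kernel.

```julia
using GPUArrays, CuArrays

# Sketch: sum each row of B into out, one parallel invocation per row.
function rowsum!(out::GPUArray, B::GPUArray)
    function kernel(state, out, B)
        i = linear_index(state)          # which parallel invocation am I?
        if i <= length(out)
            acc = zero(eltype(out))
            for j in 1:size(B, 2)        # small sequential loop inside the kernel
                @inbounds acc += B[i, j]
            end
            @inbounds out[i] = acc
        end
        return
    end
    gpu_call(kernel, out, (out, B))
    return out
end
```

The outer “loop” over rows is implicit in the parallel launch; only the inner per-row loop is written explicitly.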

Also how do i invoke the GPU’s sorting functions to sort a large vector?

Implement it :frowning: Or wrap a library that already does it. Radix sort would be a nice start - there should be plenty of open-source GPU kernels one could build upon!
Contributions are more than welcome :slight_smile:

@sdanisch , do you know about this question in the linked thread:

" Yes but can you do that from Julia without writing CUDA/OpenCL code?

I’m happy that Julia supports GPU programming for simple code, but I don’t see how you can run algorithms with inter-thread communication."

I answered at hackernews:

I am confused about this. So I can write an alternating sum function as below:

function altsum(x)
  res = zero(eltype(x))
  d = 1.0
  for x1 in x
    res = res + d * x1
    d = d * -1.0
  end
  return res
end
When I run that, what gets executed in parallel? The inner loop? What sort of thinking do I have to do when thinking about GPUs?

using CuArrays, GPUArrays
x = rand(1_000_000)
gpux = CuArray(x)
@time altsum(x)
@time altsum(gpux)

You could actually read the blog post, where I explain exactly that in the section Writing GPU Kernels :stuck_out_tongue:

using GPUArrays, CuArrays
# Overloading the Julia Base map! function for GPUArrays
function Base.map!(f::Function, A::GPUArray, B::GPUArray)
    # our function that will run on the gpu
    function kernel(state, f, A, B)
        # If launch parameters aren't specified, linear_index gets the index
        # into the Array passed as second argument to gpu_call (`A`)
        i = linear_index(state)
        if i <= length(A)
            @inbounds A[i] = f(B[i])
        end
        return
    end
    # call kernel on the gpu
    gpu_call(kernel, A, (f, A, B))
    return A
end

Let’s try to figure out what this is doing! In simple terms, this will call the Julia function kernel length(A) times in parallel on the GPU. Each parallel invocation of kernel has a thread index, which we can use to safely index into the arrays A and B. If we calculated our own indices instead of using linear_index, we’d need to make sure that we don’t have multiple threads reading and writing to the same array locations. So, if we wrote this in pure Julia with threads, an equivalent version would look like this:
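The threaded CPU version itself didn't make it into the quote above; a sketch of what it might look like (`threaded_map!` is an illustrative name, not from the original post):

```julia
using Base.Threads

# CPU sketch of the same pattern: @threads plays the role of the implicit
# parallel kernel launch, and each thread writes only to its own index i,
# so no two threads touch the same array location.
function threaded_map!(f::Function, A::Array, B::Array)
    @threads for i in 1:length(A)
        @inbounds A[i] = f(B[i])
    end
    return A
end
```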

I don’t think I can comprehend it. I did read it, but it’s so far away from anything I’ve come across. Maybe you have worked with GPUs for so long that you don’t understand what others don’t understand.

Am I right in saying that you suggest we shouldn’t write that loop, and should instead use a kernel? It’s not obvious to me how that section relates to my question.

Maybe I should summarise where I am confused. I get the feeling from the post that any function that works on isbits arrays on the CPU will also work on the GPU. I also get that when you call a function on the GPU, it calls it “multiple times in parallel”. So when I call a function with a loop in it, does it run the whole function multiple times, or does it somehow figure out that it should parallelize the loop? Or is writing a loop inside a function a bad way to program GPUs?

Don’t mean to be rude. Just frustrated with my own inability to understand it…

Any kernel you want to call gets scheduled N times in parallel - that’s the implicit loop you won’t get rid of. Inside the kernel you can use whatever you want, including loops, which then get run in parallel :wink:

somehow figures out that it should parallelize the loop

No, you will need to remove the loop you want to parallelize and use the implicit loop via the parallel invocations instead.
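For the `altsum` example above, one way to do that (a sketch assuming CuArrays; `altsum_gpu` is an illustrative name) is to drop the sequential sign-flipping loop entirely: precompute an alternating sign vector, multiply elementwise (a parallel map), and reduce with `sum` (a parallel reduction):

```julia
using CuArrays

function altsum_gpu(x::CuArray)
    # alternating +1/-1 signs, built on the CPU and uploaded once
    signs = CuArray([isodd(i) ? 1.0f0 : -1.0f0 for i in 1:length(x)])
    # broadcast (parallel map), then sum (parallel reduction)
    return sum(x .* signs)
end
```

The point is that the dependency between loop iterations (`d = d * -1.0`) is what blocks parallelization; once each element's contribution depends only on its index, the GPU can process them all at once.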


The code where Flux.jl is used to fit the MNIST dataset with CuArrays.jl no longer works. And I can’t remix the journal; it gives an error:

No route mapping for route in c.n.editor/page