[blog post] Introduction to GPU programming


For anyone looking for a simple guide to programming the GPU, here it is:

An Introduction to GPU Programming in Julia

You can run the code online, without installing anything or needing a GPU! I hope that lowers the usually high entry barrier quite a bit :slight_smile:



Very cool. Might be worth adding a note that CLArrays at least isn't yet working on Julia 1.0, I think. I fired up a REPL to try to follow along, but wasn't able to: I don't have an NVIDIA GPU, and `]add CLArrays` failed to find a version compatible with Julia 1.0.

Ah yeah sorry - I should! Well, you can directly execute & edit the code cells in the article, so I hope that gets you going :slight_smile:

Very cool post. Minor hiccup:

Array with mutable elements is basically a linked list,

should probably read:

Array with mutable elements is basically an array of pointers


Pretty awesome stuff. One thing that isn't clear: is writing a loop fast on the GPU, or does the operation have to be vectorized for it to be fast?

Basically, `sum` will be fast, but is writing a loop to sum every element of the array as fast? I can test this particular case, but are there principles that I can follow in general?
Also, how do I invoke the GPU’s sorting functions to sort a large vector?


I have access to two Tesla K80s… never got a chance to use them. Thanks for this post, I’ll try it out on our cluster.


Any function you write for the GPU will get executed in parallel; there is almost no opting out of that. So even the most basic kernel call on the GPU is in fact already a loop, since it will always get scheduled to execute a number of times in parallel, each with a different invocation index. This sets up a different context for the whole “do I need vectorization” discussion. Of course, vectorization is a very easy way to profit from this execution model, but you might as well write a GPU function that uses loops and is fast - as long as you can run that function a few hundred times in parallel!
So you could indeed write a loop over a large array in parallel, and have smaller sum loops inside the big loop, and it’s fast :wink:
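To illustrate that pattern, here is a sketch only (assuming the `gpu_call`/`linear_index` API from GPUArrays that appears later in this thread; `rowsum!` and `kernel` are illustrative names): each parallel invocation handles one row and runs its own small sequential sum loop inside the kernel.

```julia
using GPUArrays, CuArrays

# Sketch: sum each row of B into out, one parallel invocation per row.
function rowsum!(out::GPUArray, B::GPUArray)
    function kernel(state, out, B)
        i = linear_index(state)          # which parallel invocation am I?
        if i <= length(out)
            acc = zero(eltype(out))
            for j in 1:size(B, 2)        # small sequential loop inside the kernel
                @inbounds acc += B[i, j]
            end
            @inbounds out[i] = acc
        end
        return
    end
    gpu_call(kernel, out, (out, B))
    return out
end
```

The outer “loop” over rows is implicit in the parallel launch; only the inner per-row loop is written explicitly.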

Also how do i invoke the GPU’s sorting functions to sort a large vector?

Implement it :frowning: Or wrap a library that already does it. Radix sort would be a nice start - there should be plenty of open-source GPU kernels one could build upon!
Contributions are more than welcome :slight_smile:

@sdanisch , do you know about this question in the linked thread:

" Yes but can you do that from Julia without writing CUDA/OpenCL code?

I’m happy that Julia supports GPU programming for simple code, but I don’t see how you can run algorithms with inter-thread communication."

I answered at hackernews:

I am confused about this. So I can write an alternating sum function as below:

function altsum(x)
  res = zero(eltype(x))
  d = 1.0
  for x1 in x
    res = res + d * x1
    d = d * -1.0
  end
  return res
end
When I run that, what gets executed in parallel? The inner loop? What sort of thinking do I have to do when thinking about GPUs?

using CuArrays, GPUArrays
x = rand(1_000_000)
gpux = CuArray(x)
@time altsum(x)
@time altsum(gpux)

You could actually read the blog post, where I explain exactly that in the section Writing GPU Kernels :stuck_out_tongue:

using GPUArrays, CuArrays
# Overloading the Julia Base map! function for GPUArrays
function Base.map!(f::Function, A::GPUArray, B::GPUArray)
    # our function that will run on the gpu
    function kernel(state, f, A, B)
        # If launch parameters aren't specified, linear_index gets the index
        # into the Array passed as second argument to gpu_call (`A`)
        i = linear_index(state)
        if i <= length(A)
            @inbounds A[i] = f(B[i])
        end
        return
    end
    # call kernel on the gpu
    gpu_call(kernel, A, (f, A, B))
    return A
end

Let’s try to figure out what this is doing! In simple terms, this will call the Julia function kernel length(A) times in parallel on the GPU. Each parallel invocation of kernel has a thread index, which we can use to safely index into the arrays A and B. If we calculated our own indices instead of using linear_index, we’d need to make sure that we don’t have multiple threads reading and writing to the same array locations. So, if we wrote this in pure Julia with threads, an equivalent version would look like this:
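The threaded CPU version itself didn't make it into the quote above; a sketch of what it might look like (`threaded_map!` is an illustrative name, not from the original post):

```julia
using Base.Threads

# CPU sketch of the same pattern: @threads plays the role of the implicit
# parallel kernel launch, and each thread writes only to its own index i,
# so no two threads touch the same array location.
function threaded_map!(f::Function, A::Array, B::Array)
    @threads for i in 1:length(A)
        @inbounds A[i] = f(B[i])
    end
    return A
end
```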

I don’t think I can comprehend it. I did read it, but it’s so far away from anything I’ve come across. Maybe you have worked with GPUs for so long that you don’t understand what others don’t understand.

Am I right in saying that you suggest we shouldn’t write that loop, and should instead use a kernel? It’s not obvious to me how that section relates to my question.

Maybe I should summarise where I am confused. I get the feeling from the post that any function that works on isbits arrays on the CPU will also work on the GPU. I also get that when you call a function on the GPU, it calls it “multiple times in parallel”. So when I call a function with a loop in it, does it run the whole function multiple times, or does it somehow figure out that it should parallelize the loop? Or is writing a loop inside a function a bad way to program GPUs?

Don’t mean to be rude. Just frustrated with my own inability to understand it…

Any kernel you want to call gets scheduled N times in parallel - that’s the implicit loop you won’t get rid of. Inside the kernel you can use whatever you want, including loops, which then get run in parallel :wink:

somehow figures out that it should parallelize the loop

No, you will need to remove the loop you want to parallelize and use the implicit loop via the parallel invocations instead.
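For the `altsum` example above, one way to do that (a sketch assuming CuArrays; `altsum_gpu` is an illustrative name) is to drop the sequential sign-flipping loop entirely: precompute an alternating sign vector, multiply elementwise (a parallel map), and reduce with `sum` (a parallel reduction):

```julia
using CuArrays

function altsum_gpu(x::CuArray)
    # alternating +1/-1 signs, built on the CPU and uploaded once
    signs = CuArray([isodd(i) ? 1.0f0 : -1.0f0 for i in 1:length(x)])
    # broadcast (parallel map), then sum (parallel reduction)
    return sum(x .* signs)
end
```

The point is that the dependency between loop iterations (`d = d * -1.0`) is what blocks parallelization; once each element's contribution depends only on its index, the GPU can process them all at once.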


The code where Flux.jl is used to fit the MNIST dataset with CuArrays.jl no longer works. And I can’t remix the journal; it gives an error:

No route mapping for route in c.n.editor/page