Very cool. Might be worth adding a note that CLArrays at least isn't working on Julia 1.0 yet, I think. I fired up a REPL to try to follow along but wasn't able to, because I don't have an NVIDIA GPU and ]add CLArrays failed to find a version compatible with Julia 1.0.
Pretty awesome stuff. One thing that isn't clear: is writing a loop fast on the GPU, or does the operation have to be vectorized for it to be fast?
Basically, sum will be fast, but is writing a loop that sums every element of the array just as fast? I can test this particular case, but are there principles that I can follow in general?
Also, how do I invoke the GPU's sorting functions to sort a large vector?
Any function you write for the GPU will get executed in parallel; there is almost no opting out of that. So even the most basic kernel call on the GPU is in fact already a loop, since it will always get scheduled to be executed a number of times in parallel, each with a different invocation index. This sets up a different context for the whole "do I need vectorization" discussion. Of course vectorization is a very easy way to profit from this execution model, but you might as well write a GPU function that uses loops and is fast - as long as you can run that function a couple of hundred times in parallel!
So you could indeed write a loop over a large array in parallel, have smaller sum loops inside the big loop, and it's fast.
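To make that concrete, here is a minimal sketch reusing the gpu_call/linear_index API from the kernel quoted further down (colsum! and its layout are illustrative, not a library function): each invocation runs its own serial sum loop over one column, while the loop over columns happens implicitly through the parallel scheduling.

using GPUArrays, CuArrays

# One parallel invocation per column: the outer loop over columns is
# implicit in the GPU scheduling, while each invocation runs a plain
# serial loop over the rows of its column.
function colsum!(A::GPUArray, B::GPUArray)
    function kernel(state, A, B)
        col = linear_index(state)   # which column this invocation owns
        if col <= size(B, 2)
            acc = zero(eltype(B))
            for row in 1:size(B, 1) # ordinary serial loop, fine inside a kernel
                @inbounds acc += B[row, col]
            end
            @inbounds A[col] = acc
        end
        return
    end
    gpu_call(kernel, A, (A, B))
    A
end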
Also, how do I invoke the GPU's sorting functions to sort a large vector?
Implement it, or wrap a library that already does it. Radix sort would be a nice start - there should be plenty of open source GPU kernels one could build upon!
Contributions are more than welcome
You could actually read the blog post, where I explain exactly that in the section Writing GPU Kernels:
using GPUArrays, CuArrays

# Overloading the Julia Base map! function for GPUArrays
function Base.map!(f::Function, A::GPUArray, B::GPUArray)
    # our function that will run on the gpu
    function kernel(state, f, A, B)
        # If launch parameters aren't specified, linear_index gets the index
        # into the Array passed as second argument to gpu_call (`A`)
        i = linear_index(state)
        if i <= length(A)
            @inbounds A[i] = f(B[i])
        end
        return
    end
    # call kernel on the gpu
    gpu_call(kernel, A, (f, A, B))
end
Let's try to figure out what this is doing! In simple terms, this will call the Julia function kernel length(A) times in parallel on the GPU. Each parallel invocation of kernel has a thread index, which we can use to safely index into the arrays A and B. If we calculated our own indices instead of using linear_index, we'd need to make sure that we don't have multiple threads reading and writing to the same array locations. So, if we wrote this in pure Julia with threads, an equivalent version would look like this:
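Roughly, something like this (a sketch; threaded_map! is an illustrative name, not part of the post's API):

using Base.Threads

# Pure-Julia equivalent: @threads splits the index range across CPU
# threads, mirroring the parallel kernel invocations on the GPU.
function threaded_map!(f::Function, A::Array, B::Array)
    @threads for i in 1:length(A)
        @inbounds A[i] = f(B[i])
    end
    A
end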
I don't think I can comprehend it. I did read it, but it's so far away from anything I've come across. Maybe you have worked with GPUs for so long that you don't understand what others don't understand.
Am I right in saying that you're suggesting we shouldn't write that loop, and instead use a kernel? It's not obvious to me how that section relates to my question.
Maybe I should summarise where I am confused. I get the feeling from the post that any function that works on isbits arrays on the CPU will work on the GPU. I also get that when you call a function on the GPU, it calls it "multiple times in parallel". So when I call a function with a loop in it, does it run the function multiple times, or does it somehow figure out that it should parallelize the loop? Or is writing a loop inside a function a bad way to program GPUs?
Any kernel you want to call needs to get scheduled N times in parallel - so that's the implicit loop you won't get rid of. Inside the kernel you can use whatever you want, including loops, which then run in parallel across the invocations.
somehow figure out that it should parallelize the loop
No, you will need to remove the loop you want to parallelize and use the implicit loop via the parallel invocations.
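Concretely, a minimal sketch of that transformation, reusing the gpu_call/linear_index API from the snippet above (add_one! is a made-up name):

using GPUArrays, CuArrays

# CPU version: the loop over i is written explicitly.
function add_one!(A::Array)
    for i in 1:length(A)
        @inbounds A[i] += 1
    end
    A
end

# GPU version: the explicit loop is gone - each of the length(A)
# parallel invocations handles exactly one index.
function add_one!(A::GPUArray)
    function kernel(state, A)
        i = linear_index(state)
        if i <= length(A)
            @inbounds A[i] += 1
        end
        return
    end
    gpu_call(kernel, A, (A,))
    A
end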