[blog post] Introduction to GPU programming

Any function you write for the GPU will get executed in parallel; there is almost no opting out of that. So even the most basic kernel call on the GPU is in fact already a loop, since it always gets scheduled to run many times in parallel, each invocation with a different index. This sets up a different context for the whole “do I need vectorization” discussion: vectorization is a very easy way to profit from this execution model, but you might as well write a GPU function that uses loops and is fast, as long as you can run that function a few hundred times in parallel!
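
To make that concrete, here is a minimal sketch in CUDA (an assumption on my part, the discussion isn’t tied to one GPU API): the kernel body is written for a single element, and the launch itself plays the role of the outer loop.

```cuda
#include <cuda_runtime.h>

// Every thread executes this same function in parallel; the only thing
// that differs per invocation is the index each thread computes for itself.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this invocation's index
    if (i < n)            // guard: the grid may be slightly larger than n
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // The launch is the implicit loop: roughly n threads, one per element.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```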
So you could indeed write a loop over a large array in parallel, have smaller sum loops inside the big loop, and it’s fast :wink:
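
For example (again a CUDA sketch, with made-up names like `row_sums`): one thread per row of a matrix, each thread running an ordinary sequential sum loop over its row. The parallelism comes from launching one thread per row, not from vectorizing the inner loop.

```cuda
#include <cuda_runtime.h>

// One thread per row; the inner loop over the columns runs serially
// within each thread, and that's fine as long as there are enough
// rows to keep the GPU busy.
__global__ void row_sums(const float *matrix, float *sums, int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)   // plain sequential sum loop
        acc += matrix[row * cols + c];
    sums[row] = acc;
}
```

Launched as e.g. `row_sums<<<(rows + 255) / 256, 256>>>(d_matrix, d_sums, rows, cols);`, this keeps the GPU saturated once `rows` reaches a few hundred or more.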

Also, how do I invoke the GPU’s sorting functions to sort a large vector?

Implement it :frowning: Or wrap a library that already does it. Radix sort would be a nice start; there should be plenty of open-source GPU kernels one could build upon!
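
For the “wrap a library” route, here is a minimal sketch using Thrust, which ships with the CUDA Toolkit (the choice of Thrust is my assumption; for primitive key types its sort dispatches to a radix sort under the hood):

```cuda
#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    const int n = 1 << 24;

    // Fill a host vector with pseudo-random keys, then copy to the device.
    thrust::host_vector<int> h(n);
    for (int i = 0; i < n; ++i)
        h[i] = std::rand();
    thrust::device_vector<int> d = h;  // host-to-device copy

    // Sorting device iterators runs entirely on the GPU; no hand-written
    // kernel needed.
    thrust::sort(d.begin(), d.end());

    return 0;
}
```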
Contributions are more than welcome :slight_smile: