GPU sort WIP (GPU 1000x faster than CPU? I must be doing something wrong)

sort

#1

I am working on a GPU sort using CUDAnative and CuArrays. This is my generic bitonic sort (can only sort arrays of size 2^N elements for now).

I am sorting 2^26 = 67_108_864 Float32s, comparing the optimised RadixSort (CPU) against my naive (I just learned GPU programming) bitonic sort.

And the result is that

  • CPU sort is 735ms = 0.735s
  • GPU sort is ~0.000750s

I have a fairly recent upper-mid-range card, an RTX 2080 (not Ti).

using CuArrays, CUDAdrv, CUDAnative

# sort kernel: one compare-exchange step of the bitonic network
bisort!(shared, j, k) = begin
  tid = UInt(((blockIdx().x - 1) * blockDim().x + threadIdx().x) - 1)  # 0-based thread id
  ixj = tid ⊻ UInt(j)  # partner index for this compare-exchange
  if ixj > tid
    if (tid & k) == 0
        # ascending half of the k-block: keep the smaller element at tid
        if shared[tid+1] > shared[ixj+1]
          tmp = shared[ixj+1]
          shared[ixj+1] = shared[tid+1]
          shared[tid+1] = tmp
        end
    else
        # descending half: keep the larger element at tid
        if shared[tid+1] < shared[ixj+1]
          tmp = shared[ixj+1]
          shared[ixj+1] = shared[tid+1]
          shared[tid+1] = tmp
        end
    end
  end
  return
end

shared = rand(Float32, 2^26)

bitonicsort!(cushared) = begin
  CuArrays.@sync begin
    k = UInt(2)
    NUM = length(cushared)
    nblocks = ceil(NUM/256) |> Int
    while (k <= NUM)
      j = div(k, 2)
      while j >= 1
        @cuda threads = 256 blocks = nblocks bisort!(cushared, j, k)
        j =  div(j,2)
      end
      k = UInt(k*2)
    end
  end
end


using SortingAlgorithms, BenchmarkTools
@btime sort!($shared, alg=RadixSort) # 735.216 ms

shared = rand(Float32, 2^26)

using BenchmarkTools
res = Float64[]
using Flux
for i = 1:10
  cushared = cu(shared)
  cpu(cushared) |> issorted |> println
  t = Base.@elapsed bitonicsort!(cushared)
  cpu(cushared) |> issorted |> println
  push!(res, t)
end


res
#0.203231082
#0.000757748
#0.000749437
#0.000749713
#0.000772709
#0.000769384
#0.000771878
#0.000763289
#0.00076606
#0.000767722

#2

It seems you are missing synchronize(). Since the while-loop iterations are sequential, the synchronisation should be done after each @cuda call. Kernel launches are asynchronous: the CPU issues the launch and moves on, so without synchronising you are only timing the launches, not the sort itself.


#3

Also, https://github.com/JuliaGPU/CuArrays.jl/issues/268


#4

Updated my code. I didn't need to synchronise after every @cuda call; synchronising once at the end seems to be sufficient.

Now the performance dropped to a very disappointing 2.6s. Back to the drawing board. Probably need to update the algorithm to reduce the number of CUDA calls.

using CuArrays, CUDAdrv, CUDAnative

# sort kernel
bisort!(shared, j, k) = begin
  tid = UInt(((blockIdx().x - 1) * blockDim().x + threadIdx().x)-1)
  ixj = tid⊻UInt(j)
  if ixj > tid
    if (tid & k) == 0
        if shared[tid+1] > shared[ixj+1]
          tmp = shared[ixj+1]
          shared[ixj+1] = shared[tid+1]
          shared[tid+1] = tmp
        end
    else
        if shared[tid+1] < shared[ixj+1]
          tmp = shared[ixj+1]
          shared[ixj+1] = shared[tid+1]
          shared[tid+1] = tmp
        end
    end
  end
  return
end

bitonicsort!(cushared) = begin
  CuArrays.@sync begin
    k = UInt(2)
    NUM = length(cushared)
    nblocks = ceil(NUM/256) |> Int
    while (k <= NUM)
      j = div(k, 2)
      while j >= 1
        @cuda threads = 256 blocks = nblocks bisort!(cushared, j, k)
        j =  div(j,2)
      end
      k = UInt(k*2)
    end
  end
end


using SortingAlgorithms, BenchmarkTools
shared = rand(Float32, 2^26)
cpusort = @belapsed sort!($shared, alg=RadixSort) #735.216

shared = rand(Float32, 2^26)

res = Float64[]
using Flux
for i = 1:10
  cushared = cu(shared)
  sorted_shared = sort(shared, alg=RadixSort)
  println("exp false;")
  println("got $(cpu(cushared) |> issorted)")
  t = Base.@elapsed begin
    bitonicsort!(cushared)
    CUDAdrv.synchronize()
  end
  xx = cpu(cushared)
  println("exp true;")
  println("got $(xx |> issorted); max error: $(1_000_000_000maximum(xx .- sorted_shared))")
  push!(res, t)
end

res
# 6.457073086
# 2.774147852
# 2.771599214
# 2.770980271
# 2.778133025
# 2.769555927
# 2.799603755
# 2.774497496
# 2.790657341
# 2.790034242

I can double the thread count per block and reduce the timing to 1.4s, but it is still slower than the CPU RadixSort.
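For reference, a rough count of the launches (my own back-of-the-envelope arithmetic): the current scheme issues one kernel launch per (k, j) pair, each a full pass over the array, so for N = 2^26:

```julia
# One kernel launch per (k, j) pair: for k = 2^m the inner loop runs m times
# (j = k/2, k/4, ..., 1), so the total is sum(1:log2(N)).
N = 2^26
launches = sum(1:Int(log2(N)))
println(launches)  # 351 launches, each touching all N elements
```

Fusing several consecutive j-steps into one kernel (e.g. the steps that fit within a block's shared memory) is the usual way to cut that number down.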


#5

In case you don't already know, the CUB (CUDA UnBound) library has an excellent implementation of radix sort in CUDA. I think the bar is pretty high.


#6

I didn’t know that. I read about Thrust.

So those are written in CUDA C, right? I am hoping to implement the same in CUDAnative and compare the performance.

Also, if there is a convenient way to call that library from Julia on CuArrays, that would be super cool too!


#7

It’s in C++, and crazily templated. I wrapped the single function that I need in C and just ccall it. Nothing to do with CuArrays here.
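For anyone unfamiliar with the mechanism: ccall invokes an exported C symbol directly. A stand-in illustration using a libc function (the actual CUB wrapper symbol isn't shown in this thread):

```julia
# Illustration of the ccall mechanism with a libc function; a CUB wrapper
# would export its own C symbol and take device pointers and a length instead.
len = ccall(:strlen, Csize_t, (Cstring,), "hello")
println(len)  # 5
```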


#8

OK, as long as it can sort Julia arrays. But we will need to transfer the arrays we want to sort to the GPU anyway, so making it work with CuArrays is the natural way to use it from Julia.


#9

It’s always good to have a pure-Julia version, even if there exists a much faster implementation (CUB): e.g. to support element types that the other implementation doesn't cover or isn't compiled for, to work with array views, to support complex sorting conditions, etc.
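As a sketch of what such a pure-Julia fallback could look like, here is a sequential CPU version of the same bitonic network (bitonicsort_cpu! is a hypothetical name; it is mainly useful for validating the GPU kernel on small inputs):

```julia
# Hypothetical sequential CPU version of the same bitonic network:
# the for-loop over tid plays the role of the CUDA thread grid.
function bitonicsort_cpu!(a::AbstractVector)
    n = length(a)
    @assert ispow2(n)  # the network only handles 2^N elements
    k = 2
    while k <= n
        j = k ÷ 2
        while j >= 1
            for tid in 0:(n - 1)
                ixj = tid ⊻ j                      # compare-exchange partner
                if ixj > tid
                    ascending = (tid & k) == 0     # direction of this k-block
                    if (a[tid + 1] > a[ixj + 1]) == ascending
                        a[tid + 1], a[ixj + 1] = a[ixj + 1], a[tid + 1]
                    end
                end
            end
            j ÷= 2
        end
        k *= 2
    end
    return a
end

v = rand(Float32, 2^10)
bitonicsort_cpu!(v)
println(issorted(v))  # true
```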


#10

I actually thought about porting CUB to Julia last year, but last time I checked CUDAnative was lacking a few features like thread-local storage, texture memory, or something like that. For my purposes calling one function is good enough, so unfortunately I was not motivated enough to take that on.

Nonetheless CUB is very, very interesting and I learned a lot from reading its codebase. I think it showcases that a GPU can do much more than convolution kernels. A lot of possibilities remain unexplored.


#11

Yes, quite a few CUDA features are still missing. I’d love to spend more time on that, but there’s been very little interest in these features. So definitely open issues or ask for help if you’re interested.


#12

Yeah, there is an atomics feature request; I need atomics to implement a fast countmap. I am keen to see that feature implemented, and I promise to implement countmap for CuArrays if atomics make it into CUDAnative.
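For reference, a plain-Julia countmap is just a Dict of increments; the GPU version would replace the increment with an atomic add into a histogram buffer, which is what the atomics feature is needed for. A minimal CPU sketch (countmap_cpu is a hypothetical name):

```julia
# Hypothetical CPU countmap: counts occurrences of each value.
# On the GPU, counts[x] += 1 becomes an atomic add into a device buffer.
function countmap_cpu(xs)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

println(countmap_cpu([1, 2, 2, 3, 3, 3]))  # counts: 1 => 1, 2 => 2, 3 => 3
```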


#13

I’ve implemented atomics for CUDAnative: