GPU sort WIP (GPU 1000x faster than CPU? I must be doing something wrong)

sort

#1

I am working on a GPU sort using CUDAnative and CuArrays. This is my generic bitonic sort (can only sort arrays of size 2^N elements for now).

I am sorting 2^26 = 67_108_864 Float32s, comparing the optimised RadixSort (CPU) against my naive (I just learned GPU programming) bitonic sort.

And the result is that

  • CPU sort is 735ms = 0.735s
  • GPU sort is ~0.000750s

I have a fairly recent upper-mid-range card, an RTX 2080 (not Ti).

using CuArrays, CUDAdrv, CUDAnative

# sort kernel: one compare-exchange step of the bitonic network
bisort!(shared, j, k) = begin
  tid = UInt(((blockIdx().x - 1) * blockDim().x + threadIdx().x) - 1)  # 0-based thread id
  ixj = tid ⊻ UInt(j)  # partner index for this compare-exchange
  if ixj > tid
    if (tid & k) == 0
        # ascending half of the k-block: keep the smaller element at tid
        if shared[tid+1] > shared[ixj+1]
          tmp = shared[ixj+1]
          shared[ixj+1] = shared[tid+1]
          shared[tid+1] = tmp
        end
    else
        # descending half: keep the larger element at tid
        if shared[tid+1] < shared[ixj+1]
          tmp = shared[ixj+1]
          shared[ixj+1] = shared[tid+1]
          shared[tid+1] = tmp
        end
    end
  end
  return
end

shared = rand(Float32, 2^26)

bitonicsort!(cushared) = begin
  CuArrays.@sync begin
    k = UInt(2)
    NUM = length(cushared)
    nblocks = ceil(NUM/256) |> Int
    while (k <= NUM)
      j = div(k, 2)
      while j >= 1
        @cuda threads = 256 blocks = nblocks bisort!(cushared, j, k)
        j =  div(j,2)
      end
      k = UInt(k*2)
    end
  end
end


using SortingAlgorithms, BenchmarkTools
@btime sort!($shared, alg=RadixSort) # 735.216 ms

shared = rand(Float32, 2^26)

using BenchmarkTools
res = Float64[]
using Flux
for i = 1:10
  cushared = cu(shared)
  cpu(cushared) |> issorted |> println
  t = Base.@elapsed bitonicsort!(cushared)
  cpu(cushared) |> issorted |> println
  push!(res, t)
end


res
#0.203231082
#0.000757748
#0.000749437
#0.000749713
#0.000772709
#0.000769384
#0.000771878
#0.000763289
#0.00076606
#0.000767722

#2

It seems you are missing synchronize(). Since the while-loop iterations are sequential, the synchronisation should be done after each @cuda call. Kernel launches are asynchronous: the CPU issues the launch and moves on, so without synchronising you are only timing the launches, not the sort itself.


#3

Also, https://github.com/JuliaGPU/CuArrays.jl/issues/268


#4

Updated my code. I didn't need to synchronise after every @cuda call; synchronising once at the end seems to be sufficient.

Now the performance dropped to a very disappointing 2.6s. Back to the drawing board. Probably need to update the algorithm to reduce the number of CUDA calls.

using CuArrays, CUDAdrv, CUDAnative

# sort kernel
bisort!(shared, j, k) = begin
  tid = UInt(((blockIdx().x - 1) * blockDim().x + threadIdx().x)-1)
  ixj = tid⊻UInt(j)
  if ixj > tid
    if (tid & k) == 0
        if shared[tid+1] > shared[ixj+1]
          tmp = shared[ixj+1]
          shared[ixj+1] = shared[tid+1]
          shared[tid+1] = tmp
        end
    else
        if shared[tid+1] < shared[ixj+1]
          tmp = shared[ixj+1]
          shared[ixj+1] = shared[tid+1]
          shared[tid+1] = tmp
        end
    end
  end
  return
end

bitonicsort!(cushared) = begin
  CuArrays.@sync begin
    k = UInt(2)
    NUM = length(cushared)
    nblocks = ceil(NUM/256) |> Int
    while (k <= NUM)
      j = div(k, 2)
      while j >= 1
        @cuda threads = 256 blocks = nblocks bisort!(cushared, j, k)
        j =  div(j,2)
      end
      k = UInt(k*2)
    end
  end
end


using SortingAlgorithms, BenchmarkTools
shared = rand(Float32, 2^26)
cpusort = @belapsed sort!($shared, alg=RadixSort) #735.216

shared = rand(Float32, 2^26)

res = Float64[]
using Flux
for i = 1:10
  cushared = cu(shared)
  sorted_shared = sort(shared, alg=RadixSort)
  println("exp false;")
  println("got $(cpu(cushared) |> issorted)")
  t = Base.@elapsed begin
    bitonicsort!(cushared)
    CUDAdrv.synchronize()
  end
  xx = cpu(cushared)
  println("exp true;")
  println("got $(xx |> issorted); max error: $(1_000_000_000maximum(xx .- sorted_shared))")
  push!(res, t)
end

res
# 6.457073086
# 2.774147852
# 2.771599214
# 2.770980271
# 2.778133025
# 2.769555927
# 2.799603755
# 2.774497496
# 2.790657341
# 2.790034242

I can double the thread count per block and reduce the timing to 1.4s, but it is still slower than the CPU RadixSort.
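For reference, a rough count of the launches (my own back-of-the-envelope arithmetic): the current scheme issues one kernel launch per (k, j) pair, each a full pass over the array, so for N = 2^26:

```julia
# One kernel launch per (k, j) pair: for k = 2^m the inner loop runs m times
# (j = k/2, k/4, ..., 1), so the total is sum(1:log2(N)).
N = 2^26
launches = sum(1:Int(log2(N)))
println(launches)  # 351 launches, each touching all N elements
```

Fusing several consecutive j-steps into one kernel (e.g. the steps that fit within a block's shared memory) is the usual way to cut that number down.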


#5

In case you don't already know, the CUB (CUDA UnBound) library has an excellent implementation of radix sort in CUDA. I think the bar is pretty high.


#6

I didn’t know that. I read about Thrust.

So those are written in CUDA C, right? I am hoping to implement the same in CUDAnative and compare the performance.

Also, if there is a convenient way to call that library from Julia on CuArrays, that would be super cool too!


#7

It’s in C++, and crazily templated. I wrapped the single function that I need in C and just ccall it. Nothing to do with CuArrays here.
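For anyone unfamiliar with the mechanism: ccall invokes an exported C symbol directly. A stand-in illustration using a libc function (the actual CUB wrapper symbol isn't shown in this thread):

```julia
# Illustration of the ccall mechanism with a libc function; a CUB wrapper
# would export its own C symbol and take device pointers and a length instead.
len = ccall(:strlen, Csize_t, (Cstring,), "hello")
println(len)  # 5
```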


#8

OK, as long as it can sort Julia arrays. But we will need to transfer the arrays we want to sort to the GPU anyway, so making it work with CuArrays is the natural way to use it from Julia.


#9

It’s always good to have a pure-Julia version, even if there exists a much faster implementation (CUB): e.g. to support element types that the other implementation doesn't cover or isn't compiled for, to work with array views, to support complex sorting conditions, etc.
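As a sketch of what such a pure-Julia fallback could look like, here is a sequential CPU version of the same bitonic network (bitonicsort_cpu! is a hypothetical name; it is mainly useful for validating the GPU kernel on small inputs):

```julia
# Hypothetical sequential CPU version of the same bitonic network:
# the for-loop over tid plays the role of the CUDA thread grid.
function bitonicsort_cpu!(a::AbstractVector)
    n = length(a)
    @assert ispow2(n)  # the network only handles 2^N elements
    k = 2
    while k <= n
        j = k ÷ 2
        while j >= 1
            for tid in 0:(n - 1)
                ixj = tid ⊻ j                      # compare-exchange partner
                if ixj > tid
                    ascending = (tid & k) == 0     # direction of this k-block
                    if (a[tid + 1] > a[ixj + 1]) == ascending
                        a[tid + 1], a[ixj + 1] = a[ixj + 1], a[tid + 1]
                    end
                end
            end
            j ÷= 2
        end
        k *= 2
    end
    return a
end

v = rand(Float32, 2^10)
bitonicsort_cpu!(v)
println(issorted(v))  # true
```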


#10

I actually thought about porting CUB to Julia last year, but last time I checked CUDAnative was lacking a few features like thread-local storage, texture memory, or something like that. For my purposes calling one function is good enough, so unfortunately I was not motivated enough to take that on.

Nonetheless CUB is very, very interesting and I learned a lot from reading its codebase. I think it showcases that a GPU can do much more than convolution kernels. A lot of possibilities remain unexplored.


#11

Yes, quite a few CUDA features are still missing. I’d love to spend more time on that, but there’s been very little interest in these features. So definitely open issues or ask for help if you’re interested.


#12

Yeah, there is an atomics feature request; I need atomics to implement a fast countmap. I am keen to see that feature implemented, and I promise to implement countmap for CuArrays if atomics make it into CUDAnative.
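For reference, a plain-Julia countmap is just a Dict of increments; the GPU version would replace the increment with an atomic add into a histogram buffer, which is what the atomics feature is needed for. A minimal CPU sketch (countmap_cpu is a hypothetical name):

```julia
# Hypothetical CPU countmap: counts occurrences of each value.
# On the GPU, counts[x] += 1 becomes an atomic add into a device buffer.
function countmap_cpu(xs)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

println(countmap_cpu([1, 2, 2, 3, 3, 3]))  # counts: 1 => 1, 2 => 2, 3 => 3
```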


#13

I’ve implemented atomics for CUDAnative: