CuArrays vs CUDAnative


#1

Hi,
From a 2D Array of floats b(nx,ny), I want to compute a(nx,ny-1) on the GPU as:

function julia_diff_y(a,b)
    s=size(a)
    for j=1:s[2]
        for i=1:s[1]
            @inbounds a[i,j]=b[i,j+1]-b[i,j]
        end
    end
end

I obtain very good performance with CUDAnative.jl, and a speedup (≈x15) which corresponds to the memory-bandwidth ratio between the CPU and the GPU:

 ((nx, ny), (nblocks, thread_per_block)) = ((65536, 256), (512, 128))
CPU time:0.008354672 s
GPU time 0.000565216 s
SpeedUp: x14.781379063901454
CPU Bandwidth:16.064990702208295 GB/s
GPU Bandwidth:237.46271722739323 GB/s

Is it possible to implement this computation with CuArrays?

Laurent

Here is the MWE:

using Test,BenchmarkTools
using CuArrays,CUDAnative
import CUDAdrv

function julia_diff_y(a,b)
    s=size(a)
    for j=1:s[2]
        for i=1:s[1]
            @inbounds a[i,j]=b[i,j+1]-b[i,j]
        end
    end
end
function kernel_diff_y(a,b)
    s=size(a)
    nj=s[2]
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    @inbounds bp=b[i,1]
    for j=1:nj
        @inbounds bp1=b[i,j+1]
        @inbounds a[i,j]=bp1-bp
        bp=bp1
    end
    return nothing
end

function tst(N;thread_per_block=128)
    nx=N^2
    ny=N
    nblocks,rem=divrem(N^2,thread_per_block)
    @assert rem==0

    a=ones(Float32,nx,ny-1)
    b=ones(Float32,nx,ny)
    copyto!(b,1:length(b))

    d_a = CuArray(a)
    d_b = CuArray(b)

    @show (nx,ny),(nblocks,thread_per_block)
    #warm-up
    @cuda blocks=nblocks threads=thread_per_block kernel_diff_y(d_a, d_b)
    CUDAdrv.synchronize()
    tgpu = CUDAdrv.@elapsed begin
        @cuda blocks=nblocks threads=thread_per_block kernel_diff_y(d_a, d_b)
    end
    julia_diff_y(a,b) #warm-up
    tcpu= @elapsed julia_diff_y(a,b)
    println("CPU time:$tcpu s")
    println("GPU time $tgpu s")
    println("SpeedUp: x$(tcpu/tgpu)")
    # @show tgpu,tcpu,speedupGPU
    bandwidth_GBs(nx,ny,ts) = nx*ny*sizeof(Float32)*2/(ts*1.e9)
    println("CPU Bandwidth:$(bandwidth_GBs(nx,ny,tcpu)) GB/s")
    println("GPU Bandwidth:$(bandwidth_GBs(nx,ny,tgpu)) GB/s")
    # @show bandwidth_GBs(nx,ny,tgpu),bandwidth_GBs(nx,ny,tcpu)
    #check result
    julia_diff_y(a,b)
    @test a ≈ Array(d_a)

end

tst(256)

#2

d_a .= @view(d_b[:, 2:end]) .- @view(d_b[:, 1:end-1])
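
For readers without a GPU handy, the broadcast formulation can be checked against the loop version on plain Arrays. A minimal CPU-only sketch (the small sizes are arbitrary, chosen just for the check):

```julia
using Test

# Arbitrary small sizes, just for checking correctness on the CPU.
nx, ny = 8, 5
b = reshape(Float32.(1:nx*ny), nx, ny)

# Loop version (same computation as julia_diff_y above).
a_loop = zeros(Float32, nx, ny-1)
for j = 1:ny-1, i = 1:nx
    @inbounds a_loop[i,j] = b[i,j+1] - b[i,j]
end

# Broadcast version: the views avoid copies, and .= writes in place.
a_bc = zeros(Float32, nx, ny-1)
a_bc .= @view(b[:, 2:end]) .- @view(b[:, 1:end-1])

@test a_bc == a_loop
```

Base's `diff(b, dims=2)` computes the same differences, but it allocates a fresh array instead of writing into a preallocated one.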

Modified driver (actually using BenchmarkTools):

function tst(N;thread_per_block=128)
    nx = N^2
    ny = N

    a = ones(Float32,nx,ny-1)
    b = ones(Float32,nx,ny)
    copyto!(b,1:length(b))

    println(@benchmark begin
        julia_diff_y($a,$b)
    end)

    d_a = CuArray(a)
    d_b = CuArray(b)
    CuArrays.allowscalar(false)

    nblocks,rem = divrem(N^2,thread_per_block)
    @assert rem == 0
    println(@benchmark begin
        @cuda blocks=$nblocks threads=$thread_per_block kernel_diff_y($d_a, $d_b)
        CUDAdrv.synchronize()
    end)

    @test a ≈ Array(d_a)

    println(@benchmark begin
        $d_a .= @view($d_b[:, 2:end]) .- @view($d_b[:, 1:end-1])
        CUDAdrv.synchronize()
    end)

    @test a ≈ Array(d_a)
end

Timings on my GPU:

Trial(11.149 ms) # CPU
Trial(688.821 μs) # CUDAnative
Trial(1.496 ms) # CuArrays

Twice as slow, which is expected: broadcast has some overhead (bounds checking, possibly padding the dimensions of heterogeneous containers, etc.), and you're only performing a single operation. If your operation were more complex (which is easy with broadcast, since operations now fuse) the timings would be comparable.
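
To illustrate the fusion point, here is a made-up variant of the stencil (the coefficient `c` and the extra `abs` term are purely illustrative, not from the benchmark above). The whole right-hand side fuses into a single loop on the CPU, or a single kernel launch on CuArrays; plain Arrays are used here so the sketch runs without a GPU:

```julia
# Plain Arrays so the sketch runs anywhere; the same broadcast line works
# unchanged when d_a and d_b are CuArrays.
d_b = reshape(Float32.(1:12), 4, 3)
d_a = zeros(Float32, 4, 2)
c = 0.5f0   # illustrative coefficient, not from the original benchmark

# The difference, the scaling, and the abs all fuse into one pass over memory:
d_a .= c .* (@view(d_b[:, 2:end]) .- @view(d_b[:, 1:end-1])) .+
       abs.(@view(d_b[:, 1:end-1]))
```

Because everything fuses, the extra arithmetic costs almost nothing compared to the memory traffic, which is why a more complex operation narrows the gap with the hand-written kernel.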


#3

Thank you very much again!
It is very compact indeed! I will fuse more operations into the broadcast expression.
I confess that the broadcast style is not yet very intuitive for me. I have to practice :wink:

I would like to point out to casual readers that the broadcast version, which is compact and easy to read, can be applied to both CPU and GPU arrays, which can lead to a drastic reduction in maintenance costs. On my config (an old 780 Ti and a more recent i7-6700K) I obtain (with the initial setup):

((nx, ny), (nblocks, thread_per_block)) = ((65536, 256), (512, 128))
CPU time:0.008324528 s
GPU time 0.00056332804 s
SpeedUp: x14.777407547477248
CPU Bandwidth:16.123163739733954 GB/s
GPU Bandwidth:238.258561536756 GB/s
GPU broadcast time 0.0014812159 s
SpeedUp broadcast: x5.620063761245886
GPU broadcast Bandwidth:90.6132082503125 GB/s
CPU broadcast time 0.008135392 s
CPU broadcast Bandwidth:16.498003113553242 GB/s
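
The portability point can be made concrete by wrapping the broadcast in a small helper. `diff_y!` is a hypothetical name, and the sketch below exercises it on plain Arrays, but the very same definition is what runs on CuArrays:

```julia
# Hypothetical helper: one definition serves both CPU and GPU, because
# broadcasting dispatches on the concrete array type of its arguments.
function diff_y!(a, b)
    a .= @view(b[:, 2:end]) .- @view(b[:, 1:end-1])
    return a
end

b = reshape(Float32.(1:20), 4, 5)   # columns differ by a constant 4
a = similar(b, 4, 4)
diff_y!(a, b)           # Array inputs: runs on the CPU
# diff_y!(d_a, d_b)     # CuArray inputs: the same code runs on the GPU
```

Only one code path needs to be tested and maintained for both backends.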

On the CPU, what may be obvious to native Julia developers is that the broadcast version implies no performance penalty compared to the loop-based version.
The performance penalty is larger on my GPU card (x2.7), but:

  • I assume no padding (the size is a multiple of the thread count)
  • I have tuned the CUDA launch parameters manually
  • I had to express a trivial optimization manually in the CUDA kernel:
    if I replace

    @inbounds bp=b[i,1]
    for j=1:nj
        @inbounds bp1=b[i,j+1]
        @inbounds a[i,j]=bp1-bp
        bp=bp1
    end

by the natural

for j=1:nj
    @inbounds a[i,j]=b[i,j+1]-b[i,j]
end

then the values of b are read twice from VRAM and about 30% of the performance is lost…


#4

To be fair, there is a single annoyance right now: views are allocating, see https://github.com/JuliaLang/julia/issues/14955. In the case of GPU computations that cost is often dwarfed by, e.g., the cost of launching a kernel, but it has prevented the use of broadcast in performance-sensitive parts of some libraries (ForwardDiff and DiffEq come to mind).
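
On Julia versions affected by that issue, the allocation is easy to observe with `@allocated`. A minimal sketch (`make_view` is a made-up helper just to isolate the view construction; whether the reported number is nonzero depends on your Julia version):

```julia
b = rand(Float32, 64, 64)

# Hypothetical helper, so the view construction is what gets measured.
make_view(b) = @view b[:, 2:end]

make_view(b)                      # run once first, so compilation isn't measured
println(@allocated make_view(b))  # nonzero on Julia versions affected by #14955
```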