CUDAnative is awesome!


#1

Hi,

I have just finished to rewrite my FEM solver with CUDAnative.jl and I want to address a big thank you to Tim and all the contributors for this package !

The performance is on par with the native C++/CUDA implementation. CUDAnative.jl greatly accelerates the dev, debug and especially fine tuning of the CUDA parameters. nvvp works nicely in this process. I have used and appreciated PyCUDA and PyOpenCL in the past, but Julia as a host language allows to build very efficient programs containing both CPU and GPU computations.

I think CUDAnative is an excellent entry point for GPGPU programming beginners.

I have also appreciated to use CuArrays and broadcast but I did not obtain the same level of performance (probably due to my limited skills).

Since December is getting closer, I start to write my wish list :slight_smile:

  • I wonder if it would be a good idea to allow to run the CUDANative kernels on the CPU (possibly with automatic multi-threading (and maybe simd)) when Base.Arrays are used instead of CuArrays. Of course optimal CPU implementations would probably imply data layout transformation I wrote a small paper on this topic here.
  • Of course all the CUDA extensions (atomic, GPU intrinsics,…) would be a very much appreciated :wink:

Thank you again
Laurent


#2

Great, happy to hear CUDAnative has been useful and performs well :slight_smile:

I have also appreciated to use CuArrays and broadcast but I did not obtain the same level of performance (probably due to my limited skills).

Probably not; I’ve spent significant time optimizing CUDAnative and making sure the generated code quality is competitive, while CuArrays hasn’t seen much optimization…

I wonder if it would be a good idea to allow to run the CUDANative kernels on the CPU (possibly with automatic multi-threading (and maybe simd)) when Base.Arrays are used instead of CuArrays. Of course optimal CPU implementations would probably imply data layout transformation I wrote a small paper on this topic here.

I considered that in the past, but I’m not sure it’s a smart investment of our (very limited) development manpower, especially with most users nowadays relying on array abstractions where this already the case. If you’re really interested in this, it might be better to revive a project like GPUOcelot, which implements the CUDA APIs and provides a PTX->LLVM IR compiler. But it hasn’t seen any development recently.

Of course all the CUDA extensions (atomic, GPU intrinsics,…) would be a very much appreciated :wink:

Similar trade off, I prefer to work on CUDAnative but it improving CuArrays reaches more users.


#3

Thank you,
I hope I can find some time soon to experiment on a method to run of CUDA kernels on CPU. A basic version should be trivial to implement but optimized one do require some investment…

I am not sure to understand your last sentence… you will focus on CuArrays optim ?


#4

I have registered a package to support atomic operations in CUDAnative:

Ideally we’d have LLVM intrinsics allowing interop with the normal julia atomic functions, but this package just uses inline PTX and is thus CUDA-specific.

While I’m here allow me to second Laurent: CUDAnative is indeed awesome!


#5

I’d like to know how did you learn, or what tutorials have you followed to write GPU kernels. I started with OpenCL C, and now I am using Julia and CUDA. I find it difficult to find documentations on how to do things like a general work flow, launching configuration, optimizations. I mean details things such as what API, and what syntax to use. If it was CUDA C, I can refer to the programming guide from NVIDIA, but for Julia, I had to scrap bits and pieces from blogs, GitHub issues, and discourse of course to get by. But I wish there is a centralized programming guide for Julia GPU computing. If you have resources, would you mind sharing with me?


#6

Check out the examples directory in the CUDAnative repo for a start https://github.com/JuliaGPU/CUDAnative.jl/tree/master/examples. The basic API is very similar to CUDA C, but of course with all the multiple dispatch and compiler magic of Julia! If you have further questions beyond those examples, you can ask on Discourse or on the slack channel gpu with a minimal, self-contained working (or not working) example where possible.


#7

Hi,
I apologize because my statement is wrong or at least unclear. AFAIK, there is no centralized Julia GPGPU programming guide for beginners, and it is almost impossible to learn GPGPU only from Julia based documentation.

What I wanted to say is that Julia GPU tools (especially CUDANative.jl) makes GPGPU learning much easier an efficient in different ways:

  • Large reduction of the GPGPU setup cost compared to C/C++ CUDA programming where you have to make all the compilation/installation setup (environement.CMake, test of drivers…). On my ubuntu system, once I have installed the CUDA tookit package (apt-get install nvidia-cuda-toolkit), everything works nicely from Julia.
  • Nice integration with Julia native Arrays: the syntax to create and transfer a Julia Array to the GPU and the copy back to the CPU is transparent and intuitive:
  n=1024
  a=ones(Float32,n,n) # normal CPU matrix of float
  d_a = CuArray(a)    # copy to a GPU
  copyto!(a,d_a)      # copy back from CPU to GPU
  • Compared to C/C++ dynamic language like Julia (this is also true for PyCUDA) accelerates experiments on optimal threads and block numbers.
  • CUDA kernels syntax is Julia syntax for arrays (real multi-dim arrays). For example, compare the (excellent)
    Mark Harris tutorials nvidia_transpose, example to its Julia translation:
__global__ void copySharedMem(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM * TILE_DIM];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];          
}

with CUDAnative:

function kernel_transpose_tile_shared_bank!(target,source)

    tile = @cuStaticSharedMem(Float32, (TILE_DIM+1,TILE_DIM))

    x=(blockIdx().x-1)*TILE_DIM+threadIdx().x
    y=(blockIdx().y-1)*TILE_DIM+threadIdx().y

    for j=0:BLOCK_ROWS:(TILE_DIM-1)
        tile[threadIdx().x,threadIdx().y+j]=source[x,y+j]
    end

    y=(blockIdx().x-1)*TILE_DIM+threadIdx().y
    x=(blockIdx().y-1)*TILE_DIM+threadIdx().x


    CUDAnative.sync_threads()

    for j=0:BLOCK_ROWS:(TILE_DIM-1)
        target[x,y+j]=tile[threadIdx().y+j,threadIdx().x]
    end
    return nothing
end

Note that with CuArrays you could call a Julia method on these GPU matrices (but it is slower)

permutedims!(target,source,(2,1))

Yesterday I made a short introduction to GPGPU for beginners (with different backgrounds C/Fortran/python/Matlab) and I found Julia to be very convenient to put the focus on fundamentals of GPGPU programming avoiding nasty details. Here is the complete code introducing different steps for explaining the implementation of a basic axpy on GPU.

  • gpu_axpy_0 shows how individual GPU thread is slow
  • gpu_axpy_1 shows that consecutive thread must address consecutive array elts
  • gpu_axpy_2 show the improvement for correct coalesced access
  • gpu_axpy_3 allows for very large number of blocks
  • gpu_axpy_4 shows the final implementation:
  • cpu_axpy_2 shows that you can use Julia broadcast syntax on GPU arrays :wink:

I obtain the following results:

CPU: cpu_axpy_2! Bandwidth:28.892053854678245 GB/s (t=0.003484117)
GPU: gpu_axpy_0! Bandwidth:0.02758891863720712 GB/s (t=3.648685812)
GPU: gpu_axpy_1! Bandwidth:1.7731071085263668 GB/s (t=0.056772259)
GPU: gpu_axpy_2! Bandwidth:6.068441898078472 GB/s (t=0.016587997)
GPU: gpu_axpy_3! Bandwidth:268.05017854337365 GB/s (t=0.000375539)
GPU: gpu_axpy_4! Bandwidth:259.418751965034 GB/s (t=0.000388034)
GPU: cpu_axpy_2! Bandwidth:266.59064871092045 GB/s (t=0.000377595)

In Julia I can provide all the code that you can copy and execute on your machine (with C I could only indicate a link for download an installation setup instructions…).

using Pkg

using Test,BenchmarkTools
using CuArrays,CUDAnative
import CUDAdrv

#y=a*X+y

function cpu_axpy_1!(y,a,x)
    for i=1:length(x)
        y[i]+=a*x[i]
    end
end

function cpu_axpy_2!(y,a,x)
    y .+= a.*x
end

function kernel_axpy_0!(y,a,x,N)
    for i=1:N
        y[i]+=a*x[i]
    end
end

function gpu_axpy_0!(y,a,x)
    total_size=length(x)
    @cuda blocks=1 threads=1 kernel_axpy_0!(y,a,x,total_size)
end

function kernel_axpy_1!(y,a,x,total_size,block_size)
    thread_id=threadIdx().x
    thread_number=blockDim().x
    offset=(thread_id-1)*block_size
    for i=1:block_size
        y[i+offset]+=a*x[i+offset]
    end
end

function gpu_axpy_1!(y,a,x)
    thread_number=256
    total_size=length(x)
    block_size=div(total_size,thread_number)
    @cuda blocks=1 threads=thread_number kernel_axpy_1!(y,a,x,total_size,block_size)
end

function kernel_axpy_2!(y,a,x,total_size)
    thread_id=threadIdx().x
    thread_number=blockDim().x
    for i=thread_id:thread_number:total_size
        y[i]+=a*x[i]
    end
end

function gpu_axpy_2!(y,a,x)
    thread_number=256
    total_size=length(x)
    @cuda blocks=1 threads=thread_number kernel_axpy_2!(y,a,x,total_size)
end


const THREAD_NUMBER=256

function kernel_axpy_3!(y,a,x)

    thread_id=threadIdx().x
    offset=(blockDim().x)*(blockIdx().x-1)

    @inbounds y[thread_id+offset]+=a*x[thread_id+offset]
    return
end

function gpu_axpy_3!(y,a,x)
    nthreads=256
    total_size=length(x)
    (block_number,rem)=divrem(total_size,nthreads)
    @assert rem==0
    # @show block_number,THREAD_NUMBER,total_size
    @cuda blocks=block_number threads=nthreads kernel_axpy_3!(y,a,x)
end


function kernel_axpy_4!(y,a,x,total_size)
    offset=threadIdx().x+blockDim().x*(blockIdx().x-1)
    stride=blockDim().x*gridDim().x

    for i=offset:stride:total_size
        @inbounds y[i]+=a*x[i]
    end
    return
end

function gpu_axpy_4!(y,a,x)
    total_size=length(x)
    # block_number=div(total_size+THREAD_NUMBER-1,THREAD_NUMBER)
    block_number=1024
    # @show block_number,THREAD_NUMBER,total_size
    @cuda blocks=block_number threads=THREAD_NUMBER kernel_axpy_4!(y,a,x,total_size)
end




#2 read + 1 write of Float32 per element
bandwidth_GBs(n,ts) = n*sizeof(Float32)*3/(ts*1.e9)

function evalgflops_gpu(axpy_function,N)
    y=ones(Float32,N)
    yref=ones(Float32,N)
    x=ones(Float32,N)
    copyto!(x,1:length(x)) # x=[1 2 3 4 ..]
    a=Float32(2)
    cpu_axpy_1!(yref,a,x)

    d_y = CuArray(y) #Copy data from CPU to GPU
    d_x = CuArray(x) #Copy data from CPU to GPU
    axpy_function(d_y,a,d_x)
    copyto!(y,d_y) #Copy data back from GPU to CPU
    @test y≈yref #Check result vs reference function

    #time it !
    tgpu = @belapsed begin
        $axpy_function($d_y,$a,$d_x)
        CUDAdrv.synchronize() #try to remove this ;)
        # copyto!($y,$d_y) # add this if you want to count the GPU->CPU transfer
    end

    finalize(d_x) # have pb with out of memory without this
    finalize(d_y) # have pb with out of memory without this


    println("GPU: ",string(axpy_function)," Bandwidth:$(bandwidth_GBs(N,tgpu)) GB/s (t=$tgpu)")
end
function evalgflops_cpu(axpy_function,N)
    y=ones(Float32,N)
    yref=ones(Float32,N)
    x=ones(Float32,N)
    copyto!(x,1:length(x)) # x=[1 2 3 4 ..]
    a=Float32(2)
    cpu_axpy_1!(yref,a,x)

    axpy_function(y,a,x)
    @test y≈yref #Check result vs reference function

    #time it !
    tcpu = @belapsed begin
        $axpy_function($y,$a,$x)
    end

    println("CPU: ",string(axpy_function)," Bandwidth:$(bandwidth_GBs(N,tcpu)) GB/s (t=$tcpu)" )
end



function tst()

   # N=1024*1024*128
   N=1024*1024*8
   evalgflops_cpu(cpu_axpy_2!,N)

   evalgflops_gpu(gpu_axpy_0!,N)
   evalgflops_gpu(gpu_axpy_1!,N)
   evalgflops_gpu(gpu_axpy_2!,N)
   evalgflops_gpu(gpu_axpy_3!,N)
   evalgflops_gpu(gpu_axpy_4!,N)
   evalgflops_gpu(cpu_axpy_2!,N)


   return

end

tst()

Finally I totally agree with you and a centralized tutorial on Julia GPU is still missing. I think that the existing notebooks should be extended.

Hope my point is clearer :slight_smile:


#8

Apart from the fact that CUDAnative’s documentation is severely lacking, it has indeed never been the goal to explain GPGPU programming basics there. The APIs should be very similar though, so @cuchxq if you were to go through a book like “CUDA by example” but using CUDAdrv/CUDAnative instead, that would probably be a good way to learn CUDA programming.

Sorry, missed that question. I’ve been spending some time on CuArrays.jl indeed, and recent improvements to CUDAnative have been mainly driven by the needs of that package.


#9

Yep, I do not complain about that. The community should work collectively on building this documentation on GPGPU with Julia. I will try to contribute.


#10

When I tried this example, I got the error message “CUDAdrv is not configured”.


#11

Please use a separate thread, or file an issue, with more details (versions, MWE, etc).


#12

I agree that the APIs are very similar, once you see it in both languages. However, I can’t know what should I write in Julia when just knowing how to write it in C, there should be at least some cross reference, if not documentation itself, however I do think a documentation would be more clear and self-contained. The point here is not to learn how to write CUDA program, it is how to help users transition to Julia CUDA program.

I agree that the community should work collectively on the documentation. I will contribute as well.

I didn’t mean to derail the topic of this thread. I fully agree that ‘CUDAnative’ is awesome and has eliminated a lot of house keeping code for GPU computing. Thanks Tim for doing an awesome job!


#13

Thanks for sharing! I agree the talk from Tim on JuliaCon 2017 is very helpful. But that was Julia 0.6. A lot of code has changed since Julia 0.7.