Hi,

I apologize because my statement is wrong or at least unclear. AFAIK, there is no centralized Julia GPGPU programming guide for beginners, and it is almost impossible to learn GPGPU only from Julia based documentation.

What I wanted to say is that Julia GPU tools (especially CUDANative.jl) makes GPGPU learning much easier an efficient in different ways:

- Large reduction of the GPGPU setup cost compared to C/C++ CUDA programming where you have to make all the compilation/installation setup (environement.CMake, test of driversā¦). On my ubuntu system, once I have installed the CUDA tookit package (apt-get install nvidia-cuda-toolkit), everything works nicely from Julia.
- Nice integration with Julia native Arrays: the syntax to create and transfer a Julia Array to the GPU and the copy back to the CPU is transparent and intuitive:

```
n=1024
a=ones(Float32,n,n) # normal CPU matrix of float
d_a = CuArray(a) # copy to a GPU
copyto!(a,d_a) # copy back from CPU to GPU
```

- Compared to C/C++ dynamic language like Julia (this is also true for PyCUDA) accelerates experiments on optimal threads and block numbers.
- CUDA kernels syntax is Julia syntax for arrays (real multi-dim arrays). For example, compare the (excellent)

Mark Harris tutorials nvidia_transpose, example to its Julia translation:

```
__global__ void copySharedMem(float *odata, const float *idata)
{
__shared__ float tile[TILE_DIM * TILE_DIM];
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
int width = gridDim.x * TILE_DIM;
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];
__syncthreads();
for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];
}
```

with CUDAnative:

```
function kernel_transpose_tile_shared_bank!(target,source)
tile = @cuStaticSharedMem(Float32, (TILE_DIM+1,TILE_DIM))
x=(blockIdx().x-1)*TILE_DIM+threadIdx().x
y=(blockIdx().y-1)*TILE_DIM+threadIdx().y
for j=0:BLOCK_ROWS:(TILE_DIM-1)
tile[threadIdx().x,threadIdx().y+j]=source[x,y+j]
end
y=(blockIdx().x-1)*TILE_DIM+threadIdx().y
x=(blockIdx().y-1)*TILE_DIM+threadIdx().x
CUDAnative.sync_threads()
for j=0:BLOCK_ROWS:(TILE_DIM-1)
target[x,y+j]=tile[threadIdx().y+j,threadIdx().x]
end
return nothing
end
```

Note that with CuArrays you could call a Julia method on these GPU matrices (but it is slower)

permutedims!(target,source,(2,1))

Yesterday I made a short introduction to GPGPU for beginners (with different backgrounds C/Fortran/python/Matlab) and I found Julia to be very convenient to put the focus on fundamentals of GPGPU programming avoiding nasty details. Here is the complete code introducing different steps for explaining the implementation of a basic axpy on GPU.

- gpu_axpy_0 shows how individual GPU thread is slow
- gpu_axpy_1 shows that consecutive thread must address consecutive array elts
- gpu_axpy_2 show the improvement for correct coalesced access
- gpu_axpy_3 allows for very large number of blocks
- gpu_axpy_4 shows the final implementation:
- cpu_axpy_2 shows that you can use Julia broadcast syntax on GPU arrays

I obtain the following results:

```
CPU: cpu_axpy_2! Bandwidth:28.892053854678245 GB/s (t=0.003484117)
GPU: gpu_axpy_0! Bandwidth:0.02758891863720712 GB/s (t=3.648685812)
GPU: gpu_axpy_1! Bandwidth:1.7731071085263668 GB/s (t=0.056772259)
GPU: gpu_axpy_2! Bandwidth:6.068441898078472 GB/s (t=0.016587997)
GPU: gpu_axpy_3! Bandwidth:268.05017854337365 GB/s (t=0.000375539)
GPU: gpu_axpy_4! Bandwidth:259.418751965034 GB/s (t=0.000388034)
GPU: cpu_axpy_2! Bandwidth:266.59064871092045 GB/s (t=0.000377595)
```

In Julia I can provide all the code that you can copy and execute on your machine (with C I could only indicate a link for download an installation setup instructionsā¦).

```
using Pkg
using Test,BenchmarkTools
using CuArrays,CUDAnative
import CUDAdrv
#y=a*X+y
function cpu_axpy_1!(y,a,x)
for i=1:length(x)
y[i]+=a*x[i]
end
end
function cpu_axpy_2!(y,a,x)
y .+= a.*x
end
function kernel_axpy_0!(y,a,x,N)
for i=1:N
y[i]+=a*x[i]
end
end
function gpu_axpy_0!(y,a,x)
total_size=length(x)
@cuda blocks=1 threads=1 kernel_axpy_0!(y,a,x,total_size)
end
function kernel_axpy_1!(y,a,x,total_size,block_size)
thread_id=threadIdx().x
thread_number=blockDim().x
offset=(thread_id-1)*block_size
for i=1:block_size
y[i+offset]+=a*x[i+offset]
end
end
function gpu_axpy_1!(y,a,x)
thread_number=256
total_size=length(x)
block_size=div(total_size,thread_number)
@cuda blocks=1 threads=thread_number kernel_axpy_1!(y,a,x,total_size,block_size)
end
function kernel_axpy_2!(y,a,x,total_size)
thread_id=threadIdx().x
thread_number=blockDim().x
for i=thread_id:thread_number:total_size
y[i]+=a*x[i]
end
end
function gpu_axpy_2!(y,a,x)
thread_number=256
total_size=length(x)
@cuda blocks=1 threads=thread_number kernel_axpy_2!(y,a,x,total_size)
end
const THREAD_NUMBER=256
function kernel_axpy_3!(y,a,x)
thread_id=threadIdx().x
offset=(blockDim().x)*(blockIdx().x-1)
@inbounds y[thread_id+offset]+=a*x[thread_id+offset]
return
end
function gpu_axpy_3!(y,a,x)
nthreads=256
total_size=length(x)
(block_number,rem)=divrem(total_size,nthreads)
@assert rem==0
# @show block_number,THREAD_NUMBER,total_size
@cuda blocks=block_number threads=nthreads kernel_axpy_3!(y,a,x)
end
function kernel_axpy_4!(y,a,x,total_size)
offset=threadIdx().x+blockDim().x*(blockIdx().x-1)
stride=blockDim().x*gridDim().x
for i=offset:stride:total_size
@inbounds y[i]+=a*x[i]
end
return
end
function gpu_axpy_4!(y,a,x)
total_size=length(x)
# block_number=div(total_size+THREAD_NUMBER-1,THREAD_NUMBER)
block_number=1024
# @show block_number,THREAD_NUMBER,total_size
@cuda blocks=block_number threads=THREAD_NUMBER kernel_axpy_4!(y,a,x,total_size)
end
#2 read + 1 write of Float32 per element
bandwidth_GBs(n,ts) = n*sizeof(Float32)*3/(ts*1.e9)
function evalgflops_gpu(axpy_function,N)
y=ones(Float32,N)
yref=ones(Float32,N)
x=ones(Float32,N)
copyto!(x,1:length(x)) # x=[1 2 3 4 ..]
a=Float32(2)
cpu_axpy_1!(yref,a,x)
d_y = CuArray(y) #Copy data from CPU to GPU
d_x = CuArray(x) #Copy data from CPU to GPU
axpy_function(d_y,a,d_x)
copyto!(y,d_y) #Copy data back from GPU to CPU
@test yāyref #Check result vs reference function
#time it !
tgpu = @belapsed begin
$axpy_function($d_y,$a,$d_x)
CUDAdrv.synchronize() #try to remove this ;)
# copyto!($y,$d_y) # add this if you want to count the GPU->CPU transfer
end
finalize(d_x) # have pb with out of memory without this
finalize(d_y) # have pb with out of memory without this
println("GPU: ",string(axpy_function)," Bandwidth:$(bandwidth_GBs(N,tgpu)) GB/s (t=$tgpu)")
end
function evalgflops_cpu(axpy_function,N)
y=ones(Float32,N)
yref=ones(Float32,N)
x=ones(Float32,N)
copyto!(x,1:length(x)) # x=[1 2 3 4 ..]
a=Float32(2)
cpu_axpy_1!(yref,a,x)
axpy_function(y,a,x)
@test yāyref #Check result vs reference function
#time it !
tcpu = @belapsed begin
$axpy_function($y,$a,$x)
end
println("CPU: ",string(axpy_function)," Bandwidth:$(bandwidth_GBs(N,tcpu)) GB/s (t=$tcpu)" )
end
function tst()
# N=1024*1024*128
N=1024*1024*8
evalgflops_cpu(cpu_axpy_2!,N)
evalgflops_gpu(gpu_axpy_0!,N)
evalgflops_gpu(gpu_axpy_1!,N)
evalgflops_gpu(gpu_axpy_2!,N)
evalgflops_gpu(gpu_axpy_3!,N)
evalgflops_gpu(gpu_axpy_4!,N)
evalgflops_gpu(cpu_axpy_2!,N)
return
end
tst()
```

Finally I totally agree with you and a centralized tutorial on Julia GPU is still missing. I think that the existing notebooks should be extended.

Hope my point is clearer