Hi everyone,
I have a piece of code to optimize. It is a column-reduce algorithm: it takes a 2D array, sums along the first axis, and writes the result to a 1D output array. I have my code, but it is slow. Does anyone have good ideas on how to improve it within the framework of CUDAnative?
The code:
function col_reduce!(lx::Int64, ly::Int64, mat_in, mat_out)
    sdata = @cuStaticSharedMem(Complex128, 1024)
    tid = threadIdx().x
    # blockIdx().y selects the column; this guard is uniform across the block,
    # so all threads still reach the barriers below
    if blockIdx().y <= ly
        # zero the shared slot so threads with tid > lx contribute nothing
        sdata[tid] = Complex128(0.0)
        # row index, folded over the block in strides of blockDim().x
        i = tid
        while i <= lx
            sdata[tid] += mat_in[i, blockIdx().y]
            i += blockDim().x
        end
        sync_threads()
        # tree reduction in shared memory, halving the stride each step
        s = blockDim().x ÷ 2
        while s > 0
            if (tid - 1) < s
                sdata[tid] += sdata[tid + s]
            end
            sync_threads()
            s ÷= 2
        end
        # first thread writes the column sum to global memory
        if tid == 1
            mat_out[blockIdx().y] = sdata[1]
        end
    end
    return nothing
end
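For reference, the kernel indexes columns with blockIdx().y, so a plausible launch (hypothetical sizes, assuming the keyword-based @cuda syntax used later in this thread) would use one block per column along the y grid dimension:

# hypothetical launch: 1024 threads per block, one block per column
@cuda threads=1024 blocks=(1, ly) col_reduce!(lx, ly, mat_in, mat_out)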
Does anyone have suggestions for libraries that do this?
This isn’t really a place for generic GPU optimization advice… Also, for future reference, please post runnable/benchmarkable examples. You can’t expect people to reverse-engineer sensible launch configurations, especially on a GPU where problem sizes matter a lot.
What’s slow about your code? Did you profile it? How did you try optimizing it? Implementing an efficient reduce/scan isn’t trivial, and is often GPU-specific. You could have a look at the reduce example in CUDAnative, or the GPU-vendor-agnostic one in GPUArrays (previously in CuArrays). Or check out CUDA resources like GPU Gems 3, Chapter 39: Parallel Prefix Sum (Scan) with CUDA.
Hi Tim, thanks for the info, I am reading it now!
By “runnable/benchmarkable examples”, do you mean the entire code with tests? This is the kernel function, and I assumed people could run @cuda (Blah) col_reduce!(…) by themselves. Alright, I will post the entire code later. The kernel has already been optimised to a certain point. I followed this slide deck,
and my version corresponds to reduction #3 in the slides. I found it difficult to implement the further optimisations advised in the slides (with CUDAnative, of course!), and I also do not really understand how device-specific the reductions are.
Do you have plans to implement a flexible block reduction / column reduction in your library later on?
The kind of micro-optimizations detailed in that presentation is pretty sensitive to the GPU you are working with, and as far as I can see that resource stems from the CUDA 1.1 era, so I’m not sure how up to date it is. Be sure to profile and measure every step, and I’d recommend starting from the existing CUDAnative reduce example, which was implemented according to a more recent set of performance guidelines (the blogpost linked above).
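For concreteness, that blogpost replaces shared memory with warp shuffles. A minimal sketch of the warp-level step, assuming CUDAnative’s shfl_down intrinsic (it is used in the CUDAnative reduce example) and shown for Float32; a ComplexF64 reduction would have to shuffle the real and imaginary parts separately:

# sum `val` across a warp of 32 lanes using register shuffles
function warp_reduce(val::Float32)
    offset = 16                        # half the warp size of 32
    while offset > 0
        # add the value held by the lane `offset` positions down the warp
        val += shfl_down(val, offset)
        offset ÷= 2
    end
    return val                         # the first lane ends up with the warp sum
end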
Do you have plans to implement a flexible block reduction / column reduction in your library later on?
No. CUDAnative focuses on the compiler side of things, i.e., Julia and CUDA features. The examples are there to demonstrate functionality, and development of higher-level abstractions mostly takes place over at CuArrays.jl.
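As a point of comparison, CuArrays already exposes column sums through the standard array interface, so a hand-written kernel is only needed if the generic mapreduce turns out to be too slow. A minimal sketch, assuming a Julia 1.0-era CuArrays and illustrative sizes:

using CuArrays

A = CuArray(rand(ComplexF64, 4096, 256))
col_sums = sum(A, dims=1)    # 1×256 CuArray, computed on the device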
#NOTE example call:
# @cuda threads=1024 blocks=LY REDUCE_DIM1_OF_2D_ARRAY(lx, LY, array2d, result)
function REDUCE_DIM1_OF_2D_ARRAY(lx::Int64, ly::Int64, array2d, result)
    # perform the first level of reduction,
    # reading from global memory, writing to shared memory
    sdata = @cuStaticSharedMem(ComplexF64, 1024)
    tid = threadIdx().x    # <= 1024
    sdata[tid] = ComplexF64(0)
    tid_s = threadIdx().x
    if blockIdx().x <= ly
        #NOTE accumulate over the "grid": each thread reduces multiple
        #NOTE elements, striding down its column by the block size,
        #NOTE so the kernel works for general values of lx
        while tid_s <= lx
            @inbounds sdata[tid] += array2d[tid_s, blockIdx().x]
            tid_s += 1024
        end
        sync_threads()
        # do the tree reduction in shared memory, unrolled step by step;
        # every sync_threads() sits outside the branches, since calling it
        # inside a divergent branch is undefined behavior
        if tid <= 512
            @inbounds sdata[tid] += sdata[tid + 512]
        end
        sync_threads()
        if tid <= 256
            @inbounds sdata[tid] += sdata[tid + 256]
        end
        sync_threads()
        if tid <= 128
            @inbounds sdata[tid] += sdata[tid + 128]
        end
        sync_threads()
        if tid <= 64
            @inbounds sdata[tid] += sdata[tid + 64]
        end
        sync_threads()
        if tid <= 32
            @inbounds sdata[tid] += sdata[tid + 32]
        end
        sync_threads()
        if tid <= 16
            @inbounds sdata[tid] += sdata[tid + 16]
        end
        sync_threads()
        if tid <= 8
            @inbounds sdata[tid] += sdata[tid + 8]
        end
        sync_threads()
        if tid <= 4
            @inbounds sdata[tid] += sdata[tid + 4]
        end
        sync_threads()
        if tid <= 2
            @inbounds sdata[tid] += sdata[tid + 2]
        end
        sync_threads()
        # write the result for this block to global memory
        if tid == 1
            @inbounds sdata[1] += sdata[2]
            @inbounds result[blockIdx().x] = sdata[1]
        end
    end
    return nothing
end
This corresponds to reduce #6 in the slides.
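To make this runnable/benchmarkable as requested, here is a minimal test harness (hypothetical sizes; assumes CuArrays for device arrays) that launches the kernel and checks it against a CPU column sum:

using CUDAnative, CuArrays, Test

lx, ly = 4096, 256
A   = CuArray(rand(ComplexF64, lx, ly))
out = CuArray(zeros(ComplexF64, ly))

@cuda threads=1024 blocks=ly REDUCE_DIM1_OF_2D_ARRAY(lx, ly, A, out)
# compare against a CPU reduction of the same data
@test Array(out) ≈ vec(sum(Array(A), dims=1))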