I have a piece of code to optimize. It is a column-reduce algorithm: it takes a 2D array, sums along the first axis, and writes the result to a 1D output array. I have my code below, but it is slow. Does anyone have good ideas for improving it within the framework of CUDAnative?
    function col_reduce!(lx::Int64, ly::Int64, mat_in, mat_out)
        sdata = @cuStaticSharedMem(Complex128, 1024)
        tid = threadIdx().x   # row index, folded over the block
        col = blockIdx().y    # column index (one block per column)
        if col <= ly
            # each thread accumulates a strided partial sum over its rows;
            # threads with tid > lx contribute zero, so sdata is always initialized
            acc = Complex128(0.0)
            i = tid
            while i <= lx
                acc += mat_in[i, col]
                i += blockDim().x
            end
            sdata[tid] = acc
            # every thread of the block reaches this barrier
            sync_threads()
            # do reduction in shared mem (assumes blockDim().x is a power of two)
            s = blockDim().x ÷ 2
            while s > 0
                if (tid - 1) < s
                    sdata[tid] += sdata[tid + s]
                end
                sync_threads()
                s = s ÷ 2
            end
            # write result
            if tid == 1
                mat_out[col] = sdata[1]
            end
        end
        return nothing
    end
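To check the kernel's output, it may help to have a CPU version of the same scheme. The following is only a minimal sketch in plain Julia, not something provided by CUDAnative: the names `col_reduce_ref` and `nthreads` are made up for illustration, and `nthreads` stands in for `blockDim().x` (assumed to be a power of two, as in the tree reduction of the kernel).

```julia
# CPU emulation of one block per column: strided partial sums, then a tree reduction.
# `col_reduce_ref` and `nthreads` are hypothetical names used only for this sketch.
function col_reduce_ref(mat_in::AbstractMatrix, nthreads::Int=8)
    # the tree reduction below assumes a power-of-two "block size"
    @assert nthreads > 0 && (nthreads & (nthreads - 1)) == 0
    lx, ly = size(mat_in)
    mat_out = zeros(eltype(mat_in), ly)
    sdata = zeros(eltype(mat_in), nthreads)   # stands in for shared memory
    for col in 1:ly
        # phase 1: "thread" tid sums rows tid, tid + nthreads, tid + 2nthreads, ...
        for tid in 1:nthreads
            acc = zero(eltype(mat_in))
            i = tid
            while i <= lx
                acc += mat_in[i, col]
                i += nthreads
            end
            sdata[tid] = acc
        end
        # phase 2: halve the active range until sdata[1] holds the column sum
        s = nthreads ÷ 2
        while s > 0
            for tid in 1:s
                sdata[tid] += sdata[tid + s]
            end
            s = s ÷ 2
        end
        mat_out[col] = sdata[1]
    end
    return mat_out
end

# small complex test matrix with columns (1,2,3,4), (5,6,7,8), (9,10,11,12)
a = reshape(collect(1.0:12.0), 4, 3) .+ 0.0im
println(col_reduce_ref(a, 2))   # column sums are 10, 26 and 42
```

Comparing `mat_out` copied back from the GPU against `col_reduce_ref(mat_in)` on the host should show whether the kernel is correct before any tuning for speed.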