Optimizing column reduce with CUDAnative

Hi everyone,

I have a piece of code to optimize. It is a column reduce: it takes a 2d array, sums over the first axis, and writes the result to a 1d output array. I have a working version but it is slow. Does anyone have good ideas on improving it within the framework of CUDAnative?

The code:

using CUDAnative

function col_reduce!(lx::Int64, ly::Int64, mat_in, mat_out)
    sdata = @cuStaticSharedMem(ComplexF64, 1024)
    tid = threadIdx().x
    # Zero every thread's shared slot, so the tree reduction below never
    # reads uninitialized memory when lx < blockDim().x.
    sdata[tid] = ComplexF64(0)
    if blockIdx().y <= ly
        # Each block handles one column: accumulate the rows of that
        # column into shared memory, striding by the block size.
        i = tid
        while i <= lx
            sdata[tid] += mat_in[i, blockIdx().y]
            i += blockDim().x
        end
        sync_threads()
        # Tree reduction in shared memory.
        s = blockDim().x ÷ 2
        while s > 0
            if tid <= s
                sdata[tid] += sdata[tid + s]
            end
            sync_threads()
            s ÷= 2
        end
        # Thread 1 writes this column's sum to global memory.
        if tid == 1
            mat_out[blockIdx().y] = sdata[1]
        end
    end
    return nothing
end

Does anyone have suggestions for libraries that do this?

This isn’t really the place for generic GPU optimization advice… Also, for future reference, please post runnable/benchmarkable examples. You can’t expect people to reverse-engineer sensible launch configurations, especially on a GPU where problem sizes matter a lot.

What’s slow about your code? Did you profile it? How did you try optimizing it? Implementing an efficient reduce/scan isn’t trivial, and is often GPU-specific. You could have a look at the reduce example in CUDAnative, or the GPU-vendor-agnostic one in GPUArrays (previously in CuArrays). Or check out CUDA resources like GPU Gems 3, Chapter 39: Parallel Prefix Sum (Scan) with CUDA.
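As a starting point for timing: kernel launches are asynchronous, so you have to synchronize before reading the clock or you only measure launch overhead. A rough sketch (the sizes are made up, and since col_reduce! indexes columns with blockIdx().y the grid is (1, ly)):

using CuArrays, CUDAnative, CUDAdrv

lx, ly = 2048, 512
d_in  = CuArray(rand(ComplexF64, lx, ly))
d_out = CuArray(zeros(ComplexF64, ly))

# Warm-up launch, so compilation is not included in the measurement.
@cuda threads=1024 blocks=(1, ly) col_reduce!(lx, ly, d_in, d_out)
CUDAdrv.synchronize()

# Synchronize inside the timed region: @cuda returns as soon as the
# kernel has been queued, not when it has finished.
t = Base.@elapsed begin
    @cuda threads=1024 blocks=(1, ly) col_reduce!(lx, ly, d_in, d_out)
    CUDAdrv.synchronize()
end
println("kernel time: ", t, " s")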

Hi Tim, thanks for the info, I am reading that!
By “runnable/benchmarkable examples”, do you mean the entire code with a test? This is just the kernel function, and I assumed people could run @cuda (Blah) col_reduce!(…) by themselves. Alright, I will post the entire code later. This has already been optimised to a certain point; I followed this slide

and my version corresponds to reduction #3 in the slides. I found it difficult to implement the further optimisations advised there (with CUDAnative, of course!), and I also do not really understand how device-specific the reductions are.

Do you have plans to implement a flexible block reduction / column reduction in your library later on?

The kind of micro-optimizations detailed in that presentation are pretty sensitive to the kind of GPU you are working with, and as far as I can see that resource stems from the CUDA 1.1 era, so I’m not sure how up to date it is. Be sure to profile and measure every step, and I’d recommend starting from the existing CUDAnative reduce example, which has been implemented according to a more recent set of performance guidelines (the blog post linked above).
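To give a flavour of those guidelines: the final levels of the reduction can use shuffle instructions instead of shared memory. A minimal sketch of a warp-level sum, assuming CUDAnative’s shfl_down intrinsic handles Float64 (complex values would need their real and imaginary parts shuffled separately):

using CUDAnative

# Warp-level sum via shuffles: each step halves the number of
# participating lanes; afterwards the first lane of the warp holds the
# sum of all 32 lanes. No shared memory or sync_threads() needed.
# (Assumes CUDAnative's shfl_down intrinsic supports Float64.)
function warp_reduce(val::Float64)
    offset = 16
    while offset > 0
        val += shfl_down(val, offset)
        offset ÷= 2
    end
    return val
end

A full column reduce still has to combine the per-warp results (e.g. through one shared-memory slot per warp), but the fragile last-warp unrolling from the slides disappears.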

Do you have plans to implement a flexible block reduction / column reduction in your library later on?

No. CUDAnative focuses on the compiler side of things, i.e. Julia and CUDA language features. The examples are there to demonstrate functionality, and development of higher-level abstractions mostly takes place over at CuArrays.jl.
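For completeness, the high-level route through CuArrays would look like this (a sketch, assuming its generic mapreduce machinery covers complex element types):

using CuArrays

d_mat = CuArray(rand(ComplexF64, 2048, 512))
d_col = sum(d_mat, dims=1)   # 1×512 CuArray, reduced over the first axis
result = vec(Array(d_col))   # back to a length-512 Vector on the CPU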

#NOTE example call:
# @cuda threads=1024 blocks=LY REDUCE_DIM1_OF_2D_ARRAY(lx,LY,array2d,result)
function REDUCE_DIM1_OF_2D_ARRAY(lx::Int64, ly::Int64, array2d, result)
    # Perform the first level of reduction, reading from global memory
    # and writing to shared memory.
    sdata = @cuStaticSharedMem(ComplexF64, 1024)
    tid = threadIdx().x  # <= 1024
    sdata[tid] = ComplexF64(0)
    tid_s = threadIdx().x
    if blockIdx().x <= ly
        # We reduce multiple elements per thread: each thread strides
        # through its block's column in steps of the block size (1024),
        # so the kernel handles general values of lx.
        while tid_s <= lx
            @inbounds sdata[tid] += array2d[tid_s, blockIdx().x]
            tid_s += 1024
        end
        sync_threads()
        # Do the reduction in shared memory, fully unrolled. Each step
        # is guarded separately so that sync_threads() is always reached
        # by the whole block; calling it inside a divergent branch, as
        # the CUDA-C original does for the last warp, is undefined
        # behavior.
        if tid <= 512
            @inbounds sdata[tid] += sdata[tid + 512]
        end
        sync_threads()
        if tid <= 256
            @inbounds sdata[tid] += sdata[tid + 256]
        end
        sync_threads()
        if tid <= 128
            @inbounds sdata[tid] += sdata[tid + 128]
        end
        sync_threads()
        if tid <= 64
            @inbounds sdata[tid] += sdata[tid + 64]
        end
        sync_threads()
        if tid <= 32
            @inbounds sdata[tid] += sdata[tid + 32]
        end
        sync_threads()
        if tid <= 16
            @inbounds sdata[tid] += sdata[tid + 16]
        end
        sync_threads()
        if tid <= 8
            @inbounds sdata[tid] += sdata[tid + 8]
        end
        sync_threads()
        if tid <= 4
            @inbounds sdata[tid] += sdata[tid + 4]
        end
        sync_threads()
        if tid <= 2
            @inbounds sdata[tid] += sdata[tid + 2]
        end
        sync_threads()
        if tid <= 1
            @inbounds sdata[tid] += sdata[tid + 1]
        end
        # Write the result for this block (= this column) to global memory.
        if tid == 1
            @inbounds result[blockIdx().x] = sdata[1]
        end
    end
    return nothing
end

This corresponds to reduce6 in the slides.
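And something like the following should run it end to end (a minimal sketch; the sizes are arbitrary, and the result is checked against a CPU sum):

using CUDAnative, CuArrays, CUDAdrv

lx, ly = 4096, 256
h_in  = rand(ComplexF64, lx, ly)
d_in  = CuArray(h_in)
d_out = CuArray(zeros(ComplexF64, ly))

# One block per column, 1024 threads per block, as in the NOTE above.
@cuda threads=1024 blocks=ly REDUCE_DIM1_OF_2D_ARRAY(lx, ly, d_in, d_out)
CUDAdrv.synchronize()

# Check against a plain CPU reduction over the first axis.
@assert Array(d_out) ≈ vec(sum(h_in, dims=1))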