Question about coalesced reads and writes to global memory using a 2D grid in CUDA.jl

Hi,
I’m a bit confused by a simple matrix-transpose test using a 2D grid in CUDA.jl. I wrote two transpose kernels whose only difference is the index orientation: in the first function `ix` indexes the rows and `iy` the columns of the matrix, and in the second function it is the other way around.

What confuses me is the difference in runtime between the two functions: the second one runs faster. Based on my limited knowledge, I guess it has something to do with coalesced reads and writes to global memory. Since Julia and CUDA.jl are column major, my guess is that the second function reads from global memory in a coalesced way but does not write in a coalesced way, and that the first function is the opposite; I’m not sure whether I understand this correctly. Assuming a three-by-three thread block is launched, adjacent threads are numbered (1,1), (2,1), and (3,1), and coalesced memory access should occur when they access adjacent elements of the matrix (i.e. the same column).
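
To check my own understanding of the column-major layout, here is a small CPU-side sketch (the matrix `B` and the `LinearIndices` check are only for illustration and are not part of the benchmark below):

B = reshape(1:9, 3, 3)        # column-major: each column is contiguous in memory
lin = LinearIndices(B)
[lin[ix, 1] for ix in 1:3]    # [1, 2, 3] -> threads (1,1),(2,1),(3,1) hit consecutive addresses
[lin[1, iy] for iy in 1:3]    # [1, 4, 7] -> varying the column index gives strided accesses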

  • Why do the runtimes of the two functions differ? Is it because uncoalesced reads are more time-consuming than uncoalesced writes? Does this mean that when programming with CUDA.jl one should pay more attention to coalescing reads than to coalescing writes?

  • In addition, I wrote a third function, based on the second one, that stages data through shared memory so the writes can be coalesced, hoping to speed it up. What confuses me is why the third function runs slower than the second one.

Maybe I misunderstood or missed something. I would expect uncoalesced reads from shared memory to be faster than uncoalesced reads from global memory, so the third function should execute faster than the second.

Thanks!

using CUDA, BenchmarkTools

nrow = 10000  
ncol = 10000
A = CUDA.rand(Float64,ncol,nrow)
A_T = CUDA.rand(Float64,nrow,ncol)

# first function: ix indexes the rows of A, iy the columns
function MatrixTranspose1!(A,A_T)
    nrow = size(A,1)
    ncol = size(A,2)

    ix = (blockIdx().x-1)*blockDim().x+threadIdx().x 
    iy = (blockIdx().y-1)*blockDim().y+threadIdx().y 

    if (ix > nrow) || (iy > ncol)
        return
    end

    A_T[iy,ix] = A[ix,iy]
    return
end

# second function: ix indexes the columns of A, iy the rows
function MatrixTranspose2!(A,A_T)
    nrow = size(A,1)
    ncol = size(A,2)

    ix = (blockIdx().x-1)*blockDim().x+threadIdx().x 
    iy = (blockIdx().y-1)*blockDim().y+threadIdx().y 

    if (ix > ncol ) || (iy > nrow)
        return
    end

    A_T[ix,iy] = A[iy,ix]
    return
end

# third function: like the second, but staging the data through shared memory
function MatrixTranspose3!(A,A_T)
    nrow = size(A,1)
    ncol = size(A,2)

    ix = (blockIdx().x-1)*blockDim().x+threadIdx().x 
    iy = (blockIdx().y-1)*blockDim().y+threadIdx().y 

    ShareMem = @cuDynamicSharedMem(Float64, (blockDim().y,blockDim().x))  # per-block shared-memory tile
    sync_threads()

    if (ix > ncol) || (iy > nrow)
        return
    end

    ShareMem[threadIdx().y,threadIdx().x] = A[iy,ix]

    # linear thread index within the block, converted back into a
    # block-local (Row, Col) pair used to form the output indices
    Index = threadIdx().x + blockDim().x*(threadIdx().y - 1)

    Row = cld(Index, blockDim().x)

    Col = Index % blockDim().x
    if Col == 0
        Col = blockDim().x
    end

    OutX = (blockIdx().x-1)*blockDim().x + Col
    OutY = (blockIdx().y-1)*blockDim().y + Row

    A_T[OutX,OutY] = ShareMem[Row,Col]
    return
end

@btime CUDA.@sync @cuda(
    threads = (32,32), 
    blocks = (cld(size(A,1),32),cld(size(A,2),32)), 
    MatrixTranspose1!(A,A_T)
) #time: 8.604ms

@btime CUDA.@sync @cuda(
    threads = (32,32), 
    blocks = (cld(size(A,2),32),cld(size(A,1),32)), 
    MatrixTranspose2!(A,A_T)
)#time: 6.044ms  
 #Why is it faster than the first function?

@btime CUDA.@sync @cuda(
    threads = (32,32), 
    blocks = (cld(size(A,2),32),cld(size(A,1),32)), 
    shmem=32*32*sizeof(Float64),
    MatrixTranspose3!(A,A_T)
)#time: 8.969ms
 #Why is it slower than the second function?
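
As a quick sanity check after the runs above (illustrative only, not timed), all three kernels should leave A_T holding the transpose of A:

@assert Array(A_T) == permutedims(Array(A))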

The general rule of thumb is that consecutive threads should access consecutive memory locations; see CuArray is Row Major or Column Major? - #2 by maleadt for a similar question.
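
To make that rule concrete, here is a minimal sketch (my own illustration rather than code from the linked thread; the kernel name, the fixed 32×32 tile, and the use of dynamic shared memory are assumptions) of a tile-based transpose in which the tile is transposed while it sits in shared memory, so that consecutive threadIdx().x values index the fastest-varying first dimension for both the global read and the global write:

function MatrixTransposeTiled!(A, A_T)
    tile = @cuDynamicSharedMem(Float64, (32, 32))

    # coalesced read: consecutive threadIdx().x walks down a column of A
    ix = (blockIdx().x - 1) * 32 + threadIdx().x
    iy = (blockIdx().y - 1) * 32 + threadIdx().y
    if ix <= size(A, 1) && iy <= size(A, 2)
        tile[threadIdx().y, threadIdx().x] = A[ix, iy]   # transpose inside the tile
    end
    sync_threads()

    # coalesced write: swap the block coordinates so consecutive threadIdx().x
    # walks down a column of A_T
    ox = (blockIdx().y - 1) * 32 + threadIdx().x
    oy = (blockIdx().x - 1) * 32 + threadIdx().y
    if ox <= size(A_T, 1) && oy <= size(A_T, 2)
        A_T[ox, oy] = tile[threadIdx().x, threadIdx().y]
    end
    return
end

@cuda(
    threads = (32, 32),
    blocks = (cld(size(A, 1), 32), cld(size(A, 2), 32)),
    shmem = 32 * 32 * sizeof(Float64),
    MatrixTransposeTiled!(A, A_T)
)

A common further refinement is to pad the shared-memory tile (e.g. to 33 rows) to avoid bank conflicts, but that is separate from the global-memory coalescing discussed here.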
