Question about coalesced reads and writes to global memory using a 2D grid in CUDA.jl

Hi,
I’m a bit confused by a simple matrix-transpose test using a 2D grid in CUDA.jl. I wrote two transpose kernels whose only difference is the index orientation: in the first function `ix` indexes the rows and `iy` the columns of the matrix, and in the second function it is the other way around.

What confuses me is the difference in runtime between the two functions: the second one runs faster. Based on my limited knowledge, I guess it has something to do with coalesced reads and writes to global memory. Since Julia and CUDA.jl are column major, my guess is that the second function reads from global memory in a coalesced way but does not write in a coalesced way, and that the first function is the opposite; I’m not sure whether I understand this correctly. Assuming a three-by-three thread block is launched, adjacent threads are numbered (1,1), (2,1), and (3,1), and coalesced memory access should occur when they access adjacent elements of the matrix (i.e. the same column).
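
To check my own understanding of the column-major layout, here is a small CPU-side sketch (the matrix `B` and the `LinearIndices` check are only for illustration and are not part of the benchmark below):

B = reshape(1:9, 3, 3)        # column-major: each column is contiguous in memory
lin = LinearIndices(B)
[lin[ix, 1] for ix in 1:3]    # [1, 2, 3] -> threads (1,1),(2,1),(3,1) hit consecutive addresses
[lin[1, iy] for iy in 1:3]    # [1, 4, 7] -> varying the column index gives strided accesses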

  • Why do the runtimes of the two functions differ? Is it because uncoalesced reads are more time-consuming than uncoalesced writes? Does this mean that when programming with CUDA.jl one should pay more attention to coalescing reads than to coalescing writes?

  • In addition, I wrote a third function, based on the second one, that stages data through shared memory so the writes can be coalesced, hoping to speed it up. What confuses me is why the third function runs slower than the second one.

Maybe I misunderstood or missed something. I would expect uncoalesced reads from shared memory to be faster than uncoalesced reads from global memory, so the third function should execute faster than the second.

Thanks!

using CUDA, BenchmarkTools

nrow = 10000  
ncol = 10000
A = CUDA.rand(Float64,ncol,nrow)
A_T = CUDA.rand(Float64,nrow,ncol)

# first function: ix indexes the rows of A, iy the columns
function MatrixTranspose1!(A,A_T)
    nrow = size(A,1)
    ncol = size(A,2)

    ix = (blockIdx().x-1)*blockDim().x+threadIdx().x 
    iy = (blockIdx().y-1)*blockDim().y+threadIdx().y 

    if (ix > nrow) || (iy > ncol)
        return
    end

    A_T[iy,ix] = A[ix,iy]
    return
end

# second function: ix indexes the columns of A, iy the rows
function MatrixTranspose2!(A,A_T)
    nrow = size(A,1)
    ncol = size(A,2)

    ix = (blockIdx().x-1)*blockDim().x+threadIdx().x 
    iy = (blockIdx().y-1)*blockDim().y+threadIdx().y 

    if (ix > ncol ) || (iy > nrow)
        return
    end

    A_T[ix,iy] = A[iy,ix]
    return
end

# third function: like the second, but staging the data through shared memory
function MatrixTranspose3!(A,A_T)
    nrow = size(A,1)
    ncol = size(A,2)

    ix = (blockIdx().x-1)*blockDim().x+threadIdx().x 
    iy = (blockIdx().y-1)*blockDim().y+threadIdx().y 

    ShareMem = @cuDynamicSharedMem(Float64, (blockDim().y,blockDim().x))  # per-block shared-memory tile
    sync_threads()

    if (ix > ncol) || (iy > nrow)
        return
    end

    ShareMem[threadIdx().y,threadIdx().x] = A[iy,ix]

    # linear thread index within the block, converted back into a
    # block-local (Row, Col) pair used to form the output indices
    Index = threadIdx().x + blockDim().x*(threadIdx().y - 1)

    Row = cld(Index, blockDim().x)

    Col = Index % blockDim().x
    if Col == 0
        Col = blockDim().x
    end

    OutX = (blockIdx().x-1)*blockDim().x + Col
    OutY = (blockIdx().y-1)*blockDim().y + Row

    A_T[OutX,OutY] = ShareMem[Row,Col]
    return
end

@btime CUDA.@sync @cuda(
    threads = (32,32), 
    blocks = (cld(size(A,1),32),cld(size(A,2),32)), 
    MatrixTranspose1!(A,A_T)
) #time: 8.604ms

@btime CUDA.@sync @cuda(
    threads = (32,32), 
    blocks = (cld(size(A,2),32),cld(size(A,1),32)), 
    MatrixTranspose2!(A,A_T)
)#time: 6.044ms  
 #Why is it faster than the first function?

@btime CUDA.@sync @cuda(
    threads = (32,32), 
    blocks = (cld(size(A,2),32),cld(size(A,1),32)), 
    shmem=32*32*sizeof(Float64),
    MatrixTranspose3!(A,A_T)
)#time: 8.969ms
 #Why is it slower than the second function?
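
As a quick sanity check after the runs above (illustrative only, not timed), all three kernels should leave A_T holding the transpose of A:

@assert Array(A_T) == permutedims(Array(A))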

The general rule of thumb is that consecutive threads should access consecutive memory locations; see CuArray is Row Major or Column Major? - #2 by maleadt for a similar question.
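
To make that rule concrete, here is a minimal sketch (my own illustration rather than code from the linked thread; the kernel name, the fixed 32×32 tile, and the use of dynamic shared memory are assumptions) of a tile-based transpose in which the tile is transposed while it sits in shared memory, so that consecutive threadIdx().x values index the fastest-varying first dimension for both the global read and the global write:

function MatrixTransposeTiled!(A, A_T)
    tile = @cuDynamicSharedMem(Float64, (32, 32))

    # coalesced read: consecutive threadIdx().x walks down a column of A
    ix = (blockIdx().x - 1) * 32 + threadIdx().x
    iy = (blockIdx().y - 1) * 32 + threadIdx().y
    if ix <= size(A, 1) && iy <= size(A, 2)
        tile[threadIdx().y, threadIdx().x] = A[ix, iy]   # transpose inside the tile
    end
    sync_threads()

    # coalesced write: swap the block coordinates so consecutive threadIdx().x
    # walks down a column of A_T
    ox = (blockIdx().y - 1) * 32 + threadIdx().x
    oy = (blockIdx().x - 1) * 32 + threadIdx().y
    if ox <= size(A_T, 1) && oy <= size(A_T, 2)
        A_T[ox, oy] = tile[threadIdx().x, threadIdx().y]
    end
    return
end

@cuda(
    threads = (32, 32),
    blocks = (cld(size(A, 1), 32), cld(size(A, 2), 32)),
    shmem = 32 * 32 * sizeof(Float64),
    MatrixTransposeTiled!(A, A_T)
)

A common further refinement is to pad the shared-memory tile (e.g. to 33 rows) to avoid bank conflicts, but that is separate from the global-memory coalescing discussed here.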
