Hi,
I’m a bit confused by a simple test of matrix transpose using a 2D grid in CUDA.jl. I wrote two matrix transpose kernels; the only difference is that in the first function “ix” runs along the matrix rows and “iy” along the matrix columns, while in the second function it is the other way around.
What confuses me is the difference in running time between the two functions: the second one runs faster. Based on my limited knowledge, I guess it has something to do with coalesced reads and writes of global memory. Since Julia and CUDA.jl are column-major, I suspect that the second function performs coalesced reads from global memory but uncoalesced writes, and the first function is the opposite. I don’t know if I understand this correctly. Assuming a three-by-three thread block is launched, the adjacent threads should be (1,1), (2,1), and (3,1), and memory access is coalesced when they access adjacent elements of the matrix (the same column).
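To double-check the column-major layout I’m assuming above, I looked at the strides of a small host array (plain Julia, nothing CUDA-specific; `B` is just a throwaway example array):

```julia
# Julia arrays are column-major: consecutive elements of a column
# are adjacent in memory, so the stride along dimension 1 is 1 element.
B = zeros(Float64, 4, 3)
strides(B)  # (1, 4): step 1 element down a column, 4 elements across a row
```

So threads (1,1), (2,1), (3,1) reading down one column touch consecutive memory addresses.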

Why do the runtimes of the two functions differ? Is it because uncoalesced reads are more time-consuming than uncoalesced writes? Does this mean that when programming with CUDA.jl one should pay more attention to coalescing reads than to coalescing writes?

In addition, I wrote a third function based on the second one, using shared memory so that the writes become coalesced, hoping to speed it up. What confuses me is why the third function runs slower than the second.
Maybe I misunderstood or missed something: I thought that uncoalesced reads from shared memory should be faster than from global memory, so the third function should execute faster than the second.
Thanks!
using CUDA, BenchmarkTools

nrow = 10000
ncol = 10000
A = CUDA.rand(Float64, ncol, nrow)
A_T = CUDA.rand(Float64, nrow, ncol)
# first function: ix along rows, iy along columns
function MatrixTranspose1!(A, A_T)
    nrow = size(A, 1)
    ncol = size(A, 2)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if (ix > nrow) || (iy > ncol)
        return
    end
    A_T[iy, ix] = A[ix, iy]
    return
end
# second function: ix along columns, iy along rows
function MatrixTranspose2!(A, A_T)
    nrow = size(A, 1)
    ncol = size(A, 2)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if (ix > ncol) || (iy > nrow)
        return
    end
    A_T[ix, iy] = A[iy, ix]
    return
end
# third function: like the second, but staged through shared memory
function MatrixTranspose3!(A, A_T)
    nrow = size(A, 1)
    ncol = size(A, 2)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    ShareMem = @cuDynamicSharedMem(Float64, (blockDim().y, blockDim().x))
    sync_threads()
    if (ix > ncol) || (iy > nrow)
        return
    end
    ShareMem[threadIdx().y, threadIdx().x] = A[iy, ix]
    Index = threadIdx().x + blockDim().x * (threadIdx().y - 1)
    Row = cld(Index, blockDim().x)
    Col = Index % blockDim().x
    if Col == 0
        Col = blockDim().x
    end
    OutX = (blockIdx().x - 1) * blockDim().x + Col
    OutY = (blockIdx().y - 1) * blockDim().y + Row
    A_T[OutX, OutY] = ShareMem[Row, Col]
    return
end
@btime CUDA.@sync @cuda(
    threads = (32, 32),
    blocks = (cld(size(A, 1), 32), cld(size(A, 2), 32)),
    MatrixTranspose1!(A, A_T)
) # time: 8.604 ms
@btime CUDA.@sync @cuda(
    threads = (32, 32),
    blocks = (cld(size(A, 2), 32), cld(size(A, 1), 32)),
    MatrixTranspose2!(A, A_T)
) # time: 6.044 ms
# Why is it faster than the first function?
@btime CUDA.@sync @cuda(
    threads = (32, 32),
    blocks = (cld(size(A, 2), 32), cld(size(A, 1), 32)),
    shmem = 32 * 32 * sizeof(Float64),
    MatrixTranspose3!(A, A_T)
) # time: 8.969 ms
# Why is it slower than the second function?