And does it really matter for the GPU? I read that it matters if the architecture supports SIMD, but I'm not sure whether GPUs support SIMD.
From the GPU’s point of view, you’re just allocating a buffer and indexing it, so there are no predefined rules. CUDAnative’s CuDeviceArray inherits its indexing rules from Julia’s AbstractArray though, which implies column-major storage: the first index should (in CPU terms) correspond to the fastest-changing one.
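Concretely, on the CPU side that column-major rule looks roughly like this:

A = reshape(collect(1:12), 3, 4)  # backing memory holds 1, 2, ..., 12 in order
A[2,1] == 2   # stepping the first (row) index moves to the adjacent memory slot
A[1,2] == 4   # stepping the second (column) index skips a whole column (3 slots)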
Does it matter? It depends. If you’re accessing global memory, make sure the threads of a warp access consecutive memory locations, because then memory coalescing kicks in and merges the different accesses into a single transaction.
For example:
using CUDAdrv, CUDAnative
const N = 512
function kernel_a(arr, n)
    for col in 1:n
        # consecutive values accessed by
        # individual threads in the warp
        arr[threadIdx().x,col] = 42
    end
end
function kernel_b(arr, n)
    for row in 1:n
        # consecutive values accessed by
        # the same thread across iterations
        arr[row,threadIdx().x] = 42
    end
end
d_a = CuArray{Int}((N,N))
for i in 1:10
    println(CUDAdrv.@elapsed begin
        @cuda (1,N) kernel_a(d_a, N)
    end)
end
println()
for i in 1:10
    println(CUDAdrv.@elapsed begin
        @cuda (1,N) kernel_b(d_a, N)
    end)
end
kernel_a is 4 times faster due to memory coalescing. You can confirm this by looking at nvprof, e.g. recording the global_store_transaction event for both kernels. Kernel B performs 262144 (512*512) store transactions, while kernel A only needs 16384 (512*512/16, not sure why it only coalesces across the half-warp).
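If you want to reproduce those counts, something along the lines of nvprof --events global_store_transaction julia coalescing.jl should do (coalescing.jl being whatever file holds the snippet above); nvprof then reports the event count per kernel launch.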
This only applies to global memory though. Worse, for shared memory you sometimes need to do the inverse in order to avoid so-called bank conflicts: when threads within a warp access different values that map onto the same memory bank (word addresses that are equal modulo the number of banks), those accesses are serialized and cannot be executed concurrently. But that’s a pretty advanced optimization.
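To give a rough idea (just a sketch, assuming CUDAnative’s @cuStaticSharedMem and sync_threads and a single 32-thread block; kernel_padded and d_out are made-up names): with a plain 32x32 Float32 tile in shared memory, threads accessing tile[j,threadIdx().x] for a fixed j hit word addresses exactly 32 apart, which all map to the same bank and serialize into a 32-way conflict. Padding the first dimension to 33 shifts consecutive threads into different banks:

function kernel_padded(out)
    tid = threadIdx().x
    # 33 rows instead of 32: the extra row only exists to push the
    # column stride off a multiple of the bank count
    tile = @cuStaticSharedMem(Float32, (33, 32))
    for j in 1:32
        # fixed j, varying tid: stride-33 word accesses, one bank per thread
        tile[j,tid] = 42f0
    end
    sync_threads()
    for j in 1:32
        # conflict-free shared load, coalesced global store
        out[tid,j] = tile[j,tid]
    end
    return
end

Launched as @cuda (1,32) kernel_padded(d_out) with d_out a 32x32 CuArray{Float32}; drop the padding (a plain (32, 32) tile) and every iteration of the first loop turns into a 32-way bank conflict.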