Is CuArray Row Major or Column Major?


#1

And does it really matter for the GPU? I read that it matters if the architecture supports SIMD, but I am not sure whether GPUs support SIMD.


#2

From the GPU’s point of view, you’re just allocating a buffer and indexing it, so there are no predefined rules. CUDAnative’s CuDeviceArray inherits its indexing rules from Julia’s AbstractArray, though, which implies column-major storage: the first index in a slice expression corresponds (in CPU terms) to the fastest-changing one.
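A quick CPU-side illustration of the column-major convention (plain Julia, no GPU required): element `A[i,j]` lives at linear index `i + (j-1)*size(A,1)`, so walking the first index walks consecutive memory.

```julia
# Column-major layout: the first index is the fastest-changing one.
A = reshape(collect(1:12), 3, 4)  # 3×4 matrix backed by 1, 2, ..., 12

# A[i,j] maps to the linear (memory) index i + (j-1)*size(A,1)
i, j = 2, 3
@assert A[i, j] == A[i + (j - 1) * size(A, 1)] == 8

# Walking down a column touches consecutive memory locations:
@assert A[:, 2] == [4, 5, 6]
```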

Does it matter? It depends. If you’re accessing global memory, make sure the threads of a warp access consecutive memory locations, because then memory coalescing kicks in and merges the separate accesses into a single transaction.

For example:

using CUDAdrv, CUDAnative
const N = 512

function kernel_a(arr, n)
    for col in 1:n
        # consecutive values accessed by
        # individual threads in the warp
        arr[threadIdx().x,col] = 42
    end
end

function kernel_b(arr, n)
    for row in 1:n
        # consecutive values accessed by
        # the same thread across iterations
        arr[row,threadIdx().x] = 42
    end
end

d_a = CuArray{Int}((N,N))

for i in 1:10
    println(CUDAdrv.@elapsed begin
        @cuda (1,N) kernel_a(d_a, N)
    end)
end

println()

for i in 1:10
    println(CUDAdrv.@elapsed begin
        @cuda (1,N) kernel_b(d_a, N)
    end)
end

kernel_a is 4 times faster due to memory coalescing. You can confirm this with nvprof, e.g. by recording the global_store_transaction event for both kernels. Kernel B performs 262144 (512*512) store transactions, while kernel A only needs 16384 (512*512/16; I’m not sure why it only coalesces across the half-warp).
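The arithmetic behind those counter values can be checked directly (the divide-by-16 half-warp factor is taken from the nvprof numbers above):

```julia
N = 512
stores_b = N * N        # kernel_b: one store transaction per element
stores_a = N * N ÷ 16   # kernel_a: stores coalesced across the half-warp
@assert (stores_a, stores_b) == (16384, 262144)
```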

This only applies to global memory, though. Worse, for shared memory you sometimes need the opposite layout in order to avoid so-called bank conflicts: accesses within a warp to values that reside in the same memory bank (word addresses that are equal modulo the number of banks) are serialized and cannot be executed concurrently. But that’s a fairly advanced optimization.
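To see why the opposite layout can matter in shared memory, here is a CPU-side sketch (assuming 32 four-byte banks, as on most NVIDIA GPUs) of which bank each access in a warp hits. Reading one row of a column-major 32×32 tile puts all 32 accesses in the same bank; padding the tile to 33 rows spreads them across all banks:

```julia
const NBANKS = 32
bank(idx) = idx % NBANKS  # 0-based linear word index → bank number

# A warp reads row 0 across all 32 columns of a column-major tile:
row = 0
banks_32 = [bank(row + c * 32) for c in 0:31]  # 32×32 tile
banks_33 = [bank(row + c * 33) for c in 0:31]  # 33×32 padded tile

length(unique(banks_32))  # 1  → 32-way bank conflict, fully serialized
length(unique(banks_33))  # 32 → conflict-free
```

This is only a model of the bank-mapping rule, not actual device behavior; the padded-tile trick is the standard way to break the conflict at the cost of a little wasted shared memory.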
