And does it really matter for the GPU? I read that it matters if the architecture supports SIMD, but I'm not sure whether GPUs support SIMD.
From the GPU’s point of view, you’re just allocating a buffer and indexing it, so there are no predefined rules. CUDAnative’s CuDeviceArray inherits its indexing rules from Julia’s AbstractArray though, which implies column-major storage: the first index should (in CPU terms) correspond to the fastest-changing one.
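Concretely, on the CPU side that column-major rule looks roughly like this:

A = reshape(collect(1:12), 3, 4)  # backing memory holds 1, 2, ..., 12 in order
A[2,1] == 2   # stepping the first (row) index moves to the adjacent memory slot
A[1,2] == 4   # stepping the second (column) index skips a whole column (3 slots)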
Does it matter? It depends. If you’re accessing global memory, make sure the threads of a warp access consecutive memory locations, because then memory coalescing kicks in and merges the different accesses into a single transaction.
For example:
using CUDAdrv, CUDAnative
const N = 512
function kernel_a(arr, n)
    for col in 1:n
        # consecutive values accessed by
        # individual threads in the warp
        arr[threadIdx().x,col] = 42
    end
end
function kernel_b(arr, n)
    for row in 1:n
        # consecutive values accessed by
        # the same thread across iterations
        arr[row,threadIdx().x] = 42
    end
end
d_a = CuArray{Int}((N,N))
for i in 1:10
    println(CUDAdrv.@elapsed begin
        @cuda (1,N) kernel_a(d_a, N)
    end)
end
println()
for i in 1:10
    println(CUDAdrv.@elapsed begin
        @cuda (1,N) kernel_b(d_a, N)
    end)
end
kernel_a is 4 times faster due to memory coalescing. You can confirm this by looking at nvprof, e.g. recording the global_store_transaction event for both kernels. Kernel B performs 262144 (512*512) store transactions, while kernel A only needs 16384 (512*512/16, not sure why it only coalesces across the half-warp).
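If you want to reproduce those counts, something along the lines of nvprof --events global_store_transaction julia coalescing.jl should do (coalescing.jl being whatever file holds the snippet above); nvprof then reports the event count per kernel launch.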
This only applies to global memory though. Worse, for shared memory you sometimes need to do the inverse in order to avoid so-called bank conflicts: when threads within a warp access different values that map onto the same memory bank (word addresses that are equal modulo the number of banks), those accesses are serialized and cannot be executed concurrently. But that’s a pretty advanced optimization.
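To give a rough idea (just a sketch, assuming CUDAnative’s @cuStaticSharedMem and sync_threads and a single 32-thread block; kernel_padded and d_out are made-up names): with a plain 32x32 Float32 tile in shared memory, threads accessing tile[j,threadIdx().x] for a fixed j hit word addresses exactly 32 apart, which all map to the same bank and serialize into a 32-way conflict. Padding the first dimension to 33 shifts consecutive threads into different banks:

function kernel_padded(out)
    tid = threadIdx().x
    # 33 rows instead of 32: the extra row only exists to push the
    # column stride off a multiple of the bank count
    tile = @cuStaticSharedMem(Float32, (33, 32))
    for j in 1:32
        # fixed j, varying tid: stride-33 word accesses, one bank per thread
        tile[j,tid] = 42f0
    end
    sync_threads()
    for j in 1:32
        # conflict-free shared load, coalesced global store
        out[tid,j] = tile[j,tid]
    end
    return
end

Launched as @cuda (1,32) kernel_padded(d_out) with d_out a 32x32 CuArray{Float32}; drop the padding (a plain (32, 32) tile) and every iteration of the first loop turns into a 32-way bank conflict.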