Performance of array of structs

I compared array-of-structs and struct-of-arrays, and struct-of-arrays seems to be faster.

using CUDA, BenchmarkTools

struct MyStr{T}
    Δxˣ::T
    Δxʸ::T
    Δvˣ::T
    Δvʸ::T
    ρᵢ::T
    ρⱼ::T
end

function pairs_calk!(buff, pairs; minthreads::Int = 1024) 
    gpukernel = @cuda launch=false kernel_pairs_calk!(buff, pairs) 
    config = launch_configuration(gpukernel.fun)
    Nx = length(pairs)
    maxThreads = config.threads
    Tx  = min(minthreads, maxThreads, Nx)
    Bx  = cld(Nx, Tx)
    CUDA.@sync gpukernel(buff, pairs; threads = Tx, blocks = Bx)
end
function kernel_pairs_calk!(buff, pairs) 
    index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if index <= length(pairs)
        buff[1][index] = (1.2, 2.3)
        buff[2][index] = (4.5, 6.7)
        buff[3][index] = 0.1
        buff[4][index] = 0.7
    end
    return nothing
end
function pairs_calk2!(buff, pairs; minthreads::Int = 1024) 
    gpukernel = @cuda launch=false maxregs=64 kernel_pairs_calk2!(buff, pairs) 
    config = launch_configuration(gpukernel.fun)
    Nx = length(pairs)
    maxThreads = config.threads
    Tx  = min(minthreads, maxThreads, Nx)
    Bx  = cld(Nx, Tx)
    CUDA.@sync gpukernel(buff, pairs; threads = Tx, blocks = Bx)
end
function kernel_pairs_calk2!(buff, pairs) 
    index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if index <= length(buff)
        buff[index] = MyStr{Float64}(0.1, 0.2, 0.3, 0.4, 0.9, 1.2)
    end
    return nothing
end
PN = 10000
buff = (CUDA.fill((zero(Float64), zero(Float64)), PN), CUDA.fill((zero(Float64), zero(Float64)), PN), CUDA.zeros(Float64, PN), CUDA.zeros(Float64, PN))
pairs = CUDA.zeros(PN )
@benchmark pairs_calk!($buff,$pairs ; minthreads = 1024)
#  ~ 20.639 μs

buff2 = CuArray{MyStr{Float64}}(undef, PN)

@benchmark pairs_calk2!($buff2,$pairs ; minthreads = 1024)
 # ~ 25.068 μs

Is this expected behavior?

There’s no reason why they would have the same performance. They have different memory layouts, which means one is typically faster than the other. In this case it’s not obvious to me why one is faster, but I suspect that because your struct is 6 elements long, writing it doesn’t vectorize as well.

EDIT: Given you’re using Float64s, I think my suspicion is probably wrong.

If I use 8 vectors of Float32 versus 1 vector of tuples of 8 Float32, I also see a difference. So can it be taken as a rule to use struct-of-arrays instead of array-of-structs?

I don’t think there’s any rule about which to use. You need to think about your memory access pattern and choose whichever is more appropriate in terms of vectorization and coalescing.
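To make the coalescing point concrete, here is a minimal CPU-only sketch of where consecutive threads would land in memory when each writes the first field of its element. The helpers `aos_offset` and `soa_offset` are hypothetical names introduced for illustration, and the sketch only computes byte offsets; it does not measure GPU performance.

```julia
# Same six-field struct as in the question.
struct MyStr{T}
    Δxˣ::T
    Δxʸ::T
    Δvˣ::T
    Δvʸ::T
    ρᵢ::T
    ρⱼ::T
end

# Hypothetical helper: byte offset of field number `f` of element `i` (1-based)
# when the structs are stored back to back (array-of-structs).
aos_offset(::Type{S}, i, f) where {S} = Int((i - 1) * sizeof(S) + fieldoffset(S, f))

# Hypothetical helper: byte offset of element `i` inside the separate array
# that holds a single field of type `T` (struct-of-arrays).
soa_offset(::Type{T}, i) where {T} = (i - 1) * sizeof(T)

S = MyStr{Float64}
aos = [aos_offset(S, i, 1) for i in 1:4]      # first field, threads 1..4
soa = [soa_offset(Float64, i) for i in 1:4]   # same field, stored as its own array

println("AoS offsets: ", aos)   # consecutive threads are sizeof(S) bytes apart
println("SoA offsets: ", soa)   # consecutive threads are sizeof(Float64) bytes apart
```

With array-of-structs, neighbouring threads touch addresses one whole struct apart for any given field, while with struct-of-arrays they touch adjacent values. Whether that shows up in timings depends on whether the kernel touches every field, as it does in the benchmark above.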

Given you’re writing to every field in the struct, I wouldn’t have expected a difference, but it’s hard to say without looking at a profile or the native code.

I suspect that if you mainly read or write only a subset of your fields, a struct-of-arrays layout would be better.
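As a rough illustration of the subset-of-fields case, the sketch below reduces over only `ρᵢ` in both layouts on the CPU and compares the size of the memory region each version has to walk. The NamedTuple-of-vectors `soa` is just one assumed way to express struct-of-arrays, and the byte counts are layout sizes, not measured memory traffic; real GPU traffic also depends on caching and transaction granularity.

```julia
# Same six-field struct as in the question.
struct MyStr{T}
    Δxˣ::T
    Δxʸ::T
    Δvˣ::T
    Δvʸ::T
    ρᵢ::T
    ρⱼ::T
end

n = 1_000

# Array-of-structs: every element carries all six fields.
aos = [MyStr{Float64}(0.1, 0.2, 0.3, 0.4, Float64(i), 1.2) for i in 1:n]

# Struct-of-arrays (assumed representation): one plain vector per field.
# Only two fields are materialised here to keep the sketch short.
soa = (ρᵢ = [s.ρᵢ for s in aos], ρⱼ = [s.ρⱼ for s in aos])

# Reduce over a single field in each layout.
sum_aos = sum(s -> s.ρᵢ, aos)   # strides through whole structs
sum_soa = sum(soa.ρᵢ)           # walks one dense vector

span_aos = sizeof(aos)      # bytes spanned by the struct array
span_soa = sizeof(soa.ρᵢ)   # bytes spanned by the single field array

println("sums equal: ", sum_aos == sum_soa)
println("bytes spanned, AoS: ", span_aos, "  SoA: ", span_soa)
```

Both versions compute the same sum, but the array-of-structs version has to stride across six times as many bytes to reach the one field it needs.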
