Performance of array of structs

PharmCat · March 4, 2024, 8:26pm

I compared the array-of-structs and struct-of-arrays layouts, and struct-of-arrays seems to be faster.

using CUDA, BenchmarkTools

struct MyStr{T}
    Δxˣ::T
    Δxʸ::T
    Δvˣ::T
    Δvʸ::T
    ρᵢ::T
    ρⱼ::T
end

function pairs_calk!(buff, pairs; minthreads::Int = 1024) 
    gpukernel = @cuda launch=false kernel_pairs_calk!(buff, pairs) 
    config = launch_configuration(gpukernel.fun)
    Nx = length(pairs)
    maxThreads = config.threads
    Tx  = min(minthreads, maxThreads, Nx)
    Bx  = cld(Nx, Tx)
    CUDA.@sync gpukernel(buff, pairs; threads = Tx, blocks = Bx)
end
function kernel_pairs_calk!(buff, pairs) 
    index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if index <= length(pairs)
        buff[1][index] = (1.2, 2.3)
        buff[2][index] = (4.5, 6.7)
        buff[3][index] = 0.1
        buff[4][index] = 0.7
    end
    return nothing
end
function pairs_calk2!(buff, pairs; minthreads::Int = 1024) 
    gpukernel = @cuda launch=false maxregs=64 kernel_pairs_calk2!(buff, pairs) 
    config = launch_configuration(gpukernel.fun)
    Nx = length(pairs)
    maxThreads = config.threads
    Tx  = min(minthreads, maxThreads, Nx)
    Bx  = cld(Nx, Tx)
    CUDA.@sync gpukernel(buff, pairs; threads = Tx, blocks = Bx)
end
function kernel_pairs_calk2!(buff, pairs) 
    index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if index <= length(buff)
        buff[index] = MyStr{Float64}(0.1, 0.2, 0.3, 0.4, 0.9, 1.2)
    end
    return nothing
end

PN = 10000
buff = (
    CUDA.fill((zero(Float64), zero(Float64)), PN),
    CUDA.fill((zero(Float64), zero(Float64)), PN),
    CUDA.zeros(Float64, PN),
    CUDA.zeros(Float64, PN),
)
pairs = CUDA.zeros(PN)
@benchmark pairs_calk!($buff, $pairs; minthreads = 1024)
# ~ 20.639 μs

buff2 = CuArray{MyStr{Float64}}(undef, PN)

@benchmark pairs_calk2!($buff2, $pairs; minthreads = 1024)
# ~ 25.068 μs

Is this expected behavior?

Zentrik · March 4, 2024, 9:17pm

There’s no reason why they would have the same performance: they have different memory layouts, which means one is typically faster than the other. In this case it’s not obvious to me why one is faster, but I suspect that because your struct is 6 elements long, writing it doesn’t vectorize as well.

EDIT: Given you’re using Float64s, I think my suspicion is probably wrong.

PharmCat · March 5, 2024, 6:27am

If I use 8 vectors of Float32 versus 1 vector of tuples of 8 Float32, I also see a difference. So, can it be taken as a rule to use struct-of-arrays instead of array-of-structs?

Zentrik · March 5, 2024, 12:31pm

I don’t think there’s any rule about which to use. You need to think about your memory access pattern and choose whichever layout is more appropriate in terms of vectorization and coalescing.

Given you’re writing to every field in the struct, I wouldn’t have expected a difference, but it’s hard to say without looking at a profile or the native code.
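To make the coalescing point concrete, here is a minimal CPU-only sketch of where consecutive threads would write in each layout. `Pair6`, `aos_offset` and `soa_offset` are hypothetical names for illustration (the struct mirrors the six-field `MyStr` from the first post with ASCII field names), and this only computes byte offsets; it does not measure anything on a GPU.

```julia
# Hypothetical stand-in for the six-field struct in the original post.
struct Pair6{T}
    dxx::T
    dxy::T
    dvx::T
    dvy::T
    rho_i::T
    rho_j::T
end

# Array-of-structs: byte offset of field number `f` in element `i` (1-based).
aos_offset(::Type{S}, i, f) where {S} = (i - 1) * sizeof(S) + Int(fieldoffset(S, f))

# Struct-of-arrays: each field lives in its own dense array,
# so element `i` of field `f` sits at a multiple of that field's size.
soa_offset(::Type{S}, i, f) where {S} = (i - 1) * sizeof(fieldtype(S, f))

S = Pair6{Float64}

# Distance in bytes between the locations that thread 1 and thread 2
# write when both store the first field.
aos_stride = aos_offset(S, 2, 1) - aos_offset(S, 1, 1)
soa_stride = soa_offset(S, 2, 1) - soa_offset(S, 1, 1)

println("AoS stride: ", aos_stride, " bytes")
println("SoA stride: ", soa_stride, " bytes")
```

With `Float64` fields the array-of-structs stride is the full struct size, while the struct-of-arrays stride is a single `Float64`, so neighbouring threads touch adjacent addresses only in the second case for any one store instruction. Whether that explains the timing difference above would still need a profile to confirm.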

Zentrik · March 5, 2024, 12:34pm

I suspect that if you mainly read or write only a subset of your fields, a struct-of-arrays layout would be better.
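As a rough illustration of the subset-of-fields case, the following CPU-only sketch updates a single field in both layouts. `Pair6` is a hypothetical stand-in for the struct in the first post, and the named tuple of vectors is just one possible way to hold a struct-of-arrays; nothing here is benchmarked.

```julia
# Hypothetical stand-in for the six-field struct in the original post.
struct Pair6{T}
    dxx::T
    dxy::T
    dvx::T
    dvy::T
    rho_i::T
    rho_j::T
end

n = 4

# Array-of-structs: one vector whose elements hold all six fields.
aos = [Pair6{Float64}(0.0, 0.0, 0.0, 0.0, 0.0, 0.0) for _ in 1:n]

# Struct-of-arrays: one vector per field, grouped in a named tuple.
soa = (dxx = zeros(n), dxy = zeros(n), dvx = zeros(n),
       dvy = zeros(n), rho_i = zeros(n), rho_j = zeros(n))

# AoS: the struct is immutable, so changing rho_i means building a new element.
for i in eachindex(aos)
    p = aos[i]
    aos[i] = Pair6(p.dxx, p.dxy, p.dvx, p.dvy, 1.0, p.rho_j)
end

# SoA: only the rho_i vector is touched.
soa.rho_i .= 1.0

# Memory span that contains the updated values in each layout.
aos_span = n * sizeof(Pair6{Float64})  # values are interleaved with the other fields
soa_span = n * sizeof(Float64)         # values are contiguous

println("AoS span: ", aos_span, " bytes, SoA span: ", soa_span, " bytes")
```

In the array-of-structs case the updated values are spread over the whole array, while in the struct-of-arrays case they sit in one contiguous vector, which is the situation where the second layout tends to help.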
