Array of structures vs. Structures of arrays

shiroghost · March 11, 2021, 9:33am

Hi all,

I am trying my first steps with CUDA.jl and I have a question about memory layout. I have a simple unmutable structure, for example:

struct Angles
  x::SVector{3, Float64}
end

I have operations like

function add!(r::Angles, x::Angles, y:.Angles)
  for k in 1:3
    r.x[i] = x.x[i] + y.x[i]
  end
  return nothing
end

function myexp!(r::Angles, x::Angles)
  r.x[1] = sin(x.x[1])
  r.x[2] = cos(x.x[2])
  r.x[3] = exp(x.x[3])
  return nothing
end

And finally I would like to perform these operations in parallel on the GPU:

r_d = CuArray{Angles,1}(1000)

function kernel_operation!(r, x, y)
  index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  stride = blockDim().x * gridDim().x
  
  tmp = Angles(0.0, 0.0, 0.0)
  for i = index:stride:length(y)
        myexp!(tmp, x[i])
        add!(r, tmp, y [i])
    end

What worries me (I am new to CUDA.jl) is that this would be very innefficient in, for example, C. You want the dimension of the array you paralelize to be contiguos in memory, while this layout produces the components of Angles contiguous in memory. Also I would like to make sure that the routines myexp! and add! are inlined inside the loop.

What are the canonical ways to approach these problems in CUDA.jl? Are there easy ways to create an Structure of arrays, but having the possibility to define these elementary add! and myexp! routines and having them inlined?

Thanks

LaurentPlagne · March 11, 2021, 9:40am

Hi, you can have a look at :

piever · March 11, 2021, 10:41am

StructArrays should work for this. The main caveat is that SVector has a nested layout (it has a Tuple inside that actually holds the elements), so you should tell StructArrays to “unwrap” that:

using StructArrays, StaticArrays
x = StructArray([SA[1, 2, 3], SA[4, 5, 6]], unwrap = t -> t <: Tuple)

Otherwise, you could just work with Tuples instead of SVectors, that is

using StructArrays
x = StructArray([(1, 2, 3), (4, 5, 6)])

Then, StructArrays.replace_storage(CuArray, x) should map it to a StructArray of CUDA arrays, which you should be able to use in a CUDA kernel.

shiroghost · March 11, 2021, 12:13pm

Many thanks!

I would like to use Tuples, but the problem is that being inmutable I would not be able to update its values (like for example, in the add! or myexp! functions above). Is this right?

LaurentPlagne · March 11, 2021, 12:18pm

You may perform some performance experiments but I guess that returning “new” immutable angles will not cause any trouble because it will only imply stack (or even register) business.

piever · March 11, 2021, 12:46pm

SVector is also immutable, so that wouldn’t be affected. Other than the comment above though (the julia compiler is quite good with immutable struct) it’s important to note that, given the particular “struct of arrays” layout, you can cheat and modify tuples. For example:

julia> using StructArrays

julia> x = StructArray([(1, 2, 3), (4, 5, 6)])
2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}) with eltype Tuple{Int64, Int64, Int64}:
 (1, 2, 3)
 (4, 5, 6)

julia> StructArrays.component(x, 1) # get first column
2-element Vector{Int64}:
 1
 4

julia> StructArrays.component(x, 1)[2] = 0 # change first entry of second row
0

julia> x
2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}) with eltype Tuple{Int64, Int64, Int64}:
 (1, 2, 3)
 (0, 5, 6)

julia> x[2][1] = 0 # this instead would error as `Tuples` are immutable
ERROR: MethodError: no method matching setindex!(::Tuple{Int64, Int64, Int64}, ::Int64, ::Int64)
Stacktrace:
 [1] top-level scope
   @ REPL[9]:1

shiroghost · March 11, 2021, 2:06pm

Hi Laurent, piever

Many thanks. I only have a couple of questions regarding your comments.

About the solution proposed by Laurent: Can I assume that immutables types will not cause extra allocations when running on the GPU. (I have tested and on the CPU this works fine).
About the solution proposed by piever: Would this trick/cheat perform well on the GPU?

LaurentPlagne · March 11, 2021, 2:16pm

I think that @maleadt can answer more precisely but I think that it wrote this here:

piever · March 11, 2021, 3:02pm

I haven’t used StructArrays directly in CUDA kernels much, so I don’t really have a strong intuition here, but in principle if CUDA can figure out that it should turn the components of the StructArray into CuDeviceArrays, this trick should be pretty efficient.

shiroghost · March 23, 2021, 9:50am

Hi again,

Many thanks for the suggestions. Certainly StructArrays sounds ver interesting, but I cannot manage to make it work in CUDA kernels. A simple example:

using StructArrays, CUDA, BenchmarkTools

struct pt
    x::Float64
    y::Float64
end

function initpt!(a)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i in index:stride:length(a)
        a[i] = pt(1.0, 2.0)
    end
    return nothing
end

ndata = 10240
thdx  = 256
numblocks = ceil(Int, ndata/thdx)

println("Starting TEST: Memory allocation and transfer")

cua1 = StructArray{pt}((zeros(ndata), zeros(ndata)))
replace_storage(CuArray, cua1)
println("  GPU initialization")
@btime @cuda threads=thdx blocks=numblocks initpt!(cua1)

Fails with the error message:

ERROR: LoadError: GPU compilation of kernel initpt!(StructArray{pt,1,NamedTuple{(:x, :y),Tuple{Array{Float64,1},Array{Float64,1}}},Int64}) failed
KernelError: passing and using non-bitstype argument

Argument 2 to your kernel function is of type StructArray{pt,1,NamedTuple{(:x, :y),Tuple{Array{Float64,1},Array{Float64,1}}},Int64}, which is not isbits:
  .components is of type NamedTuple{(:x, :y),Tuple{Array{Float64,1},Array{Float64,1}}} which is not isbits.
    .x is of type Array{Float64,1} which is not isbits.
    .y is of type Array{Float64,1} which is not isbits.

I am a bit puzzled because the example is basically a copy of the example provided in the StructArrays github page, except that I pack the values of x and y in an immutable struct that has a known length.

Does anyone have concrete examples of CUDA.jl that might be relevant for this case?

Thanks

maleadt · March 23, 2021, 12:38pm

You need to assign the result of replace_storage. Furthermore, also always use @btime CUDA.@sync ....

Topic		Replies	Views
New to Julia and Formatting StructArrays for GPU use with mutable scalars GPU staticarrays , structarrays	2	610	February 13, 2023
How to properly pass structs into GPU? (MWE included) GPU	6	1343	January 29, 2023
Best way to store data for GPU use GPU	7	1684	February 16, 2019
CUDA.@cuStaticSharedMem returning a Structure of Arrays GPU	0	330	July 26, 2021
Passing mutable struct to kernel GPU	7	2128	January 29, 2023

Array of structures vs. Structures of arrays

Related topics