Array of structures vs. Structures of arrays

Hi all,

I am taking my first steps with CUDA.jl, and I have a question about memory layout. I have a simple immutable struct, for example:

using StaticArrays

struct Angles
  x::SVector{3, Float64}
end

I have operations like

function add!(r::Angles, x::Angles, y::Angles)
  for k in 1:3
    r.x[k] = x.x[k] + y.x[k]
  end
  return nothing
end

function myexp!(r::Angles, x::Angles)
  r.x[1] = sin(x.x[1])
  r.x[2] = cos(x.x[2])
  r.x[3] = exp(x.x[3])
  return nothing
end

And finally I would like to perform these operations in parallel on the GPU:

r_d = CuArray{Angles}(undef, 1000)

function kernel_operation!(r, x, y)
  index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  stride = blockDim().x * gridDim().x

  tmp = Angles(SA[0.0, 0.0, 0.0])
  for i = index:stride:length(y)
    myexp!(tmp, x[i])
    add!(r[i], tmp, y[i])
  end
  return nothing
end

What worries me (I am new to CUDA.jl) is that this would be very inefficient in, for example, C: you want the dimension of the array you parallelize over to be contiguous in memory, whereas this layout makes the components of each Angles contiguous in memory instead. I would also like to make sure that the routines myexp! and add! are inlined inside the loop.

What are the canonical ways to approach these problems in CUDA.jl? Is there an easy way to create a structure of arrays while still being able to define these elementary add! and myexp! routines and have them inlined?

Thanks

Hi, you can have a look at:


StructArrays should work for this. The main caveat is that SVector has a nested layout (it has a Tuple inside that actually holds the elements), so you should tell StructArrays to “unwrap” that:

using StructArrays, StaticArrays
x = StructArray([SA[1, 2, 3], SA[4, 5, 6]], unwrap = t -> t <: Tuple)

Otherwise, you could just work with Tuples instead of SVectors, that is

using StructArrays
x = StructArray([(1, 2, 3), (4, 5, 6)])

Then, StructArrays.replace_storage(CuArray, x) should map it to a StructArray of CUDA arrays, which you should be able to use in a CUDA kernel.
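
For instance, a minimal sketch of the round trip (the variable names are just illustrative, and I haven't benchmarked this):

using StructArrays, CUDA

# host-side struct of arrays: one Vector per tuple component
x = StructArray([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])

# swap the Vector storage for CuArrays; each component now lives on the GPU
x_d = StructArrays.replace_storage(CuArray, x)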


Many thanks!

I would like to use Tuples, but the problem is that, being immutable, I would not be able to update their values (as in, for example, the add! or myexp! functions above). Is this right?

You may run some performance experiments, but I expect that returning “new” immutable Angles will not cause any trouble, because it only involves the stack (or even just registers).
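
For instance, here is a sketch of non-mutating variants of the original routines (the names add and myexp without the bang are mine, and I'm assuming the Angles struct from the first post); whether this really stays allocation-free is easy to check with @allocated or @btime:

using StaticArrays

# return a fresh Angles instead of mutating one in place
add(x::Angles, y::Angles) = Angles(x.x + y.x)
myexp(x::Angles) = Angles(SA[sin(x.x[1]), cos(x.x[2]), exp(x.x[3])])

# inside a kernel loop this becomes e.g.  r[i] = add(myexp(x[i]), y[i])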


SVector is also immutable, so that wouldn't be affected. Other than the comment above (the Julia compiler is quite good with immutable structs), it's important to note that, given the particular “struct of arrays” layout, you can cheat and modify tuples. For example:

julia> using StructArrays

julia> x = StructArray([(1, 2, 3), (4, 5, 6)])
2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}) with eltype Tuple{Int64, Int64, Int64}:
 (1, 2, 3)
 (4, 5, 6)

julia> StructArrays.component(x, 1) # get first column
2-element Vector{Int64}:
 1
 4

julia> StructArrays.component(x, 1)[2] = 0 # change first entry of second row
0

julia> x
2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}) with eltype Tuple{Int64, Int64, Int64}:
 (1, 2, 3)
 (0, 5, 6)

julia> x[2][1] = 0 # this instead would error as `Tuples` are immutable
ERROR: MethodError: no method matching setindex!(::Tuple{Int64, Int64, Int64}, ::Int64, ::Int64)
Stacktrace:
 [1] top-level scope
   @ REPL[9]:1

Hi Laurent, piever

Many thanks. I only have a couple of questions regarding your comments.

  1. About the solution proposed by Laurent: can I assume that immutable types will not cause extra allocations when running on the GPU? (I have tested this, and on the CPU it works fine.)

  2. About the solution proposed by piever: would this trick/cheat also perform well on the GPU?

I think @maleadt can answer more precisely, but I believe he wrote about this here:


I haven’t used StructArrays directly in CUDA kernels much, so I don’t really have a strong intuition here, but in principle if CUDA can figure out that it should turn the components of the StructArray into CuDeviceArrays, this trick should be pretty efficient.
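
An untested sketch of what I have in mind (assuming the array was converted with StructArrays.replace_storage(CuArray, ...) on the host before launching):

using StructArrays, CUDA

function scale_x!(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        # on the device this should be a CuDeviceArray holding the first field
        xs = StructArrays.component(a, 1)
        xs[i] = 2 * xs[i]
    end
    return nothing
end

# e.g. @cuda threads=256 blocks=cld(length(a_d), 256) scale_x!(a_d)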

Hi again,

Many thanks for the suggestions. StructArrays certainly sounds very interesting, but I cannot manage to make it work in CUDA kernels. A simple example:

using StructArrays, CUDA, BenchmarkTools

struct pt
    x::Float64
    y::Float64
end

function initpt!(a)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i in index:stride:length(a)
        a[i] = pt(1.0, 2.0)
    end
    return nothing
end

ndata = 10240
thdx  = 256
numblocks = ceil(Int, ndata/thdx)

println("Starting TEST: Memory allocation and transfer")

cua1 = StructArray{pt}((zeros(ndata), zeros(ndata)))
replace_storage(CuArray, cua1)
println("  GPU initialization")
@btime @cuda threads=thdx blocks=numblocks initpt!(cua1)

Fails with the error message:

ERROR: LoadError: GPU compilation of kernel initpt!(StructArray{pt,1,NamedTuple{(:x, :y),Tuple{Array{Float64,1},Array{Float64,1}}},Int64}) failed
KernelError: passing and using non-bitstype argument

Argument 2 to your kernel function is of type StructArray{pt,1,NamedTuple{(:x, :y),Tuple{Array{Float64,1},Array{Float64,1}}},Int64}, which is not isbits:
  .components is of type NamedTuple{(:x, :y),Tuple{Array{Float64,1},Array{Float64,1}}} which is not isbits.
    .x is of type Array{Float64,1} which is not isbits.
    .y is of type Array{Float64,1} which is not isbits.

I am a bit puzzled, because the example is basically a copy of the one provided on the StructArrays GitHub page, except that I pack the values of x and y into an immutable struct of known length.

Does anyone have concrete CUDA.jl examples that might be relevant to this case?

Thanks

You need to assign the result of replace_storage. Also, always use @btime CUDA.@sync ... when benchmarking, so that you time the kernel itself and not just the asynchronous launch.
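
In other words, something like this sketch (reusing the definitions from the post above):

cua1  = StructArray{pt}((zeros(ndata), zeros(ndata)))
gpu_a = StructArrays.replace_storage(CuArray, cua1)   # keep the GPU-backed array that is returned

println("  GPU initialization")
@btime CUDA.@sync @cuda threads=thdx blocks=numblocks initpt!($gpu_a)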