New to Julia: formatting StructArrays for GPU use with mutable scalars

First, I apologize if this long, rambling post is not in keeping with expectations here, and I would like to preemptively thank anyone with the patience to read part or all of it and provide any feedback or guidance.

Background:
I am working on a particle code using the Material Point Method (MPM) and have been trying to lay out how I might approach this problem in Julia. I am new to both Julia and GPU computing, but I want to use Julia specifically because, in my mind, it should offer the lowest barrier to entry for GPU computing. Ideally I also want to lay out my solver as kernels that make sense on both CPU and GPU, to avoid duplicated work. I am aware of ParallelStencil.jl and ImplicitGlobalGrid.jl, but it is not obvious to me how I might adapt them for use with MPM. (As an aside, if anyone has input on how this might be done, that would be very helpful.)
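To make the "same kernel on CPU and GPU" idea concrete, here is the kind of thing I have in mind, sketched with KernelAbstractions.jl. I have only skimmed that package, so treat the API details here (get_backend, the backend-then-workgroup launch syntax) as my assumptions rather than a working recipe:

using KernelAbstractions

# One kernel definition, launched on whichever backend the array lives on.
@kernel function scale_mass!(mass, factor)
    i = @index(Global)
    @inbounds mass[i] *= factor
end

function scale!(mass, factor)
    backend = KernelAbstractions.get_backend(mass)   # CPU() for Array, CUDABackend() for CuArray
    scale_mass!(backend, 64)(mass, factor; ndrange = length(mass))
    KernelAbstractions.synchronize(backend)
    return mass
end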

For those of you not familiar with MPM [1][2], you have both a background grid of nodes (generally Cartesian) and a host of discrete particles which advect within the domain and across the grid. Both store information which is exchanged back and forth. There have been a number of recent advances where skilled hands have tailored the method to GPU computing with great success [1][2][3].
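To make that data exchange concrete, the core access pattern looks roughly like the sketch below: a deliberately stripped-down 1D particle-to-grid mass transfer with linear weights (not real MPM shape functions), just to show how particle data scatters onto grid nodes:

# Simplified 1D particle-to-grid (P2G) mass scatter with linear weights.
# xp = particle positions, mp = particle masses, dx = grid spacing.
function p2g_mass!(grid_mass, xp, mp, dx)
    fill!(grid_mass, zero(eltype(grid_mass)))
    for p in eachindex(xp)
        i = clamp(floor(Int, xp[p] / dx) + 1, 1, length(grid_mass) - 1)  # left node of the particle's cell
        w = xp[p] / dx - (i - 1)                                         # fractional offset within the cell
        grid_mass[i]     += (1 - w) * mp[p]
        grid_mass[i + 1] += w * mp[p]
    end
    return grid_mass
end

The grid-to-particle (G2P) step uses the same stencil, just run in the gather direction.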

Ideally I would like to replicate (if only to a small degree) this success myself and enable large simulations (1 million+ particles, sparse grids of many millions of cells) on the single GPU of my laptop. I know this should be feasible, as Taichi [1][2] is doing exactly that as an extension to Python using C.

I would use Taichi, but it is a fairly esoteric language, and it was not obvious to me that it could handle streaming onto and off of the GPU easily in a way that would allow me to escape the memory limitations of a single GPU. Having said that, both the mpm128 and euler solvers written in Taichi are similar to what I am trying to accomplish, though I want a broader range of physics-based constitutive models and equations of state. Additionally, I might want to do a dual grid or other more complicated setup and prefer the flexibility of rolling my own for this purpose. As an aside, if you have not played with Taichi in Python, you totally should.

My objectives:

  1. establish an Array of Structs of Arrays (AoSoA) data container layout in support of the domain partitioning required to achieve both efficient streaming and efficient stencil operations without overflowing the limited GPU memory available. This will be used to represent the grid, the particles, or a subset of either. (Really I am just working on the SoA part right now; a rough sketch of what I mean by AoSoA follows this list.)
  2. ideally maintain mutability of struct field values both when viewed as arrays and when viewed as structs (more on this below; I am probably asking for the impossible/impractical)
  3. should have a minimal memory footprint on either CPU or GPU
  4. ideally be fast to index into, either at random or by strides, and fast to transfer from CPU to GPU and back.
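To make objective 1 a bit more concrete, here is roughly the shape I have in mind for AoSoA. Block and Domain are placeholder names I am making up for this post, not types from any package:

# Hypothetical AoSoA layout: the domain is partitioned into fixed-size blocks,
# and each block stores its nodes (or particles) as a Struct-of-Arrays container.
struct Block{S}
    origin::NTuple{3,Int}   # where this block sits in the global grid
    nodes::S                # e.g. a StructArray{Node_scalar{Float32}} over the block's nodes
end

struct Domain{S}
    blocks::Vector{Block{S}}   # outer Array over inner Structs-of-Arrays
end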

What I have tried:

  • Using various forms of StructArrays and structs

MWE below:

using StructArrays
using StaticArrays
using CUDA
using BenchmarkTools

#   ns = Node_scalar{Float32}(1,2.,ones(3),zeros(3,3))
struct Node_scalar{T}
    active::Bool
    mass::T
    velocity::SVector{3,T}
    affine::SMatrix{3,3,T,9}
end

#   nr = Node_ref{Float32}(1,2.,ones(3),zeros(3,3))
struct Node_ref{T}
    active::Ref{Bool}
    mass::Ref{T}
    velocity::SVector{3,T}
    affine::SMatrix{3,3,T,9}
end

# Requires a different call signature to construct
#   n0 = Node_0d{Float32}([true],[2.],ones(3),zeros(3,3))
#   i.e. the length-1 SVector fields must be passed as length-1 AbstractArrays to convert
struct Node_0d{T}
    active::SVector{1,Bool}
    mass::SVector{1,T}
    velocity::SVector{3,T}
    affine::SMatrix{3,3,T,9}
end

# Do the scalar fields have direct mutability?
#   Node_scalar     false       ERROR: setfield!: immutable struct
#   Node_ref        true        nr.mass[] = 50
#   Node_0d         false
#                    but tst[1]=2, tst[]=2, tst.=2 would work if the field were a SizedVector or MVector

# Are they isbits types, to allow easy conversion to a GPU CuArray?
isbitstype(Node_scalar{Float32})    #   true    (only if used with SMatrix)
isbitstype(Node_ref{Float32})       #   false
isbitstype(Node_0d{Float32})        #   true    (only if static SVector or SMatrix types)


@benchmark Node_scalar{Float32}($1,$2.,$ones(3),$zeros(3,3))

# BenchmarkTools.Trial: 10000 samples with 982 evaluations.
#  Range (min … max):  63.136 ns … 866.904 ns  ┊ GC (min … max): 0.00% … 88.49%
#  Time  (median):     66.599 ns               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   73.215 ns ±  41.516 ns  ┊ GC (mean ± σ):  4.03% ±  6.61%
#
#   ▃██▆▅▃▂▁                                                     ▁
#   ████████▇▇▆▆▇▇▇▇▆▅▅▆▅▆▆▆▆▇▄▆▅▆▆▆▆▇▆▆▇██▇▇▅▅▅▄▅▄▅▅▄▅▅▅▅▃▄▄▄▄▄ █
#   63.1 ns       Histogram: log(frequency) by time       153 ns <
#
#  Memory estimate: 272 bytes, allocs estimate: 2.

@benchmark Node_ref{Float32}($1,$2.,$ones(3),$zeros(3,3))

# BenchmarkTools.Trial: 10000 samples with 978 evaluations.
#  Range (min … max):  66.871 ns …  1.266 μs  ┊ GC (min … max): 0.00% … 86.18%
#  Time  (median):     72.393 ns              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   80.020 ns ± 58.395 ns  ┊ GC (mean ± σ):  5.59% ±  7.15%
#
#   ▂▆██▇▆▄▂                                                    ▂
#   █████████▇▆▅▅▆▇▆▅▅▅▄▅▆▆▆▄▅▆▄▆▅▆▆▅▆▆▆▆▇▇▇▇▇▇▇▇▆▅▅▅▄▅▄▄▄▅▅▄▄▄ █
#   66.9 ns      Histogram: log(frequency) by time       159 ns <
#
#  Memory estimate: 304 bytes, allocs estimate: 4.


@benchmark Node_0d{Float32}($[1],$[2.],$ones(3),$zeros(3,3))

# BenchmarkTools.Trial: 10000 samples with 985 evaluations.
#  Range (min … max):  61.421 ns … 704.975 ns  ┊ GC (min … max): 0.00% … 80.20%
#  Time  (median):     65.381 ns               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   70.463 ns ±  37.433 ns  ┊ GC (mean ± σ):  3.82% ±  6.57%
#
#    ▅█▇▅▃▂▁                                                     ▁
#   ▇████████▅▅▄▄▅▆▆▆▆▅▅▄▅▄▅▅▅▅▅▄▅▅▅▅▅▆▅▄▅▅▆▅▅▇▇▇▆▅▄▅▅▄▅▂▅▅▄▃▄▅▄ █
#   61.4 ns       Histogram: log(frequency) by time       140 ns <
#
#  Memory estimate: 272 bytes, allocs estimate: 2.


function get_ns() return Node_scalar{Float32}(1,2.,ones(3),zeros(3,3)) end
function get_nr() return Node_ref{Float32}(1,2.,ones(3),zeros(3,3)) end
function get_n0() return Node_0d{Float32}([1],[2.],ones(3),zeros(3,3)) end


ns = get_ns()
nr = get_nr()
n0 = get_n0()

@benchmark ns.mass[]

# BenchmarkTools.Trial: 10000 samples with 984 evaluations.
#  Range (min … max):  56.301 ns … 771.037 ns  ┊ GC (min … max): 0.00% … 91.71%
#  Time  (median):     58.435 ns               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   62.244 ns ±  28.068 ns  ┊ GC (mean ± σ):  1.72% ±  3.74%
#
#   ▅▇█▄▂▁            ▂▂                                         ▁
#   █████████▇▆▆▅▅▆█▇████▇▅▅▆▅▄▃▅▅▅▄▄▄▅▆▆▆▅▆▅▅▆▇▆▆▅▅▄▅▄▄▃▃▄▅▅▄▄▄ █
#   56.3 ns       Histogram: log(frequency) by time       114 ns <
#
#  Memory estimate: 80 bytes, allocs estimate: 2.

 @benchmark nr.mass[]

#  BenchmarkTools.Trial: 10000 samples with 986 evaluations.
#  Range (min … max):  54.868 ns …  1.509 μs  ┊ GC (min … max): 0.00% … 95.62%
#  Time  (median):     57.505 ns              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   61.962 ns ± 60.731 ns  ┊ GC (mean ± σ):  4.28% ±  4.23%
#
#   ▃▆██▅▄▂▁                 ▁▂▁                                ▂
#   █████████▆▆▆▆▆▅▆▆▅▆▄▆▆▆▅█████▇▆▆▄▆▄▅▄▅▆▅▅▆▆▄▃▄▅▅▅▆▅▅▄▅▅▅▁▄▅ █
#   54.9 ns      Histogram: log(frequency) by time       104 ns <

#  Memory estimate: 96 bytes, allocs estimate: 2.

 @benchmark n0.mass[]

#  BenchmarkTools.Trial: 10000 samples with 980 evaluations.
#  Range (min … max):  65.000 ns …  1.376 μs  ┊ GC (min … max): 0.00% … 85.55%
#  Time  (median):     68.469 ns              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   72.976 ns ± 32.758 ns  ┊ GC (mean ± σ):  1.81% ±  3.99%
#
#   ▄▇█▇▅▇▆▃▂▂▁      ▂▃▂                                        ▂
#   ███████████▇██▇▇███████▇▇▇▇▇▇▆▇▇▇▆▆▆▆▅▆▄▅▅▆▆▅▅▅▃▅▅▄▅▄▄▅▁▄▄▅ █
#   65 ns        Histogram: log(frequency) by time       132 ns <
#
#  Memory estimate: 96 bytes, allocs estimate: 3.


Base.summarysize(ns)    #56
Base.summarysize(nr)    #69
Base.summarysize(n0)    #56

nsS = StructArray{Node_scalar{Float32}}(undef,100)
# For Node_ref the undef constructor alone leaves the Ref fields unassigned,
# so fill the array explicitly (there has to be a better way to do this;
# one alternative I found later is sketched after the size comparison below)
nrS = StructArray{Node_ref{Float32}}(undef,100)
nrS .= [Node_ref{Float32}(rand(Bool),rand(Float32),randn(3),randn(3,3)) for i in 1:100]
n0S = StructArray{Node_0d{Float32}}(undef,100)

Base.summarysize(nsS)   # 5524 -> 55.24 bytes ea. -> ~1.3% smaller than one struct
Base.summarysize(nrS)   # 7124 -> 71.24 bytes ea. -> ~3.2% larger
Base.summarysize(n0S)   # 5524 -> 55.24 bytes ea. -> ~1.3% smaller than one struct
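One possibly better way to build these, which I found in the StructArrays docs, is to construct the StructArray directly from one component array per field. I believe the tuple has to match the struct's field order, but treat the details here as my assumption:

# Build the SoA container directly from one array per field
# (the tuple order must match the field order of Node_scalar).
n = 100
nsS2 = StructArray{Node_scalar{Float32}}((
    rand(Bool, n),                      # active
    rand(Float32, n),                   # mass
    rand(SVector{3,Float32}, n),        # velocity
    rand(SMatrix{3,3,Float32,9}, n),    # affine
))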

# Broadcastable update to fields of the StructArray ...
# -----------------------------------------------------

# for nsS of Node_scalar it can be done and is efficient

@benchmark nsS.mass.=30

# BenchmarkTools.Trial: 10000 samples with 986 evaluations.
#  Range (min … max):  54.158 ns …  1.509 μs  ┊ GC (min … max): 0.00% … 95.90%
#  Time  (median):     56.187 ns              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   59.408 ns ± 44.051 ns  ┊ GC (mean ± σ):  2.29% ±  3.02%
#
#   ▆▇█▅▃▂▁▁▁ ▁▃▄▂                                              ▂
#   ███████████████▇▅▄▆▅▆▇▆▆▅▅▅▅▅▅▅▇▆▆▆▄▄▄▄▅▄▅▄▅▁▄▄▄▄▄▅▅▄▅▅▃▅▅▆ █
#   54.2 ns      Histogram: log(frequency) by time       103 ns <
#
#  Memory estimate: 48 bytes, allocs estimate: 1.

@benchmark nsS.mass[1]=30

# BenchmarkTools.Trial: 10000 samples with 987 evaluations.
#  Range (min … max):  47.720 ns …  1.809 μs  ┊ GC (min … max): 0.00% … 96.82%
#  Time  (median):     48.936 ns              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   53.669 ns ± 45.003 ns  ┊ GC (mean ± σ):  2.56% ±  3.04%
#
#   ██▄▇▃▁    ▂▄▃                                               ▂
#   ██████▇▇▆▇████▆▆▆▇█▇▇▆▅▆▆▆▆▇▆▆▆▆▅▅▅▅▆▆▅▆▅▆▅▆▅▅▅▆▆▅▅▅▅▆▆▅▅▄▅ █
#   47.7 ns      Histogram: log(frequency) by time       103 ns <
#
#  Memory estimate: 48 bytes, allocs estimate: 1.


# for nrS of Node_ref it can be done and is efficient

@benchmark nrS.mass.=30

# BenchmarkTools.Trial: 10000 samples with 952 evaluations.
#  Range (min … max):   96.954 ns …  2.239 μs  ┊ GC (min … max): 0.00% … 94.64%
#  Time  (median):      98.845 ns              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   105.328 ns ± 62.027 ns  ┊ GC (mean ± σ):  2.06% ±  3.38%
#
#   ▇█▅▃▅▅▃▁  ▂▃▁                                                ▁
#   ████████▆▇████▇▆▅▅▅▅▅▅▄▆▄▅▆▇▅▅▅▅▅▅▄▅▅▆▆▅▅▃▅▅▆▄▅▆▅▆▅▅▆▇▆▅▄▃▃▄ █
#   97 ns         Histogram: log(frequency) by time       176 ns <
#
#  Memory estimate: 64 bytes, allocs estimate: 2.

@benchmark nrS.mass[1]=30

# BenchmarkTools.Trial: 10000 samples with 987 evaluations.
#  Range (min … max):  52.381 ns …  2.333 μs  ┊ GC (min … max): 0.00% … 95.75%
#  Time  (median):     54.306 ns              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   58.477 ns ± 60.918 ns  ┊ GC (mean ± σ):  3.64% ±  3.45%
#
#   ▇██▅▃        ▁▃▂                                            ▂
#   ███████▆▇▆▅▅▇████▆▇▆▆▅▅▅▄▅▅▅▅▅▄▃▄▄▂▅▅▅▅▄▅▅▄▄▄▅▄▃▅▄▄▄▄▄▃▅▄▅▄ █
#   52.4 ns      Histogram: log(frequency) by time       109 ns <
#
#  Memory estimate: 64 bytes, allocs estimate: 2.


# For n0S of Node_0d it can be done... but requires some gymnastics.
# Where the array we broadcast from already exists, this is efficient
# (note: `for i in length(n0S.mass)` yields a 1-element array, which then
# broadcasts across all 100 entries):

@benchmark n0S.mass .= $[[2.] for i in length(n0S.mass)]

# BenchmarkTools.Trial: 10000 samples with 982 evaluations.
#  Range (min … max):  59.878 ns …  1.968 μs  ┊ GC (min … max): 0.00% … 96.39%
#  Time  (median):     62.118 ns              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   66.336 ns ± 57.825 ns  ┊ GC (mean ± σ):  3.08% ±  3.44%
#
#   ▇▆█▇▄▃▂▁      ▁▂▃▁                                          ▂
#   █████████▆▇▆▅▇████████▇▇▇▆▇▅▆▆▆▄▄▆▅▁▆▆▄▆▅▅▄▅▆▆▅▆▅▅▅▄▆▅▆▅▆▆▄ █
#   59.9 ns      Histogram: log(frequency) by time       114 ns <
#
#  Memory estimate: 64 bytes, allocs estimate: 2.

# But allocating a temporary array is costly
@benchmark n0S.mass .= [[2.] for i in length(n0S.mass)]

# BenchmarkTools.Trial: 10000 samples with 184 evaluations.
#  Range (min … max):  575.000 ns …  15.841 μs  ┊ GC (min … max): 0.00% … 95.57%
#  Time  (median):     596.739 ns               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   634.007 ns ± 495.806 ns  ┊ GC (mean ± σ):  2.77% ±  3.43%
#
#   ▅▆█▆▄▃▃▂▁▂▂▂▁▁                                                ▂
#   ██████████████▇▇▆▇▆▆▇▅▅▆▅▆▆▆▆▆▆▅▆▇▆▆▆▅▅▆▅▆▆▅▅▅▃▅▄▄▅▅▃▆▄▄▄▅▁▄▃ █
#   575 ns        Histogram: log(frequency) by time       1.09 μs <
#
#  Memory estimate: 336 bytes, allocs estimate: 7.

# THIS IS THE WRONG WAY!!!
# cu_nsS = CuArray(nsS)
# cu_nrS = CuArray(nrS)
# cu_n0S = CuArray(n0S)
# All appear as 128 byte objects
# Base.summarysize(cu_nsS)
# Base.summarysize(cu_nrS)
# Base.summarysize(cu_n0S) 

# The correct way
cu_nsS = StructArray(CuArray(nsS))
# cu_nrS = StructArray(CuArray(nrS))  # will FAIL because NOT isbitstype(Node_ref{Float32})
#   ERROR: CuArray only supports element types that are stored inline
cu_n0S = StructArray(CuArray(n0S))
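For moving the whole container between host and device, the StructArrays docs also show a replace_storage helper that converts each component array in one go. I have not benchmarked it against the StructArray(CuArray(...)) form above, so this is just another option:

cu_nsS2  = StructArrays.replace_storage(CuArray, nsS)    # every field array becomes a CuArray
host_nsS = StructArrays.replace_storage(Array, cu_nsS2)  # and back to host Arrays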

I mostly run into issues on the mutability point, and I think I might be asking for mutually exclusive features. I can set the individual scalar values of the CuArray entries in the StructArray like

cu_nsS.mass.=2.34
cu_nsS.velocity[1]=[2.,3.,4.]

But if I create a struct from this (which now lives on the GPU and is essentially a view pointing into the arrays)

tst_struct = cu_nsS[1]

I can no longer modify any of the fields because the struct is immutable.

tst_struct.mass=2.34
tst_struct.velocity=[2.,3.,4.]
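The only workaround I have found so far is to treat each element as a value: read the struct out, build a modified copy, and write the whole thing back, as below for the CPU-side nsS. I gather Setfield.jl/Accessors.jl provide an @set macro for the same pattern, but I have not tried them:

# "Mutate" one field by replacing the whole element.
old    = nsS[1]
nsS[1] = Node_scalar{Float32}(old.active, 2.34f0, old.velocity, old.affine)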

I can see where it would be convenient to change my perspective from SoA back to struct on the fly, except that I can no longer easily modify the struct fields, and apparently materializing the struct from the GPU-backed StructArray is much, much slower than from the CPU one:

julia> @benchmark $nsS[$rand(1:100)]
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  12.513 ns … 58.458 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.814 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.294 ns ±  2.458 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        █▆ ▆▅▁▁                                               ▂
  ▆▅▁▅▅▇████████▆█▇▆▆▃▁▄▄▄▅▄▃▁▁▄▃▃▅▅▅▄▄▃▃▃▄▁▅▅▄▄▁▃▄▄▃▁▃▃▁▃▃▁▃ █
  12.5 ns      Histogram: log(frequency) by time        24 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark $cu_nsS[$rand(1:100)]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  43.000 μs … 399.200 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     51.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   61.242 μs ±  20.682 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃█▇
  ▁▂███▅▂▁▂▂▂▅▅▄▃▁▁▁▂▃▆▆▄▂▂▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  43 μs           Histogram: frequency by time          134 μs <

 Memory estimate: 2.03 KiB, allocs estimate: 12.

Many related questions:

  1. Is it possible to cheat struct immutability without incurring a large penalty? Something similar to Ref() but which satisfies the isbitstype() requirement? I imagine it would require some kind of pointer gymnastics and defining a bunch of helper functions (which would likely not be fast)? A rough sketch of the kind of helper I have in mind follows this list.
  2. Why is obtaining a struct view from a StructArray on the GPU so very slow? Does this have something to do with the scalar indexing warning I got after first creating the CUDA version of the StructArray? (see below)
  3. Is there a direct way to find the memory footprint of a struct on the GPU, similar to Base.summarysize()? I am aware of CUDA.memory_status(), but this only provides a top-level overview.
  4. Am I thinking about this all wrong? Is there a more Julian way to approach this? (My background is primarily Python and OOP; functional programming combined with abstract types has been throwing me for a loop.)
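To clarify what I mean by helper functions in question 1, this is roughly the kind of thing I imagine writing. setproperties comes from ConstructionBase.jl (which I believe StructArrays already uses internally), and setfield_at! is just a name I made up:

using ConstructionBase: setproperties

# Hypothetical helper: "set" one field of element i by rebuilding that element.
function setfield_at!(soa, i, name::Symbol, val)
    soa[i] = setproperties(soa[i], NamedTuple{(name,)}((val,)))
    return soa
end

setfield_at!(nsS, 1, :mass, 2.5f0)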

Regarding indexing into GPU arrays:

julia> StructArray(CuArray(nsS))
100-element StructArray(::CuArray{Bool, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{SVector{3, Float32}, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{SMatrix{3, 3, Float32, 9}, 1, CUDA.Mem.DeviceBuffer}) with eltype Node_scalar{Float32}:
┌ Warning: Performing scalar indexing on task Task (runnable) @0x000000000ea139e0.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.

Which I also get the first time I try to index into any CuArray from the REPL:

julia> using CUDA
julia> tst = CuArray{Float32}(undef,100)
julia> tst[1]
┌ Warning: Performing scalar indexing on task Task (runnable) @0x000000000e4d0010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore C:\Users\DIi\.julia\packages\GPUArraysCore\B3xv7\src\GPUArraysCore.jl:103
0.0f0
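For completeness, wrapping the access as the warning suggests silences it for REPL prototyping, and there is also a global switch to make accidental scalar indexing throw an error instead (both from CUDA.jl, as far as I can tell):

CUDA.@allowscalar tst[1]   # explicit opt-in for a single expression
CUDA.allowscalar(false)    # turn accidental scalar indexing into an error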

I admit I haven’t read your whole message, but you might be interested in CellArrays.jl:

This package is co-developed with ParallelStencil.jl (and is also used in GeoParams.jl). Here you can see how to use it with ParallelStencil:
https://omlins.github.io/CellArrays.jl/dev/examples/memcopyCellArray3D_ParallelStencil/
and here without ParallelStencil:
https://omlins.github.io/CellArrays.jl/dev/examples/memcopyCellArray3D/

Thanks, I will check that out.