ParallelStencil for Vector Fields

Thanks for your comprehensive answer!

It’s too bad that non-primitive datatypes do not work with ParallelStencil!
In lattice simulations in high-energy physics, non-primitive datatypes at lattice sites (mostly matrices) are very common, so it would have been great if ParallelStencil could handle them. Anyway, it works for my N-component vector case.

About the x = @zeros(nx,ny,nz,N) implementation:
The array length should be a multiple of 32 for the GPU, right? Is that related to the threads per block?
Is this value independent of the datatype, i.e. the same for Float32 and Float64?
My N will be less than 32, roughly 1, 4, 8, or 16, so I guess I will go with the second approach: splatting these arrays and rewriting the kernel for each N.

Thanks again!