ParallelStencil for Vector Fields

Hello,
I am trying to use ParallelStencil to solve a differential equation on a lattice where, at each lattice site, I have an array (a StaticArray) instead of a float, e.g. a vector field on a 3-D domain.

But @init_parallel_stencil works only for "plain" number types like Float64 and so on.
My idea was to create a new type, const NVect = SVector{N, Float64}, and initialize ParallelStencil with it. Then @zeros(nx,ny,nz) was supposed to create the required object (like zeros(NVect, nx,ny,nz) would). But, as said, this does not work.
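
For reference, this is roughly what I tried (N fixed to a concrete value and the CUDA backend chosen just for illustration); the initialization itself already refuses the non-primitive type:

using ParallelStencil, StaticArrays
const N = 4
const NVect = SVector{N, Float64}
@init_parallel_stencil(CUDA, NVect, 3)  # fails: only plain number types like Float64 are accepted
# the hope was that @zeros(nx,ny,nz) would then behave like zeros(NVect, nx,ny,nz)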

A workaround would be to extend my grid by one dimension for the components of the vector, e.g. x = @zeros(nx,ny,nz,N), where x[:,:,:,1] would then refer to the first component of the vector field. But I guess this is not as fast as StaticArrays, and I am not sure whether the finite differences macros would work properly on it.

I saw the Stokes3D implementation on the ParallelStencil main page, which creates a separate "grid" for each vector component. But since the number of components of my vector field is a parameter of my simulation, this is not ideal.

Any suggestions on how I could make this work? Or is there another package that is better suited for this?

Hi @gerhardu,
indeed, using such a non-primitive datatype is not currently possible, and I am not sure it is something we want to support, as it would considerably complicate applying stencil-computation-specific optimisations.

Using

eg x = @zeros(nx,ny,nz,N), where x[:,:,:,1] would then refer to the first vector component of the vector field.

as you suggest, should work fine with a @parallel_indices kernel, but not with the available finite differences macros, as these macros take just a symbol representing an array as input (you could write your own set of macros though, which is not difficult). I don't see any problems in terms of performance, at least as long as your array dimensions are multiples of 32.
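
For illustration, here is a minimal sketch (hypothetical kernel and variable names, assuming a 3-D initialization of ParallelStencil) of how a @parallel_indices kernel could loop over the trailing component dimension of x = @zeros(nx,ny,nz,N), for a simple explicit diffusion step in the x-direction only:

@parallel_indices (ix,iy,iz) function diffuse!(x2, x, dt, _dx2, N)
    if (1 < ix < size(x,1))  # skip the x-boundaries, where the stencil would reach outside
        for n = 1:N          # loop over the vector components stored in the 4th dimension
            x2[ix,iy,iz,n] = x[ix,iy,iz,n] + dt*_dx2*(x[ix+1,iy,iz,n] - 2*x[ix,iy,iz,n] + x[ix-1,iy,iz,n])
        end
    end
    return
end

# launched with explicit ranges, e.g.:
# @parallel (1:nx, 1:ny, 1:nz) diffuse!(x2, x, dt, 1/dx^2, N)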

A better option would probably be to create a vector of N arrays, where each of these arrays is created with the macro @zeros (or @ones or @rand). Then, you will likely want to splat these arrays into the kernels you are writing, as I suggested for a similar question about the co-developed package ImplicitGlobalGrid.jl. That way, you get a symbol for each array corresponding to one of the N components, which you can then work with in the kernel. With this approach, your code also remains compatible with ImplicitGlobalGrid.jl in this respect, allowing you to run your simulations on multiple distributed GPUs (or CPUs).

Here is a short MWE demonstrating the above approach in 2-D:

using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(CUDA, Float64, 2);

@parallel function compute_something!(B, Ax)
    @inn(B) = @d2_xi(Ax)
    return
end

@parallel function compute_something!(B, Ax, Ay)
    @inn(B) = @d2_xi(Ax) + @d2_yi(Ay)
    return
end

nx = ny = 4

# Test kernel with N = 1
A = [@rand(nx,ny)]
B = @zeros(nx,ny)
@parallel compute_something!(B, A...)

# Compare with way as found in mini-apps
A_x = copy(A[1])
B_ref = @zeros(nx,ny)
@parallel compute_something!(B_ref, A_x)
B_ref ≈ B

# Test kernel with N = 2
A = [@rand(nx,ny), @rand(nx,ny)]
B = @zeros(nx,ny)
@parallel compute_something!(B, A...)

# Compare with way as found in mini-apps
A_x = copy(A[1])
A_y = copy(A[2])
B_ref = @zeros(nx,ny)
@parallel compute_something!(B_ref, A_x, A_y)
B_ref ≈ B

If you execute this in your REPL, you can see that it works:

julia> using ParallelStencil


julia> using ParallelStencil.FiniteDifferences2D

julia> @init_parallel_stencil(CUDA, Float64, 2);

julia> @parallel function compute_something!(B, Ax)
           @inn(B) = @d2_xi(Ax)
           return
       end
compute_something! (generic function with 1 method)

julia> @parallel function compute_something!(B, Ax, Ay)
           @inn(B) = @d2_xi(Ax) + @d2_yi(Ay)
           return
       end
compute_something! (generic function with 2 methods)

julia> nx = ny = 4
4

julia> # Test kernel with N = 1
       A = [@rand(nx,ny)]
1-element Vector{CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
 [0.31322130418832184 0.9644593790849063 0.8421951080146028 0.8577772923657494; 0.5088175190632842 0.4177057769207575 0.4015573941471502 0.6532746859034653; 0.7671006039929464 0.8138802704821577 0.3195476513308886 0.21118995895483494; 0.9856881342175574 0.22716610365281564 0.03380085555816348 0.3556196653870136]

julia> B = @zeros(nx,ny)
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

julia> @parallel compute_something!(B, A...)

julia> # Compare with way as found in mini-apps
       A_x = copy(A[1])
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.313221  0.964459  0.842195   0.857777
 0.508818  0.417706  0.401557   0.653275
 0.767101  0.81388   0.319548   0.21119
 0.985688  0.227166  0.0338009  0.35562

julia> B_ref = @zeros(nx,ny)
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

julia> @parallel compute_something!(B_ref, A_x)

julia> B_ref ≈ B
true

julia> # Test kernel with N = 2
       A = [@rand(nx,ny), @rand(nx,ny)]
2-element Vector{CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
 [0.11648030379108043 0.7016918423717395 0.3813365394157098 0.43472254813509736; 0.6678345880903644 0.8885012054871835 0.6491284039029075 0.8524135900243921; 0.749178219318515 0.3771530364220903 0.1633290416089188 0.6699294296751979; 0.22483500649211696 0.8619593996199939 0.5208627399150867 0.632580958632444]
 [0.1456817774423882 0.0599120126893542 0.6458892447881615 0.3584689704419577; 0.2664573828960062 0.5348710989611567 0.4653876225401121 0.931384898871823; 0.32879853389624825 0.010032647605189515 0.5631038579919612 0.10461129086618959; 0.9139882178251839 0.55668895067905 0.2144905738173768 0.03700448570805115]

julia> B = @zeros(nx,ny)
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

julia> @parallel compute_something!(B, A...)

julia> # Compare with way as found in mini-apps
       A_x = copy(A[1])
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.11648   0.701692  0.381337  0.434723
 0.667835  0.888501  0.649128  0.852414
 0.749178  0.377153  0.163329  0.669929
 0.224835  0.861959  0.520863  0.632581

julia> A_y = copy(A[2])
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.145682  0.059912   0.645889  0.358469
 0.266457  0.534871   0.465388  0.931385
 0.328799  0.0100326  0.563104  0.104611
 0.913988  0.556689   0.214491  0.0370045

julia> B_ref = @zeros(nx,ny)
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

julia> @parallel compute_something!(B_ref, A_x, A_y)

julia> B_ref ≈ B
true

Let us know if this approach suits you!


Thanks for your comprehensive answer!

It's too bad that non-primitive datatypes do not work with ParallelStencil!
In lattice simulations in high-energy physics, non-primitive datatypes at lattice sites are very common (mostly matrices), so it would have been great if ParallelStencil could handle them. Anyway, your suggested approach works for my N-component vector case.

About the x = @zeros(nx,ny,nz,N) implementation:
The array dimensions should be multiples of 32 for the GPU, right? Because of the threads per block?
Is this value independent of the datatype, i.e. the same for Float32 and Float64?
My N will be less than 32, roughly 1, 4, 8 or 16, so I guess I will go with the second approach, splatting these arrays and writing the kernel for the various N.

Thanks again!

It's too bad that non-primitive datatypes do not work with ParallelStencil!

As noted above, it is not just that it does not work currently; I also don't think we should have it that way in ParallelStencil. I think it would be better to instead consider adding a new datatype to ParallelStencil for, e.g., vectors or tuples of arrays. Then, we could define macros for common operations on this new datatype.

Could you give an example of the mathematical expressions you would like to evaluate in such a kernel?
Can they be expressed mathematically in the same way for each N, with N being just a parameter, or do they differ mathematically depending on N (e.g., when going from 2-D to 3-D, we often have additional physics to consider)? In the latter case you will definitely want to write a separate kernel for each case; in the former case, we could avoid explicitly writing multiple kernels by creating some macros that take N as a parameter, as sketched below.
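
For illustration only, here is a hypothetical sketch of one way to avoid hand-writing a kernel per N already today, using plain metaprogramming to generate the splatted-arrays kernels of the MWE above (the same operation is applied to every component here, which may or may not match your physics):

for N in (1, 2, 4, 8)
    As   = [Symbol(:A, i) for i in 1:N]                              # A1, A2, ...
    rhs  = foldl((a, b) -> :($a + $b), [:(@d2_xi($A)) for A in As])  # @d2_xi(A1) + @d2_xi(A2) + ...
    name = Symbol(:compute_something_, N, :!)                        # compute_something_1!, ...
    @eval @parallel function $name(B, $(As...))
        @inn(B) = $rhs
        return
    end
end

# usage, e.g. for N = 4:
# A = [@rand(nx,ny) for _ in 1:4]
# @parallel compute_something_4!(B, A...)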

The array dimensions should be multiples of 32 for the GPU, right?

In general, that will lead to a good memory access pattern (no matter whether you use Float32 or Float64).

Sorry, I don't quite understand what you mean: the first sentence sounds like there is no way to do this, but in the next one you suggest it would be possible by adding a new datatype to ParallelStencil. Would this be something you could do, or something I should do?

About the mathematical expressions:
In my case, where I have an N-component vector at each lattice site, the kernel basically calculates the "length" of this vector at each lattice site and uses it in the update procedure of the N-component vector.

Similarly, other projects involve matrices at each lattice site (e.g. SU(N) matrices, or even vectors of SU(N) matrices), and the update procedure of these objects involves matrix multiplications.

As you can see, I can easily write an analytical expression for the "length" calculation of the N-component vector, but matrix multiplications would be much more involved.
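
For concreteness, with the splatted-arrays approach the "length" kernel I have in mind would look roughly like this for N = 4 (hypothetical names, 2-D for brevity, assuming the same initialization as in your MWE above):

@parallel function vector_length!(L, A1, A2, A3, A4)
    @all(L) = sqrt(@all(A1)*@all(A1) + @all(A2)*@all(A2) + @all(A3)*@all(A3) + @all(A4)*@all(A4))
    return
end

# L = @zeros(nx,ny);  A = [@rand(nx,ny) for _ in 1:4]
# @parallel vector_length!(L, A...)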

There is a subtle difference between the datatype you suggested originally and the one I think we should consider: you suggested supporting a new datatype as the number type, leading, e.g., to arrays of vectors or tuples in the application; I think we should instead consider a datatype for, e.g., vectors or tuples of arrays. It is just a matter of how things are stored in memory, and for ParallelStencil I think the latter is better. That said, yes, adding this new datatype is something that we could do.
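
To make the distinction concrete in plain Julia (an illustration only, not ParallelStencil code):

using StaticArrays
N, nx, ny = 4, 32, 32

# (a) your original suggestion: one array whose elements are vectors ("array of structs")
A_aos = zeros(SVector{N, Float64}, nx, ny)

# (b) what I propose to consider instead: a collection of N plain arrays ("struct of arrays")
A_soa = ntuple(_ -> zeros(Float64, nx, ny), N)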