Use a GPU subfunction in a bigger function?

Hello!

My question title is probably worded poorly, but here goes. I have written a “subfunction”, called gpu_ParticlePackStep!, which calculates properties for one particle. If I call this function directly in Julia:

gpu_ParticlePackStep!(pg,pg_tmp,u,u_tmp,idxs,3)

Then it works, since the last rows of the function (the writes to u_tmp and pg_tmp) take effect.

My problem is when I want to use multiple threads. I wrote the following kernel:

function gpu_PackStep!(pg,pg_tmp,u,u_tmp,idxs)
    index = threadIdx().x
    stride = blockDim().x
    for iter = index:stride:length(pg)
        @inbounds gpu_ParticlePackStep!(pg,pg_tmp,u,u_tmp,idxs[iter],iter)
    end
    return nothing
end

I call it using:

@cuda threads=3 gpu_PackStep!(pg,pg_tmp,u,u_tmp,idxs)

But it produces the following error:

ERROR: GPU compilation of gpu_PackStep!(CuDeviceArray{Tuple{Float32,Float32,Float32},1,CUDAnative.AS.Global}, CuDeviceArray{Tuple{Float32,Float32,Float32},1,CUDAnative.AS.Global}, CuDeviceArray{Tuple{Float32,Float32,Float32},1,CUDAnative.AS.Global}, CuDeviceArray{Tuple{Float32,Float32,Float32},1,CUDAnative.AS.Global}, Array{Array{Int64,1},1}) failed
KernelError: passing and using non-bitstype argument
Argument 6 to your kernel function is of type Array{Array{Int64,1},1}.
That type is not isbits, and such arguments are only allowed when they are unused by the kernel.

The warning seems pretty clear, but I don’t understand it, especially why it lets me run the subfunction but not the main function. The full example code is below, and works out of the box on Julia v1.4:

using CuArrays
using CUDAnative

# Random constants
const H = 0.04
const H1   = 1/H;
const AD = 348.15;
const FAC = 5/8;
const BETA = 4;
const ZETA = 0.060006;
const V    = 0.0011109;
const DT   = 0.016665;

# Generate random points / code
N = 3;
pg = CuArrays.fill(tuple(0.f0,0.f0,0.f0), N);
pg_tmp = deepcopy(pg)
u  = CuArrays.fill(tuple(0.f0,0.f0,0.f0), N);
u_tmp = deepcopy(u)

# Arbitrary ID's
idxs  = [[3,2],[3,1],[2,1]]

# Calculate for one particle - NOTE idxs[iter]
function gpu_ParticlePackStep!(pg,pg_tmp,u,u_tmp,idxs,iter)
    Wgx = 0.f0
    Wgz = 0.f0

    filter!(x -> x ≠ iter, idxs)
    @inbounds for i in idxs
        p_j  = pg[iter] .- pg[i]
        RIJ  = sqrt(sum(abs2, p_j))   # abs2 already squares each component
        RIJ1 = 1.f0 / RIJ
        q    = RIJ * H1
        qq3  = q * (q - 2)^3
        Wq   = AD * FAC * qq3

        x_ij = p_j[1]
        z_ij = p_j[3]

        Wgx += Wq * (x_ij * RIJ1) * H1
        Wgz += Wq * (z_ij * RIJ1) * H
    end

    u_i = u[iter]
    dux = (-BETA * Wgx * V - ZETA * u_i[1]) * DT
    duz = (-BETA * Wgz * V - ZETA * u_i[3]) * DT
    dx  = dux * DT
    dz  = duz * DT
    u_tmp[iter]  = u_i      .+ (dux, 0.f0, duz)   # 0.f0 keeps the tuple Float32
    pg_tmp[iter] = pg[iter] .+ (dx,  0.f0, dz)

    return nothing
end

# Do it for a lot of particles..
# Errors: isbit type
function gpu_PackStep!(pg,pg_tmp,u,u_tmp,idxs)
    index = threadIdx().x
    stride = blockDim().x
    for iter = index:stride:length(pg)
        @inbounds gpu_ParticlePackStep!(pg,pg_tmp,u,u_tmp,idxs[iter],iter)
    end
    return nothing
end

I know this is not the best way to do GPU programming, but I have to start somewhere, so this is why I have written such a poor kernel. I hope someone can spot where I am going wrong, in regards to getting this to work.

Kind regards

I think I figured it out. You are not permitted to pass an array of arrays as a kernel argument, but you can still pass a flat array. Just be sure that it is actually a CuArray, e.g. by converting it with cudaconvert.

Also, you currently cannot use filter! on the GPU, which was causing a second error.
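To make that concrete, here is a small CPU-side sketch (reusing the `idxs` from the example above) of why the original argument is rejected and one way to flatten it. Since every entry of `idxs` has the same length, it can be packed into a plain matrix whose element type is isbits, and that matrix could then be wrapped in a CuArray:

```julia
idxs = [[3,2],[3,1],[2,1]]

# An array of arrays is not an isbits type, so @cuda rejects it:
@assert !isbitstype(typeof(idxs))          # Vector{Vector{Int64}}

# Flatten into an N×2 matrix: row i holds the neighbour ids of particle i.
idxs_flat = permutedims(reduce(hcat, idxs))

@assert isbitstype(eltype(idxs_flat))      # Int64 is isbits
@assert idxs_flat[1, :] == [3, 2]

# On the GPU this would become (requires a CUDA-capable device):
# idxs_gpu = CuArray(idxs_flat)
```

The kernel would then index rows of the matrix instead of taking `idxs[iter]` as a nested array.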

Kind regards

Your first example, gpu_ParticlePackStep!(pg,pg_tmp,u,u_tmp,idxs,3), is not executed on the GPU,
but rather on the CPU. Set CuArrays.allowscalar(false) to turn that fallback behaviour into an error.

You figured out the second part of your question.

Could you elaborate on why it is not run on the GPU? And can you tell me how to ensure my GPU is used?

Kind regards

Code is only run on the GPU when you explicitly invoke it there with @cuda. Just calling a function operating on a CuArray does not guarantee that the work will be done on the GPU. Many array collectives like broadcast or linear algebra operations will, but code that loops over the indices of an array will not.

CuArrays.allowscalar(false) disallows slow operations like indexing GPU memory from the CPU, highlighting issues like the one above.
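As a sketch of the difference (assuming a CUDA-capable device and the same package versions as in the question; this will not run without one):

```julia
using CuArrays
CuArrays.allowscalar(false)

a = CuArrays.fill(1.0f0, 16)

# Broadcast is an array collective: this runs on the device.
b = a .+ 1.0f0

# Scalar indexing from the CPU now raises an error instead of
# silently falling back to a slow CPU loop:
# a[1]    # ERROR: scalar getindex is disallowed

# An explicit index loop must instead go through a kernel launched
# with @cuda, as in gpu_PackStep! above.
```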

Ah okay, I am a bit confused then about my final code, but I will have to dive deeper into it. Thanks for providing an answer.

Kind regards