Hello!
My question title is probably worded poorly, but here goes. I have made a “subfunction” called gpu_ParticlePackStep!,
which calculates properties for one particle. If I call this function as such in Julia:
gpu_ParticlePackStep!(pg,pg_tmp,u,u_tmp,idxs,3)
Then it works: the last entry (iter = 3) of pg_tmp and u_tmp gets updated.
My problem arises when I want to use multiple threads. I wrote the kernel as:
function gpu_PackStep!(pg, pg_tmp, u, u_tmp, idxs)
    index  = threadIdx().x
    stride = blockDim().x
    for iter = index:stride:length(pg)
        @inbounds gpu_ParticlePackStep!(pg, pg_tmp, u, u_tmp, idxs[iter], iter)
    end
    return nothing
end
I call it using:
@cuda threads=3 gpu_PackStep!(pg,pg_tmp,u,u_tmp,idxs)
But it produces this error:
ERROR: GPU compilation of gpu_PackStep!(CuDeviceArray{Tuple{Float32,Float32,Float32},1,CUDAnative.AS.Global}, CuDeviceArray{Tuple{Float32,Float32,Float32},1,CUDAnative.AS.Global}, CuDeviceArray{Tuple{Float32,Float32,Float32},1,CUDAnative.AS.Global}, CuDeviceArray{Tuple{Float32,Float32,Float32},1,CUDAnative.AS.Global}, Array{Array{Int64,1},1}) failed
KernelError: passing and using non-bitstype argument
Argument 6 to your kernel function is of type Array{Array{Int64,1},1}.
That type is not isbits, and such arguments are only allowed when they are unused by the kernel.
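From what I can tell, the complaint is about isbitstype. This little CPU-side check (my own understanding of the error, not something from the docs) seems to confirm that my nested idxs is the offender:

```julia
# A Vector{Vector{Int}} stores pointers to heap-allocated inner arrays,
# so neither it nor its element type is isbits.
idxs = [[3, 2], [3, 1], [2, 1]]
println(isbitstype(typeof(idxs)))      # false
println(isbitstype(eltype(idxs)))      # false (eltype is Vector{Int64})

# Tuples of plain integers, by contrast, are isbits.
idxs_tup = [(3, 2), (3, 1), (2, 1)]
println(isbitstype(eltype(idxs_tup)))  # true
```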
The error seems pretty clear, but I don’t understand why it lets me run the subfunction directly yet not the kernel. The full example code is below and works out of the box on Julia v1.4:
using CuArrays
using CUDAnative
# Random constants
const H = 0.04
const H1 = 1/H;
const AD = 348.15;
const FAC = 5/8;
const BETA = 4;
const ZETA = 0.060006;
const V = 0.0011109;
const DT = 0.016665;
# Generate random points / code
N = 3;
pg = CuArrays.fill(tuple(0.f0, 0.f0, 0.f0), N);
pg_tmp = deepcopy(pg)
u = CuArrays.fill(tuple(0.f0, 0.f0, 0.f0), N);
u_tmp = deepcopy(u)
# Arbitrary ID's
idxs = [[3,2],[3,1],[2,1]]
# Calculate for one particle - NOTE idxs[iter]
function gpu_ParticlePackStep!(pg, pg_tmp, u, u_tmp, idxs, iter)
    Wgx = 0.f0
    Wgz = 0.f0
    filter!(x -> x ≠ iter, idxs)
    @inbounds for i in idxs
        p_j  = pg[iter] .- pg[i]
        RIJ  = sqrt(sum(abs2, p_j))  # Euclidean distance; abs2 already squares
        RIJ1 = 1.f0 / RIJ
        q    = RIJ * H1
        qq3  = q * (q - 2)^3
        Wq   = AD * FAC * qq3
        x_ij = p_j[1]
        z_ij = p_j[3]
        Wgx += Wq * (x_ij * RIJ1) * H1
        Wgz += Wq * (z_ij * RIJ1) * H
    end
    u_i = u[iter]
    dux = (-BETA * Wgx * V - ZETA * u_i[1]) * DT
    duz = (-BETA * Wgz * V - ZETA * u_i[3]) * DT
    dx  = dux * DT
    dz  = duz * DT
    u_tmp[iter]  = u_i .+ (dux, 0.f0, duz)    # 0.f0 keeps the tuple Float32
    pg_tmp[iter] = pg[iter] .+ (dx, 0.f0, dz)
    return nothing
end
# Do it for a lot of particles..
# Errors: isbits type
function gpu_PackStep!(pg, pg_tmp, u, u_tmp, idxs)
    index  = threadIdx().x
    stride = blockDim().x
    for iter = index:stride:length(pg)
        @inbounds gpu_ParticlePackStep!(pg, pg_tmp, u, u_tmp, idxs[iter], iter)
    end
    return nothing
end
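For what it’s worth, one workaround I have been considering (I’m not sure it is the idiomatic one) is to store the neighbour lists in an isbits-friendly layout instead of nested Vectors — either tuples, since each entry of my idxs happens to have length 2, or a flat vector plus offsets. A CPU-only sketch of both layouts:

```julia
# Option A: fixed-length neighbour lists as tuples -- Tuple{Int,Int} is isbits,
# so a CuArray of these should be a legal kernel argument.
idxs_tup = [(3, 2), (3, 1), (2, 1)]
println(isbitstype(eltype(idxs_tup)))  # true

# Option B: variable-length lists flattened into one vector plus offsets;
# the neighbours of particle i are flat[offsets[i]:offsets[i+1]-1].
flat    = [3, 2,  3, 1,  2, 1]
offsets = [1, 3, 5, 7]
neighbours(i) = flat[offsets[i]:offsets[i+1]-1]
println(neighbours(2))  # [3, 1]
```

Either way, the filter!(x -> x ≠ iter, idxs) call inside gpu_ParticlePackStep! would have to become a plain skip (an i ≠ iter check inside the loop), since neither layout can be mutated in place on the device.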
I know this is not the best way to do GPU programming, but I have to start somewhere, which is why I have written such a poor kernel. I hope someone can spot where I am going wrong in getting this to work.
Kind regards