The following code is the computation kernel of a small heat diffusion application in 3-D:

```
function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
    _dx = 1.0/dx  # _dx could obviously be passed as argument instead of dx.
    _dy = 1.0/dy
    _dz = 1.0/dz
    ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x  # thread ID, dimension x
    iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y  # thread ID, dimension y
    iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z  # thread ID, dimension z
    if (ix>1 && ix<size(Te2,1) && iy>1 && iy<size(Te2,2) && iz>1 && iz<size(Te2,3))
        Te2[ix,iy,iz] = Te[ix,iy,iz] + dt*(Ci[ix,iy,iz]*(
            - ((-lam*(Te[ix+1,iy,iz] - Te[ix,iy,iz])*_dx) - (-lam*(Te[ix,iy,iz] - Te[ix-1,iy,iz])*_dx))*_dx
            - ((-lam*(Te[ix,iy+1,iz] - Te[ix,iy,iz])*_dy) - (-lam*(Te[ix,iy,iz] - Te[ix,iy-1,iz])*_dy))*_dy
            - ((-lam*(Te[ix,iy,iz+1] - Te[ix,iy,iz])*_dz) - (-lam*(Te[ix,iy,iz] - Te[ix,iy,iz-1])*_dz))*_dz)
        )
    end
    return nothing
end
```

The achieved performance is about **96%** of the corresponding CUDA C application from which it was one-to-one translated (run with `--math-mode=fast -O3 --check-bounds=no`).

Now, I get **100%** of the performance of the CUDA C application if I ensure that the literals and `size()` expressions have the same type as the other variables involved in the corresponding expressions. I.e., I cast all integer literals to `UInt32` and replace the calls to `size()` directly with the corresponding `UInt32` value (`threadIdx().x` etc. being of type `UInt32`):

```
function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
    _dx = 1.0/dx  # _dx could obviously be passed as argument instead of dx.
    _dy = 1.0/dy  # ...
    _dz = 1.0/dz  # ...
    ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x  # thread ID, dimension x
    iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y  # thread ID, dimension y
    iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z  # thread ID, dimension z
    if (ix>UInt32(1) && ix<UInt32(512) && iy>UInt32(1) && iy<UInt32(512) && iz>UInt32(1) && iz<UInt32(512))
        Te2[ix,iy,iz] = Te[ix,iy,iz] + dt*(Ci[ix,iy,iz]*(
            - ((-lam*(Te[ix+UInt32(1),iy,iz] - Te[ix,iy,iz])*_dx) - (-lam*(Te[ix,iy,iz] - Te[ix-UInt32(1),iy,iz])*_dx))*_dx
            - ((-lam*(Te[ix,iy+UInt32(1),iz] - Te[ix,iy,iz])*_dy) - (-lam*(Te[ix,iy,iz] - Te[ix,iy-UInt32(1),iz])*_dy))*_dy
            - ((-lam*(Te[ix,iy,iz+UInt32(1)] - Te[ix,iy,iz])*_dz) - (-lam*(Te[ix,iy,iz] - Te[ix,iy,iz-UInt32(1)])*_dz))*_dz)
        )
    end
    return nothing
end
```
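As far as I understand, the difference comes from Julia's integer promotion rules: mixing a `UInt32` index with a plain integer literal (which is `Int64` on a 64-bit system), or with the `Int`-typed result of `size()`, promotes the whole expression to 64-bit, so the GPU ends up doing 64-bit index arithmetic. This is easy to check on the host:

```
# Mixing UInt32 with a plain integer literal promotes to Int64,
# because promote_type(UInt32, Int64) == Int64:
@assert promote_type(UInt32, Int64) == Int64
@assert typeof(UInt32(3) - 1)         == Int64   # 64-bit arithmetic
@assert typeof(UInt32(3) - UInt32(1)) == UInt32  # stays 32-bit
```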

Now, this is easily and quickly done in this simple small kernel; however, I would like to avoid doing this in the real-world application we are working on (in particular because it is not nicely readable). BTW, in the real-world application I would expect a larger speedup, as I estimate that index calculations make up a larger share of the work there. Thus, my questions:

**Question 1:**

Can you suggest a simple and nice way to define the type for all integer/float literals within an application/kernel?

**Question 2:**

Is there a way to get `size()` to return value(s) of type `UInt32`?
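To make concrete what I mean: of course I could write a wrapper like the following (hypothetical helper, just for illustration), but I would prefer not to carry such a helper through the whole application:

```
# Hypothetical wrapper around size() that returns UInt32 values:
usize(A, dim) = UInt32(size(A, dim))
usize(A)      = UInt32.(size(A))

A = zeros(4, 5, 6)
@assert usize(A, 1) === UInt32(4)
@assert usize(A)    === (UInt32(4), UInt32(5), UInt32(6))
```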

**Question 3:**

Is there any chance that Julia/CUDAnative will be able to choose automatically the right/best type for the literals in future?

Note:

- Float literals should be switchable from double to single precision (thus, using `1.0` for double precision and `1.0f0` for single precision does not just do the trick), as the application also works in single precision.
- I am aware that the calls to `size()` could be made outside the kernel and the results passed as kernel arguments, yet this is not very convenient in the real-world application.
- Maybe metaprogramming is needed for that?
- In CUDA C, `threadIdx.x` etc. are of type `int` and integer literals are also of type `int`; thus, we do not usually run into such a problem with integers.
- In CUDA C, the default float literal also being of type `double`, we have the same problem as in Julia when we do single precision calculations (we need to cast to `float` or use the suffix `F`).
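To illustrate the metaprogramming idea from the notes, here is a rough, untested sketch of what I have in mind: a macro (hypothetical name `@literals`) that recursively rewrites all integer/float literals in an expression to given types. It ignores hygiene and corner cases (e.g. literals that are already `Float32`, like `1.0f0`, are left untouched):

```
# Rough sketch only: recursively retype Int and Float64 literals in an expression.
function retype_literals(ex, Tint, Tfloat)
    ex isa Int     && return :($Tint($ex))
    ex isa Float64 && return :($Tfloat($ex))
    ex isa Expr    && return Expr(ex.head,
                                  (retype_literals(a, Tint, Tfloat) for a in ex.args)...)
    return ex  # symbols, Float32 literals, etc. pass through unchanged
end

macro literals(Tint, Tfloat, ex)
    esc(retype_literals(ex, Tint, Tfloat))
end

# Usage idea (wrapping a kernel body):
# @literals UInt32 Float32 begin
#     ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
#     ...
# end
```

With this, `@literals(UInt32, Float32, 1 + 2)` should give `UInt32(3)`, and `@literals(UInt32, Float32, 1.0)` should give `1.0f0` — but I have not thought through all the pitfalls.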

Thanks!!!