The following code is the computation kernel of a small heat diffusion application in 3-D:

```
function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
    _dx = 1.0/dx  # _dx could obviously be passed as argument instead of dx.
    _dy = 1.0/dy
    _dz = 1.0/dz
    ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x  # thread ID, dimension x
    iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y  # thread ID, dimension y
    iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z  # thread ID, dimension z
    if (ix>1 && ix<size(Te2,1) && iy>1 && iy<size(Te2,2) && iz>1 && iz<size(Te2,3))
        Te2[ix,iy,iz] = Te[ix,iy,iz] + dt*(Ci[ix,iy,iz]*(
            - ((-lam*(Te[ix+1,iy,iz] - Te[ix,iy,iz])*_dx) - (-lam*(Te[ix,iy,iz] - Te[ix-1,iy,iz])*_dx))*_dx
            - ((-lam*(Te[ix,iy+1,iz] - Te[ix,iy,iz])*_dy) - (-lam*(Te[ix,iy,iz] - Te[ix,iy-1,iz])*_dy))*_dy
            - ((-lam*(Te[ix,iy,iz+1] - Te[ix,iy,iz])*_dz) - (-lam*(Te[ix,iy,iz] - Te[ix,iy,iz-1])*_dz))*_dz)
        )
    end
    return nothing
end
```

The achieved performance is about **96%** of the corresponding CUDA C application from which it was one-to-one translated (run with `--math-mode=fast -O3 --check-bounds=no`).

Now, I get **100%** of the performance of the CUDA C application if I ensure that the literals and `size()` expressions have the same type as the other variables involved in the corresponding expressions. I.e., I cast all integer literals to `UInt32` and replace the calls to `size()` directly with the corresponding `UInt32` value (`threadIdx().x` etc. being of type `UInt32`):

```
function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
    _dx = 1.0/dx  # _dx could obviously be passed as argument instead of dx.
    _dy = 1.0/dy  # ...
    _dz = 1.0/dz  # ...
    ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x  # thread ID, dimension x
    iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y  # thread ID, dimension y
    iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z  # thread ID, dimension z
    if (ix>UInt32(1) && ix<UInt32(512) && iy>UInt32(1) && iy<UInt32(512) && iz>UInt32(1) && iz<UInt32(512))
        Te2[ix,iy,iz] = Te[ix,iy,iz] + dt*(Ci[ix,iy,iz]*(
            - ((-lam*(Te[ix+UInt32(1),iy,iz] - Te[ix,iy,iz])*_dx) - (-lam*(Te[ix,iy,iz] - Te[ix-UInt32(1),iy,iz])*_dx))*_dx
            - ((-lam*(Te[ix,iy+UInt32(1),iz] - Te[ix,iy,iz])*_dy) - (-lam*(Te[ix,iy,iz] - Te[ix,iy-UInt32(1),iz])*_dy))*_dy
            - ((-lam*(Te[ix,iy,iz+UInt32(1)] - Te[ix,iy,iz])*_dz) - (-lam*(Te[ix,iy,iz] - Te[ix,iy,iz-UInt32(1)])*_dz))*_dz)
        )
    end
    return nothing
end
```
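As far as I understand, the difference comes from Julia's integer promotion rules: mixing a `UInt32` index with a plain integer literal (which is `Int64` on a 64-bit system), or with the `Int`-typed result of `size()`, promotes the whole expression to 64-bit, so the GPU ends up doing 64-bit index arithmetic. This is easy to check on the host:

```
# Mixing UInt32 with a plain integer literal promotes to Int64,
# because promote_type(UInt32, Int64) == Int64:
@assert promote_type(UInt32, Int64) == Int64
@assert typeof(UInt32(3) - 1)         == Int64   # 64-bit arithmetic
@assert typeof(UInt32(3) - UInt32(1)) == UInt32  # stays 32-bit
```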

Now, this is easily and quickly done in this simple small kernel; however, I would like to avoid doing this in the real-world application we are working on (in particular because it is not nicely readable). BTW, in the real-world application I would expect a larger speedup, as I estimate that index calculations make up a larger share of the work there. Thus, my questions:

**Question 1:**

Can you suggest a simple and nice way to define the type for all integer/float literals within an application/kernel?

**Question 2:**

Is there a way to get `size()` to return value(s) of type `UInt32`?
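To make concrete what I mean: of course I could write a wrapper like the following (hypothetical helper, just for illustration), but I would prefer not to carry such a helper through the whole application:

```
# Hypothetical wrapper around size() that returns UInt32 values:
usize(A, dim) = UInt32(size(A, dim))
usize(A)      = UInt32.(size(A))

A = zeros(4, 5, 6)
@assert usize(A, 1) === UInt32(4)
@assert usize(A)    === (UInt32(4), UInt32(5), UInt32(6))
```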

**Question 3:**

Is there any chance that Julia/CUDAnative will be able to choose automatically the right/best type for the literals in future?

Note:

- Float literals should be switchable from double to single precision (thus, using `1.0` for double precision and `1.0f0` for single precision does not just do the trick), as the application also works in single precision.
- I am aware that the calls to `size()` could be made outside the kernel and the results passed as kernel arguments, yet this is not very convenient in the real-world application.
- Maybe metaprogramming is needed for that?
- In CUDA C, `threadIdx.x` etc. are of type `int` and integer literals are also of type `int`; thus, we do not usually run into such a problem with integers.
- In CUDA C, the default float literal also being of type `double`, we have the same problem as in Julia when we do single precision calculations (we need to cast to `float` or use the suffix `F`).
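To illustrate the metaprogramming idea from the notes, here is a rough, untested sketch of what I have in mind: a macro (hypothetical name `@literals`) that recursively rewrites all integer/float literals in an expression to given types. It ignores hygiene and corner cases (e.g. literals that are already `Float32`, like `1.0f0`, are left untouched):

```
# Rough sketch only: recursively retype Int and Float64 literals in an expression.
function retype_literals(ex, Tint, Tfloat)
    ex isa Int     && return :($Tint($ex))
    ex isa Float64 && return :($Tfloat($ex))
    ex isa Expr    && return Expr(ex.head,
                                  (retype_literals(a, Tint, Tfloat) for a in ex.args)...)
    return ex  # symbols, Float32 literals, etc. pass through unchanged
end

macro literals(Tint, Tfloat, ex)
    esc(retype_literals(ex, Tint, Tfloat))
end

# Usage idea (wrapping a kernel body):
# @literals UInt32 Float32 begin
#     ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
#     ...
# end
```

With this, `@literals(UInt32, Float32, 1 + 2)` should give `UInt32(3)`, and `@literals(UInt32, Float32, 1.0)` should give `1.0f0` — but I have not thought through all the pitfalls.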

Thanks!!!