 # CUDAnative: simple and nice way to define type for all integer/float literals?; size() returning UInt32?

The following code is the computation kernel of a small heat diffusion application in 3-D:

```julia
function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
    _dx = 1.0/dx  # _dx could obviously be passed as argument instead of dx.
    _dy = 1.0/dy
    _dz = 1.0/dz
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x  # thread ID, dimension x
    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y  # thread ID, dimension y
    iz = (blockIdx().z-1) * blockDim().z + threadIdx().z  # thread ID, dimension z
    if (ix>1 && ix<size(Te2,1) && iy>1 && iy<size(Te2,2) && iz>1 && iz<size(Te2,3))
        Te2[ix,iy,iz] = Te[ix,iy,iz] + dt*(Ci[ix,iy,iz]*(
            - ((-lam*(Te[ix+1,iy,iz] - Te[ix,iy,iz])*_dx) - (-lam*(Te[ix,iy,iz] - Te[ix-1,iy,iz])*_dx))*_dx
            - ((-lam*(Te[ix,iy+1,iz] - Te[ix,iy,iz])*_dy) - (-lam*(Te[ix,iy,iz] - Te[ix,iy-1,iz])*_dy))*_dy
            - ((-lam*(Te[ix,iy,iz+1] - Te[ix,iy,iz])*_dz) - (-lam*(Te[ix,iy,iz] - Te[ix,iy,iz-1])*_dz))*_dz)
        );
    end
    return nothing
end
```

The achieved performance is about 96% of the corresponding CUDA C application, from which it was translated one-to-one (run with `--math-mode=fast -O3 --check-bounds=no`).

Now, I get 100% of the performance of the CUDA C application if I ensure that the literals and `size()` expressions have the same type as the other variables involved in the corresponding expressions. That is, I cast all integer literals to `UInt32` and replaced the calls to `size()` directly with the corresponding `UInt32` values (`threadIdx().x` etc. being of type `UInt32`…):

```julia
function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
    _dx = 1.0/dx  # _dx could obviously be passed as argument instead of dx.
    _dy = 1.0/dy  # ...
    _dz = 1.0/dz  # ...
    ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x  # thread ID, dimension x
    iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y  # thread ID, dimension y
    iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z  # thread ID, dimension z
    if (ix>UInt32(1) && ix<UInt32(512) && iy>UInt32(1) && iy<UInt32(512) && iz>UInt32(1) && iz<UInt32(512))
        Te2[ix,iy,iz] = Te[ix,iy,iz] + dt*(Ci[ix,iy,iz]*(
            - ((-lam*(Te[ix+UInt32(1),iy,iz] - Te[ix,iy,iz])*_dx) - (-lam*(Te[ix,iy,iz] - Te[ix-UInt32(1),iy,iz])*_dx))*_dx
            - ((-lam*(Te[ix,iy+UInt32(1),iz] - Te[ix,iy,iz])*_dy) - (-lam*(Te[ix,iy,iz] - Te[ix,iy-UInt32(1),iz])*_dy))*_dy
            - ((-lam*(Te[ix,iy,iz+UInt32(1)] - Te[ix,iy,iz])*_dz) - (-lam*(Te[ix,iy,iz] - Te[ix,iy,iz-UInt32(1)])*_dz))*_dz)
        );
    end
    return nothing
end
```

Now, this is easily and quickly done in this small kernel; however, I would like to avoid doing it in the real-world application we are working on, in particular because it hurts readability. Moreover, in the real-world application I would expect a larger speedup, as index calculations make up a bigger share of the work there. Thus, my questions:

Question 1:
Can you suggest a simple and nice way to define the type for all integer/float literals within an application/kernel?

Question 2:
Is there a way to get `size()` return value(s) of type `UInt32`?

Question 3:
Is there any chance that Julia/CUDAnative will be able to choose automatically the right/best type for the literals in future?

Note:

1. Float literals should be switchable from double to single precision, as the application also works in single precision (thus, simply writing `1.0` for double precision and `1.0f0` for single precision does not do the trick).
2. I am aware that the calls to size could be made outside the kernel and then passed as kernel argument, yet this is not very convenient in the real-world application.
3. Maybe metaprogramming is needed for that?
4. In CUDA C, `threadIdx.x` etc. are of type `int` and integer literals are also of type `int`; thus, we do not usually run into such a problem with integers.
5. In CUDA C, floating-point literals are also of type `double` by default, so we have the same problem as in Julia when doing single-precision calculations (we need to cast to `float` or use the suffix `F`).
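To make note 1 and the integer situation concrete, here is a quick sketch of Julia's default literal types (checked on a 64-bit system, where `Int == Int64`):

```julia
# Default literal types in Julia on a 64-bit system.
typeof(1)          # Int64  -> integer literals are 64-bit by default
typeof(1.0)        # Float64 -> float literals default to double precision
typeof(1.0f0)      # Float32 -> the f0 suffix gives single precision
typeof(UInt32(1))  # UInt32 -> an explicit cast is needed to match threadIdx().x etc.
```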

Thanks!!!

ChangePrecision.jl comes to mind, but we don’t have a proper solution for changing the behavior of `Int`. I’m not sure it’s worth a lot of engineering time though, seeing how it only amounts to 4% of performance in your indexing-heavy kernel.

No. We used to have that (together with intrinsics like `threadIdx` returning 32-bit integers), but with 64-bit literals that quickly resulted in more expensive and often checked conversions all over the place. I’ve also heard stories of overflowing 32-bit array indices in real-life code, so it might be safer to just stick to 64-bit (even though we miss out on the Volta/Turing concurrent I32/FP execution).
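The promotion behaviour behind those conversions can be seen on the CPU side as well; in this sketch, `i` stands in for a hypothetical 32-bit thread index such as the old `threadIdx().x`:

```julia
i = UInt32(3)          # stands in for a 32-bit intrinsic result
typeof(i - 1)          # Int64 (on a 64-bit system): the 64-bit literal forces a promotion
typeof(i - UInt32(1))  # UInt32: stays 32-bit, no conversion needed
```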

Cassette might open some doors, e.g. https://github.com/JuliaGPU/CUDAnative.jl/issues/25#issuecomment-359155011, so you could have a look if you’re interested in that.

Question 1
`ChangePrecision.jl` inspired me to do my first little metaprogramming exercise using Julia’s abstract syntax tree. The following macro makes it possible to change the types of the integer and float literals of an expression at parse time:

```julia
macro literaltypes(int_type, float_type, expr)
    esc(literaltypes(eval(int_type), eval(float_type), expr))
end

function literaltypes(int_type::DataType, float_type::DataType, expr::Expr)
    args = expr.args
    for i = 1:length(args)
        if typeof(args[i]) <: Integer
            literal = int_type(args[i])
            args[i] = :($literal)
        elseif typeof(args[i]) <: AbstractFloat
            literal = float_type(args[i])
            args[i] = :($literal)
        elseif typeof(args[i]) == Expr
            args[i] = literaltypes(int_type, float_type, args[i])
        end
    end
    return expr
end
```

See here a short example:

```julia
julia> @macroexpand @literaltypes UInt32 Float32 begin (i+1)+(i+2); 7.8+1.3 end
quote
    #= REPL:1 =#
    (i + 0x00000001) + (i + 0x00000002)
    #= REPL:1 =#
    7.8f0 + 1.3f0
end
```

And here the result when applied to the diffusion kernel:

```julia
julia> @macroexpand @literaltypes UInt32 Float64 function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
           _dx = 1.0/dx;
           _dy = 1.0/dy;
           _dz = 1.0/dz;
           ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
           iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
           iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
           if (ix>1 && ix<size(Te2,1) && iy>1 && iy<size(Te2,2) && iz>1 && iz<size(Te2,3))
               Te2[ix,iy,iz] = Te[ix,iy,iz] + dt*(Ci[ix,iy,iz]*(
                   - ((-lam*(Te[ix+1,iy,iz] - Te[ix,iy,iz])*_dx) - (-lam*(Te[ix,iy,iz] - Te[ix-1,iy,iz])*_dx))*_dx
                   - ((-lam*(Te[ix,iy+1,iz] - Te[ix,iy,iz])*_dy) - (-lam*(Te[ix,iy,iz] - Te[ix,iy-1,iz])*_dy))*_dy
                   - ((-lam*(Te[ix,iy,iz+1] - Te[ix,iy,iz])*_dz) - (-lam*(Te[ix,iy,iz] - Te[ix,iy,iz-1])*_dz))*_dz)
               );
           end
           return nothing
       end
:(function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
      #= REPL:2 =#
      _dx = 1.0 / dx
      #= REPL:3 =#
      _dy = 1.0 / dy
      #= REPL:4 =#
      _dz = 1.0 / dz
      #= REPL:5 =#
      ix = ((blockIdx()).x - 0x00000001) * (blockDim()).x + (threadIdx()).x
      #= REPL:6 =#
      iy = ((blockIdx()).y - 0x00000001) * (blockDim()).y + (threadIdx()).y
      #= REPL:7 =#
      iz = ((blockIdx()).z - 0x00000001) * (blockDim()).z + (threadIdx()).z
      #= REPL:8 =#
      if ix > 0x00000001 && (ix < size(Te2, 0x00000001) && (iy > 0x00000001 && (iy < size(Te2, 0x00000002) && (iz > 0x00000001 && iz < size(Te2, 0x00000003)))))
          #= REPL:9 =#
          Te2[ix, iy, iz] = Te[ix, iy, iz] + dt * (Ci[ix, iy, iz] * ((-((-lam * (Te[ix + 0x00000001, iy, iz] - Te[ix, iy, iz]) * _dx - -lam * (Te[ix, iy, iz] - Te[ix - 0x00000001, iy, iz]) * _dx)) * _dx - (-lam * (Te[ix, iy + 0x00000001, iz] - Te[ix, iy, iz]) * _dy - -lam * (Te[ix, iy, iz] - Te[ix, iy - 0x00000001, iz]) * _dy) * _dy) - (-lam * (Te[ix, iy, iz + 0x00000001] - Te[ix, iy, iz]) * _dz - -lam * (Te[ix, iy, iz] - Te[ix, iy, iz - 0x00000001]) * _dz) * _dz))
      end
      #= REPL:15 =#
      return nothing
  end)
```

When I use this macro and manually replace the calls to `size()` with the corresponding literals, I get 100% of the CUDA C performance. Thus, question 1 is solved. However, I am still missing a solution for question 2 (“Is there a way to get `size()` return value(s) of type `UInt32`?”).

Question 2

So your answer to question 2 was:

> No. We used to have that (together with intrinsics like `threadIdx` returning 32-bit integers), but with 64-bit literals that quickly resulted in more expensive and often checked conversions all over the place. I’ve also heard stories of overflowing 32-bit array indices in real-life code, so it might be safer to just stick to 64-bit (even though we miss out on the Volta/Turing concurrent I32/FP execution).

I totally agree that it is safer to have the 64-bit version as default; yet, it would be nice to be able to change it when we are sure it causes no problem.
Is there no way to overload `size()` or `arraysize()` (which is called by `size()`) to achieve that? I looked in the Julia source code, but could not get down to where the return type is defined…
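One workaround that avoids overloading `Base.size` itself (which would be type piracy) is a small wrapper that converts on the way out. This is only a sketch; the name `size32` is my own invention, and whether the conversion gets hoisted out of the hot loop is up to the compiler, so it is not guaranteed to match hand-written `UInt32` literals:

```julia
# Hypothetical wrapper (not part of any package): returns sizes as UInt32.
size32(A, d) = UInt32(size(A, d))
size32(A)    = map(UInt32, size(A))
```

Inside the kernel one would then write e.g. `ix < size32(Te2, 1)` instead of `ix < size(Te2, 1)`.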

Question 3

> Cassette might open some doors, e.g. https://github.com/JuliaGPU/CUDAnative.jl/issues/25#issuecomment-359155011, so you could have a look if you’re interested in that.

It would be nice if something like that could be fully integrated into Julia. Good to see this might become reality.

Luckily, the engineering time required to define the type of the literals with a macro was very short for the solution I shared above. Our real-world application currently achieves 90% of the performance of the corresponding CUDA C application, and I believe much of this difference will disappear if we can solve the questions asked here. I agree that it is normally not worth spending a lot of time on a few percent of performance - one typically better invests this time in improving the numerical algorithm - but, as our application is a showcase to convince people of Julia and `CuArrays`/`CUDAnative`/`CUDAdrv`, every percentage point of performance difference with CUDA C might matter in the end.

@maleadt: The performance of the real-world application increased by 5% when I changed the integer literals to `UInt32` (using the macro `@literaltypes` shared above) and replaced the calls to `size()` by calls to `@size()`, where `@size()` is a macro that returns the sizes as `UInt32`. The macro `@size()` I hacked together is of course not at all robust and not a good general solution; I share it here anyway for completeness:

```julia
macro size(A, d) sizes_A = Symbol("sizes_", A); esc(eval(:($sizes_A[$d]))) end
macro size(A)    sizes_A = Symbol("sizes_", A); esc(eval(:($sizes_A))) end
@literaltypes UInt32 Float64 sizes_Te2 = (512,512,512)
... # Here we got a declaration for each array...
```

In conclusion: a 5% performance increase in a real-world production application, obtained just by changing the types of the integer literals and the return type of `(@)size`, is certainly not negligible. I therefore believe it would be good to give Julia users (again) the possibility to get 32-bit return values from calls to `size` if they desire.

Was this part of `CUDAnative` or of Julia itself? If it was part of `CUDAnative`, could you add (again) the possibility to get a `UInt32` from a `size()` call? I don’t suggest adding it as default behaviour - at least not as long as the literals are 64-bit by default…

It’s not as simple as changing the `size` function; the rest of the infrastructure has to follow suit (intrinsics returning 32-bit integers, every `-1` now needs to check the index type as well, etc.). Can’t you try and compile a 32-bit version of Julia, where `Int == Int32`, and avoid having to make any changes to CUDAnative?