CUDAnative: simple and nice way to define type for all integer/float literals?; size() returning UInt32?

The following code is the computation kernel of a small heat diffusion application in 3-D:

function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
    _dx = 1.0/dx; # _dx could obviously be passed as argument instead of dx.
    _dy = 1.0/dy;
    _dz = 1.0/dz;
    ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x # thread ID, dimension x
    iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y # thread ID, dimension y
    iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z # thread ID, dimension z
    if (ix>1 && ix<size(Te2,1) && iy>1 && iy<size(Te2,2) && iz>1 && iz<size(Te2,3))
        Te2[ix,iy,iz] = Te[ix,iy,iz] + dt*(Ci[ix,iy,iz]*(
                        - ((-lam*(Te[ix+1,iy,iz] - Te[ix,iy,iz])*_dx) - (-lam*(Te[ix,iy,iz] - Te[ix-1,iy,iz])*_dx))*_dx
                        - ((-lam*(Te[ix,iy+1,iz] - Te[ix,iy,iz])*_dy) - (-lam*(Te[ix,iy,iz] - Te[ix,iy-1,iz])*_dy))*_dy
                        - ((-lam*(Te[ix,iy,iz+1] - Te[ix,iy,iz])*_dz) - (-lam*(Te[ix,iy,iz] - Te[ix,iy,iz-1])*_dz))*_dz)
                        );
    end
    return nothing
end

The achieved performance is about 96% of the corresponding CUDA C application from which it was one-to-one translated (run with -math-mode=fast -O3 --check-bounds=no).

Now, I get 100% of the performance of the CUDA C application if I ensure that the literals and size() expressions have the same type as the other variables involved in the corresponding expressions. I.e., I cast all integer literals to UInt32 and replaced the calls to size() directly with the corresponding UInt32 value (threadIdx().x etc being of type UInt32…):

function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
    _dx = 1.0/dx;  # _dx could obviously be passed as argument instead of dx.
    _dy = 1.0/dy;  # ...
    _dz = 1.0/dz;  # ...
    ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x # thread ID, dimension x
    iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y # thread ID, dimension y
    iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z # thread ID, dimension z
    if (ix>UInt32(1) && ix<UInt32(512) && iy>UInt32(1) && iy<UInt32(512) && iz>UInt32(1) && iz<UInt32(512))
            Te2[ix,iy,iz] = Te[ix,iy,iz] + dt*(Ci[ix,iy,iz]*(
                            - ((-lam*(Te[ix+UInt32(1),iy,iz] - Te[ix,iy,iz])*_dx) - (-lam*(Te[ix,iy,iz] - Te[ix-UInt32(1),iy,iz])*_dx))*_dx
                            - ((-lam*(Te[ix,iy+UInt32(1),iz] - Te[ix,iy,iz])*_dy) - (-lam*(Te[ix,iy,iz] - Te[ix,iy-UInt32(1),iz])*_dy))*_dy
                            - ((-lam*(Te[ix,iy,iz+UInt32(1)] - Te[ix,iy,iz])*_dz) - (-lam*(Te[ix,iy,iz] - Te[ix,iy,iz-UInt32(1)])*_dz))*_dz)
                            );
    end
    return nothing
end

Now, this is easily and quickly done in this simple small kernel; however, I would like to avoid doing this in the real-world application we are working on (in particular, because it is not nicely readable). BTW, in the real-world application, I would expect a larger speedup as I would estimate that the amount of index calculations is more important. Thus, my questions:

Question 1:
Can you suggest a simple and nice way to define the type for all integer/float literals within an application/kernel?

Question 2:
Is there a way to get size() return value(s) of type UInt32?

Question 3:
Is there any chance that Julia/CUDAnative will be able to choose automatically the right/best type for the literals in future?

Note:

  1. Float literals should be switchable from double to single precision (thus, using 1.0 for single precision and 1.0f0 for single precision does not just do the trick) as the application works also in single precision.
  2. I am aware that the calls to size could be made outside the kernel and then passed as kernel argument, yet this is not very convenient in the real-world application.
  3. Maybe metaprogramming is needed for that?
  4. In CUDA C threadIdx.x etc are of type int and integer literals are of type int; thus, we do not usually run into such a problem with integers.
  5. In CUDA C the standard literal being also of type double, we have the same problem as in Julia when we do single precision calculations (we need to cast to float or use the suffix F).

Thanks!!!

ChangePrecision.jl comes to mind, but we don’t have a proper solution for changing the behavior of Int. I’m not sure it’s worth a lot of engineering time though, seeing how it only matters 4% of performance in your indexing-heavy kernel.

No. We used to have that (together with intrinsics like threadIdx returning 32-bit integers), but with 64-bit literals that quickly resulted in more expensive and often checked conversions all over the place. I’ve also heard stories of overflowing 32-bit array indices in real-life code, so it might be safer to just stick to 64-bit (even though we miss out on the Volta/Turing concurrent I32/FP execution).

Cassette might open some doors, e.g. https://github.com/JuliaGPU/CUDAnative.jl/issues/25#issuecomment-359155011, so you could have a look if you’re interested in that.

Thanks @maleadt.

Question 1
ChangePrecision.jl inspired me to do my first little metaprogramming activity using Julia’s abstract syntax tree :slightly_smiling_face:. The following macro enables to change the types of the integer and float literals of an expression at parse time:

macro literaltypes(int_type, float_type, expr) esc(literaltypes(eval(int_type), eval(float_type), expr))  end

function literaltypes(int_type::DataType, float_type::DataType, expr::Expr)
	args = expr.args
	for i=1:length(args)
		if typeof(args[i]) <: Integer
			literal = int_type((args[i]))
			args[i] = :($literal)
		elseif typeof(args[i]) <: AbstractFloat
			literal = float_type((args[i]))
			args[i] = :($literal)
		elseif typeof(args[i]) == Expr
			args[i] = literaltypes(int_type, float_type, args[i])
		end
	end
	return expr
end

See here a short example:

julia> @macroexpand @literaltypes UInt32 Float32 begin (i+1)+(i+2); 7.8+1.3 end
quote
    #= REPL[109]:1 =#
    (i + 0x00000001) + (i + 0x00000002)
    #= REPL[109]:1 =#
    7.8f0 + 1.3f0
end

And here the result when applied to the diffusion kernel:

julia> @macroexpand @literaltypes UInt32 Float64 function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
           _dx = 1.0/dx;
           _dy = 1.0/dy;
           _dz = 1.0/dz;
               ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
               iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
               iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
           if (ix>1 && ix<size(Te2,1) && iy>1 && iy<size(Te2,2) && iz>1 && iz<size(Te2,3))
               Te2[ix,iy,iz] = Te[ix,iy,iz] + dt*(Ci[ix,iy,iz]*(
                               - ((-lam*(Te[ix+1,iy,iz] - Te[ix,iy,iz])*_dx) - (-lam*(Te[ix,iy,iz] - Te[ix-1,iy,iz])*_dx))*_dx
                               - ((-lam*(Te[ix,iy+1,iz] - Te[ix,iy,iz])*_dy) - (-lam*(Te[ix,iy,iz] - Te[ix,iy-1,iz])*_dy))*_dy
                               - ((-lam*(Te[ix,iy,iz+1] - Te[ix,iy,iz])*_dz) - (-lam*(Te[ix,iy,iz] - Te[ix,iy,iz-1])*_dz))*_dz)
                               );
           end
           return nothing
       end
:(function diffusion3D_kernel!(Te2, Te, Ci, lam, dt, dx, dy, dz)
      #= REPL[118]:2 =#
      _dx = 1.0 / dx
      #= REPL[118]:3 =#
      _dy = 1.0 / dy
      #= REPL[118]:4 =#
      _dz = 1.0 / dz
      #= REPL[118]:5 =#
      ix = ((blockIdx()).x - 0x00000001) * (blockDim()).x + (threadIdx()).x
      #= REPL[118]:6 =#
      iy = ((blockIdx()).y - 0x00000001) * (blockDim()).y + (threadIdx()).y
      #= REPL[118]:7 =#
      iz = ((blockIdx()).z - 0x00000001) * (blockDim()).z + (threadIdx()).z
      #= REPL[118]:8 =#
      if ix > 0x00000001 && (ix < size(Te2, 0x00000001) && (iy > 0x00000001 && (iy < size(Te2, 0x00000002) && (iz > 0x00000001 && iz < size(Te2, 0x00000003)))))
          #= REPL[118]:9 =#
          Te2[ix, iy, iz] = Te[ix, iy, iz] + dt * (Ci[ix, iy, iz] * ((-((-lam * (Te[ix + 0x00000001, iy, iz] - Te[ix, iy, iz]) * _dx - -lam * (Te[ix, iy, iz] - Te[ix - 0x00000001, iy, iz]) * _dx)) * _dx - (-lam * (Te[ix, iy + 0x00000001, iz] - Te[ix, iy, iz]) * _dy - -lam * (Te[ix, iy, iz] - Te[ix, iy - 0x00000001, iz]) * _dy) * _dy) - (-lam * (Te[ix, iy, iz + 0x00000001] - Te[ix, iy, iz]) * _dz - -lam * (Te[ix, iy, iz] - Te[ix, iy, iz - 0x00000001]) * _dz) * _dz))
      end
      #= REPL[118]:15 =#
      return nothing
  end)

When I use this macro and I replace manually the calls to size() with the corresponding literal, then I get 100% of the CUDA C performance. Thus, I have a solution for question 1. However, I am still missing a solution for question 2 (“Is there a way to get size() return value(s) of type UInt32?”).

Question 2

So your answer to question 2 was:

No. We used to have that (together with intrinsics like threadIdx returning 32-bit integers), but with 64-bit literals that quickly resulted in more expensive and often checked conversions all over the place. I’ve also heard stories of overflowing 32-bit array indices in real-life code, so it might be safer to just stick to 64-bit (even though we miss out on the Volta/Turing concurrent I32/FP execution).

I totally agree that it is safer to have the 64-bit version as default; yet, it would be nice if we could change it when we are sure that it is no problem.
Is there no way to overload size() or arraysize() (which is called by size()) to achieve that? I looked in the Julia source code, but could not get down to where one would define the return type…

Question 3
Your answer to question 3 was:

Cassette might open some doors, e.g. https://github.com/JuliaGPU/CUDAnative.jl/issues/25#issuecomment-359155011, so you could have a look if you’re interested in that.

It would be nice if something like that could be fully integrated to Julia. Good to see this might become real.

Luckily, the required engineering time to define the type of the literals with a macro was very short for the solution I shared above. Our real-world application achieves currently 90% of the corresponding CUDA C application and I believe much of this difference will disappear if we can solve the questions asked here. I agree that normally it is not worth spending a lot of time on a few percent of performance - one typically better uses this time to improve the numerical algorithm - but, as our application is a showcase to convince people of Julia and CuArrays/CUDAnative/CUDAdrv, every percentage of performance difference with CUDA C might be of importance in the end. :wink:

@maleadt: The performance of the realworld application increased by 5% when I changed the integer literals to UInt32 (using the marco @literaltypes shared above) and replaced the calls to size() by calls to @size(), where @size() is a macro that returns the sizes as UInt32. Now, the macro @size() I hacked is of course not at all robust and not a good general solution. I share it here anyway for completeness:

macro size(A, d) sizes_A=Symbol("sizes_",A); esc(eval(:($sizes_A[$d]))) end
macro size(A)    sizes_A=Symbol("sizes_",A); esc(eval(:($sizes_A))) end
@literaltypes UInt32 Float64 sizes_Te2      = (512,512,512)
... # Here we got a declaration for each array...

In conclusion: 5% performance increase in a realworld production application obtained just by changing the types of the integer literals and the return type of (@)size is certainly not negligible. I believe therefore that it would be good to give the Julia user (again) the possibility to get 32-bit return values from calls to size if he desires.

@maleadt you wrote:

We used to have that (together with intrinsics like threadIdx returning 32-bit integers)

Was this part of CUDAnative or of Julia itself? If it was part of CUDAnative, could you add (again) the possibility to get a UInt32 from a size() call? I don’t suggest to add it as default behaviour - at least not as long as the literals are 64-bit by default…

It’s not as simple as changing the size function, the rest of the infrastructure has to follow suit (intrinsics returning 32-bit integers, every -1 now needs to check the index type as well, etc). Can’t you try and compile a 32-bit version of Julia, where Int == Int32, and avoid having to make any changes to CUDAnative?