Are there any benchmarks between CUDA C++ and CUDA.jl?
We did a comparison against CUDA C with the Rodinia benchmark suite when originally developing CUDA.jl, and the results were good: kernels written in Julia, in the same style as you would write them in C, perform on average pretty much the same. The paper can be found here, but it’s a couple of years old.
There are a couple of things to watch out for:

- In Julia, array accesses (and many other operations) can throw, which introduces additional blocks in the generated code. These can be avoided by e.g. using `@inbounds`.
- Be careful about value conversions: Julia defaults to 64-bit integers, which can make literals and address calculations consume more registers. This generally only matters when micro-optimizing a kernel, though; see the sketch below.
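For concreteness, here is a minimal sketch of both points (the kernel name, sizes, and launch configuration are made up for illustration; it assumes a recent CUDA.jl where the indexing intrinsics return `Int32`):

```julia
using CUDA

# Hypothetical kernel: copy `src` into `dst` element-wise.
function copy_kernel!(dst, src)
    # Explicit Int32(1) keeps the index arithmetic 32-bit instead of
    # promoting to Int64 via a bare literal.
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(dst)
        # @inbounds elides the bounds check and the exception block the
        # compiler would otherwise generate for these accesses.
        @inbounds dst[i] = src[i]
    end
    return
end

dst = CUDA.zeros(Float32, 1024)
src = CUDA.rand(Float32, 1024)
@cuda threads=256 blocks=cld(length(dst), 256) copy_kernel!(dst, src)
```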
Is something like this problematic, because the `1` is an `Int64`?

```julia
out[yrot_int + 1, xrot_int + 1, c] += xdiff_1minus * ydiff_1minus * o
```
I can’t speak to whether this is a problem in CUDA, but I can suggest `yrot_int + oneunit(yrot_int)`, `yrot_int + oftype(yrot_int, 1)`, or `yrot_int + true` as operations that will increment `yrot_int` while preserving its type.
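A quick REPL check (the variable name is made up) illustrates why: a bare `1` is an `Int64` literal, so mixed arithmetic promotes, while the alternatives keep the narrower type:

```julia
julia> x = Int32(7);

julia> typeof(x + 1)           # literal 1 is Int64; the result promotes
Int64

julia> typeof(x + oneunit(x))  # oneunit(x) is Int32(1)
Int32

julia> typeof(x + oftype(x, 1))
Int32

julia> typeof(x + true)        # Bool always promotes to the other type
Int32
```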
It would be; however, much of Julia’s indexing infrastructure currently assumes `Int`, so your indices would likely get promoted anyway. In WIP: Add an index typevar to CuDeviceArray. by maleadt · Pull Request #1895 · JuliaGPU/CUDA.jl · GitHub, I’m experimenting with trying to preserve `Int32` indices, but it’s tricky.
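You can see that promotion on the host side: `Base.to_indices`, which the standard indexing path goes through, converts integer indices to `Int` (i.e. `Int64` on a 64-bit machine):

```julia
julia> A = rand(Float32, 4, 4);

julia> idx = Base.to_indices(A, (Int32(2), Int32(3)))
(2, 3)

julia> typeof.(idx)
(Int64, Int64)
```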