Hi,
Yes, this looks fine. It might be slightly better to use 32-bit literals (1i32, after using CUDA: i32), and ifelse could potentially be faster than the ternary operator. But I can’t measure any difference, so it probably doesn’t really matter here.
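As a minimal sketch of both suggestions (the kernel and array names here are illustrative, not from the thread): 1i32 keeps index arithmetic in Int32, and ifelse selects between two already-evaluated values instead of branching.

```julia
using CUDA
using CUDA: i32

function clamp_kernel!(out, x)
    # blockIdx/blockDim/threadIdx return Int32; with 1i32 the arithmetic stays Int32
    i = (blockIdx().x - 1i32) * blockDim().x + threadIdx().x
    if i <= length(out)
        # ifelse evaluates both arguments and picks one, avoiding a branch
        @inbounds out[i] = ifelse(x[i] > 0f0, x[i], 0f0)
    end
    return nothing
end

# Hypothetical launch:
# x = CUDA.rand(Float32, 1024); out = similar(x)
# @cuda threads=256 blocks=4 clamp_kernel!(out, x)
```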
The reason is that this would create a CuVector, i.e. allocate memory, which is not allowed inside kernels (or at least not in this manner). While these error messages are often hard to interpret, here it does explicitly mention allocating memory:
```
ERROR: InvalidIRError: (...)
Reason: unsupported call to an unknown function (call to jl_alloc_genericmemory)
Stacktrace:
 [1] GenericMemory
```
In contrast, rand(Float32) returns a simple scalar. Inside a kernel, this value automatically resides on the device.
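For concreteness, a sketch of a kernel using the device-side RNG (the kernel name is made up for illustration):

```julia
using CUDA

function noise_kernel!(out)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        # rand(Float32) inside a kernel produces a plain scalar from the
        # device-side RNG; nothing is allocated.
        @inbounds out[i] = rand(Float32)
    end
    return nothing
end

# Hypothetical launch:
# out = CUDA.zeros(Float32, 1024)
# @cuda threads=256 blocks=4 noise_kernel!(out)
```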
I’m not familiar with this library, but presumably it allocates, is type-unstable, or uses non-isbits structs that have not been adapted for the GPU.
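One quick check you can do yourself: kernel arguments essentially need to be isbits (plain immutable data with no heap pointers), or convertible to an isbits form via Adapt.jl. The struct names below are just examples:

```julia
struct Params          # isbits: only plain immutable fields
    a::Float32
    b::Int32
end

struct Bad             # not isbits: contains a heap-allocated Vector
    v::Vector{Float32}
end

isbitstype(Params)  # true  -> fine as a kernel argument
isbitstype(Bad)     # false -> needs adaptation (or a different design)
```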
By the way, in
you can just use sum(d_results), which performs the summation on the GPU and returns the scalar result on the CPU.
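In other words (assuming d_results is a CuArray of Float32, as in the thread):

```julia
using CUDA

d_results = CUDA.rand(Float32, 10_000)  # illustrative data

# The reduction runs on the GPU; only the final scalar is copied back,
# so `total` is an ordinary Float32 on the CPU.
total = sum(d_results)
```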