I would like to do something similar to the following:
f(x) = x+1, x+2
a = Vector{Int}(undef, 10)
b = similar(a)
a, b .= f.(1:10)
which of course does not work. For context, I have a trivially parallel operation to perform that I want to at least test as a GPU kernel; each operation produces several resulting values and it would be most natural to store each in its own array. So a, b, and 1:10 would all be CuArrays. I could of course write results to an array-of-structs if there’s not a more elegant solution.
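A minimal CPU sketch of that array-of-structs fallback, with plain Vectors standing in for CuArrays. The intermediate name ab is only illustrative, and whether this is performant on the GPU is untested here:

```julia
# CPU-only sketch: Vector stands in for CuArray (an assumption for illustration).
f(x) = (x + 1, x + 2)

a = Vector{Int}(undef, 10)
b = similar(a)

# A single broadcast produces a Vector of 2-tuples (array-of-structs layout).
ab = f.(1:10)

# Split the tuples into the preallocated arrays with in-place broadcasts.
a .= first.(ab)
b .= last.(ab)
```

This costs an intermediate array of tuples plus two extra passes over it, which is why a more direct destructuring would be preferable.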
Can anyone think of an elegant way to implement this kind of “destructuring” of a broadcast result that will be performant with CuArrays? And please correct me if this is a misconception, but I believe it will not be efficient to just write a for loop when working with CuArrays, unless I want to get low-level and write code using CUDAnative myself.