I would like to do something similar to the following:
```julia
f(x) = x+1, x+2
a = Vector{Int}(undef, 10)
b = similar(a)
a, b .= f.(1:10)
```
which of course does not work. For context, I have a trivially parallel operation to perform that I want to at least test as a GPU kernel; each operation produces several resulting values and it would be most natural to store each in its own array. So a, b, and 1:10 would all be CuArrays. I could of course write results to an array-of-structs if there’s not a more elegant solution.
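A minimal sketch of that fallback, using only Base Julia on ordinary CPU arrays: broadcast once into an array of tuples, then split it into the two destination arrays. The intermediate name `t` is just for illustration, and this has not been checked against CuArrays here, so whether the extra passes are acceptable on the GPU is an open assumption.

```julia
# f returns a tuple of two values per input element
f(x) = (x + 1, x + 2)

xs = 1:10
a = Vector{Int}(undef, 10)
b = similar(a)

# One broadcast produces a vector of 2-tuples (an "array of structs")
t = f.(xs)

# Two further broadcasts copy each tuple field into its own array
a .= first.(t)
b .= last.(t)
```

This costs one temporary array and two extra passes over the data, which is the overhead a true "destructuring" broadcast would avoid.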
Can anyone think of an elegant way to implement this kind of “destructuring” of a broadcast result that will be performant with CuArrays? And please correct me if this is a misconception, but I believe it will not be efficient to just write a for loop when working with CuArrays, unless I want to get low-level and write the kernel myself using CUDAnative.