CUDA performing scalar indexing when used along with Distributed

Hello @eldee ,

Thanks a lot for putting in the effort to help me out!

I think I’ve arrived at a solution which, although it seems somewhat memory-inefficient, solves the CPU x GPU parallelization problem:

I created a function to be run with pmap. The problem is that, as far as I understand, pmap passes only one group of elements to the function, so to pass several variables I first need to pack them into an array. See the example below:

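using Distributed  # provides @everywhere and pmap

# pmap passes one element of the collection per call, so unpack the three inputs here.
@everywhere function pmap_calc(elements)
    ele1, ele2, ele3 = elements

    a, b = _myFunction(ele1, ele2, ele3)
    return a, b
end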

# Each entry of x packs the three inputs for one pmap call.
x = Array{Any}(undef, 3, 1)
for i in 1:3
    x[i] = [input1, input2, input3]
end

pmap(pmap_calc, x)
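
A possible alternative, sketched under assumptions: the Distributed documentation describes pmap as accepting several collections and applying the function elementwise, which would avoid the packing step. In the sketch below, _myFunction is a hypothetical stand-in (the real one is not shown here) and the inputs are placeholder integers, so this only illustrates the call shape, not the actual GPU workload.

```julia
using Distributed

# Hypothetical stand-in for _myFunction; the real function is not shown in this post.
@everywhere _myFunction(e1, e2, e3) = (e1 + e2, e2 * e3)

# Placeholder inputs (assumption), repeated three times as in the loop above.
input1, input2, input3 = 1, 2, 3
xs = fill(input1, 3)
ys = fill(input2, 3)
zs = fill(input3, 3)

# One collection per argument; pmap applies the function elementwise,
# so no wrapper function or packed array is needed.
results = pmap(_myFunction, xs, ys, zs)
```

Each entry of results is the (a, b) tuple returned for one set of inputs. Whether this is lighter on memory than the packed-array version would need to be measured on the real data.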

If anyone knows a better way to handle at least this pmap call (or an alternative solution), that would be great!

Thanks a lot!