CUDA performing scalar indexing when used along with Distributed

Hi Paulo, you’re welcome.

Could you add some more information to the code: declarations of input1 etc., a (dummy) implementation of _myFunction, …? See also point 4 in this PSA. It’s hard to see how something can be improved when it’s not clear what is concretely going on :slight_smile:.

I’m also not sure what the intent is here. Why is size(x) == (3, 1) instead of just (3,)? By the way, note that you don’t need the splatting ...: (undef, (3, 1)) and (undef, 3, 1) (which is what the splatting results in) are equivalent. Is it intended that x[1] == x[2] == x[3]? And why Array{Any} rather than something a bit more concrete like Vector{NTuple{3, CuArray}}?
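To make the constructor point concrete, here’s a small sketch (the names a, b and c are just placeholders for illustration, not taken from your code):

using CUDA

# the tuple form and the splatted form construct the same array, so the ... is not needed
a = Array{Any}(undef, (3, 1))
b = Array{Any}(undef, 3, 1)
size(a) == size(b)  # true, both are 3×1

# a possibly more concrete container, assuming each element is a tuple of three CuArrays
c = Vector{NTuple{3, CuArray}}(undef, 3)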

The second argument to (p)map does not need to be a Vector; it could also be, e.g., a Tuple or a generator, which might help with the inefficient memory usage you mention. For example:

using Distributed
addprocs(2)
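# make the packages available on every process (the master and the two workers just added)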
@everywhere begin
    using CUDA 
    using Statistics: mean
end

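# the CuArrays in the generator are created on the master (myid() == 1); pmap sends each one to a worker, which computes the mean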
pmap(mean, (CUDA.rand(2) .+ myid() for i = 1:3))
#=
3-element Vector{Float32}:
 1.9031491
 1.707502
 1.1796367
=#

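# same idea, but now each worker adds its own id (2 or 3) to the mean it computes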
pmap(x -> myid() + mean(x), (CUDA.rand(2) for i = 1:3))
#=
3-element Vector{Float32}:
 3.316328
 2.5876007
 2.284222
=#

Note also that this example shows the CUDA data being generated by the master process (myid() == 1) and sent over to the worker processes without any issues.
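If you want to check that explicitly, something along these lines should do (reusing the setup above; the named fields are just for illustration):

pmap(x -> (worker = myid(), m = mean(x)), (CUDA.rand(2) for i = 1:3))
# every returned worker field should be 2 or 3, even though each CuArray was created on process 1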