Hi Paulo, you’re welcome.
Could you add some more information to the code: declarations of `input1` etc., a (dummy) implementation of `_myFunction`, …? See also point 4 in this PSA. It’s hard to see how something can be improved when you’re not sure what is concretely going on.
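Something along these lines would already make the question much easier to work with; note that every type, size, and the body of `_myFunction` below is just a placeholder guess on my part:

```julia
using CUDA

# Placeholder inputs: the actual element types and sizes are unknown to me
input1 = CUDA.rand(4)
input2 = CUDA.rand(4)
input3 = CUDA.rand(4)

# Dummy stand-in for _myFunction, only so the snippet runs end to end
_myFunction(inputs) = sum(sum, inputs)

# Is this roughly what x looks like in your code?
x = [(input1, input2, input3) for _ = 1:3]
```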
I’m also not sure what the intent is here. Why is `size(x) == (3, 1)`, instead of just `(3,)`? By the way, note that you don’t need the splatting `...`: `(undef, (3, 1))` and `(undef, 3, 1)` (what the splatting results in) are equivalent. Is it intended that `x[1] == x[2] == x[3]`? Why `Array{Any}` and not something a bit more concrete, like `Vector{NTuple{3, CuArray}}`?
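For instance (assuming CUDA.jl is loaded so that `CuArray` is available):

```julia
using CUDA

a = Array{Any}(undef, (3, 1))  # passing the size tuple directly, no splatting
b = Array{Any}(undef, 3, 1)    # what splatting the tuple results in
size(a) == size(b) == (3, 1)   # true: the two constructors are equivalent

# A more concrete container, if each element really is a 3-tuple of CuArrays:
v = Vector{NTuple{3, CuArray}}(undef, 3)
size(v)  # (3,)
```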
The second argument to `(p)map` does not need to be a `Vector`, but could also be e.g. a `Tuple` or a generator, which might help with the inefficient memory usage you mention. For example:
```julia
using Distributed
addprocs(2)
@everywhere begin
using CUDA
using Statistics: mean
end
pmap(mean, (CUDA.rand(2) .+ myid() for i = 1:3))
#=
3-element Vector{Float32}:
1.9031491
1.707502
1.1796367
=#
pmap(x -> myid() + mean(x), (CUDA.rand(2) for i = 1:3))
#=
3-element Vector{Float32}:
3.316328
2.5876007
2.284222
=#
```
Note also that this example shows that the CUDA data is generated here by the master process (`myid() == 1`) and sent over to the other processes without any issues.