CUDA performing scalar indexing when used along with Distributed

Hi Paulo, you’re welcome.

Could you add some more information to the code: declarations of input1 etc., a (dummy) implementation of _myFunction, …? See also point 4 in this PSA. It’s hard to see how something can be improved when it’s not clear what is concretely going on :slight_smile:.

I’m also not sure what the intent is here. Why is size(x) == (3, 1) instead of just (3,)? By the way, note that you don’t need the splatting ...: (undef, (3, 1)) and (undef, 3, 1) (which is what the splatting results in) are equivalent. Is it intended that x[1] == x[2] == x[3]? And why Array{Any} rather than something a bit more concrete like Vector{NTuple{3, CuArray}}?
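To make the constructor point concrete, here’s a small sketch (the names a, b and c are just placeholders for illustration, not taken from your code):

using CUDA

# the tuple form and the splatted form construct the same array, so the ... is not needed
a = Array{Any}(undef, (3, 1))
b = Array{Any}(undef, 3, 1)
size(a) == size(b)  # true, both are 3×1

# a possibly more concrete container, assuming each element is a tuple of three CuArrays
c = Vector{NTuple{3, CuArray}}(undef, 3)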

The second argument to (p)map does not need to be a Vector; it could also be, e.g., a Tuple or a generator, which might help with the inefficient memory usage you mention. For example:

using Distributed
addprocs(2)
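# make the packages available on every process (the master and the two workers just added)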
@everywhere begin
    using CUDA 
    using Statistics: mean
end

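# the CuArrays in the generator are created on the master (myid() == 1); pmap sends each one to a worker, which computes the mean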
pmap(mean, (CUDA.rand(2) .+ myid() for i = 1:3))
#=
3-element Vector{Float32}:
 1.9031491
 1.707502
 1.1796367
=#

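# same idea, but now each worker adds its own id (2 or 3) to the mean it computes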
pmap(x -> myid() + mean(x), (CUDA.rand(2) for i = 1:3))
#=
3-element Vector{Float32}:
 3.316328
 2.5876007
 2.284222
=#

Note also that this example shows the CUDA data being generated by the master process (myid() == 1) and sent over to the worker processes without any issues.
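If you want to check that explicitly, something along these lines should do (reusing the setup above; the named fields are just for illustration):

pmap(x -> (worker = myid(), m = mean(x)), (CUDA.rand(2) for i = 1:3))
# every returned worker field should be 2 or 3, even though each CuArray was created on process 1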