# How do I transform a mapreduce function to work well with CUDA?

Hello,

I’m playing with CUDA.jl, and I just want to make a mapreduce run faster. Here is a minimal example:

``````julia
using CUDA
using LinearAlgebra

x =  100
N = ((x^2)÷2 - x÷2)
n_hat = rand(3)
A = rand(3,N)
b = rand(N)

for p = 1:10
    @time mapreduce(+, 1:N) do k
        b[k] * cis(dot(n_hat, A[:, k]))
    end
end

d_n_hat = CuArray(n_hat)
d_A = CuArray(A)
d_b = CuArray(b)
for p = 1:10
    @time mapreduce(+, 1:N) do k
        d_b[k] * cis(dot(d_n_hat, d_A[:, k]))
    end
end
``````

I know I can’t just copy and paste my CPU code, because I’m slicing a matrix, but I don’t know how to change my code properly, and all my attempts gave me worse results.
Any suggestions?

https://juliagpu.github.io/CUDA.jl/stable/usage/workflow/#UsageWorkflowScalar

You are not actually running this code on the GPU, and you should not have ignored the warning that CUDA.jl gave you when running it:

``````
Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
``````

It’s impossible to dispatch this `mapreduce` to the GPU implementation, because you’re only giving it a `UnitRange`. You should use `CuArray` inputs instead.
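To make the dispatch issue concrete, here is a minimal sketch (a toy 10-element array, not the original problem), contrasting a reduction written over a plain range with one written over the `CuArray` itself:

```julia
using CUDA
CUDA.allowscalar(false)  # turn accidental scalar indexing into an error

d_b = CuArray(rand(10))

# CPU fallback: 1:10 is a plain host range, so Base's mapreduce runs on the
# CPU and indexes d_b one scalar at a time (errors under allowscalar(false)):
# mapreduce(k -> d_b[k]^2, +, 1:10)

# GPU path: the CuArray itself is the input, so CUDA.jl's mapreduce
# kernel handles the whole reduction on the device:
s = mapreduce(abs2, +, d_b)
```

The rule of thumb: the *iterated* argument decides the dispatch, not the arrays captured in the closure.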

Hmmm… indeed, this time the warning was relevant.

If I have to use 100% array operations, the best I have come up with so far is this:

`sum(transpose(cis.( sum(repeat(d_n_hat, 1, N).*d_A, dims=1) )).*d_b)`

But it uses a lot of memory, which I would like to avoid.
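(A small aside on that one-liner: the `repeat` can be dropped entirely, since broadcasting already virtually extends the length-3 vector along the second dimension. A sketch with toy sizes:)

```julia
using LinearAlgebra

N = 4950
n_hat = rand(3); A = rand(3, N)

# repeat materializes a full 3×N copy of n_hat before the multiply...
v1 = sum(repeat(n_hat, 1, N) .* A, dims=1)

# ...while plain broadcasting extends the 3-vector along dim 2 for free,
# with no intermediate 3×N repeat allocated:
v2 = sum(n_hat .* A, dims=1)

v1 ≈ v2  # same result
```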

You can do better by writing some of it as matrix multiplication:

``````julia
julia> n_hat = rand(3);  A = rand(3, 4950); b = rand(4950);

julia> orig(b, n_hat, A) = mapreduce(+, axes(A,2)) do k
           b[k]*cis(dot(n_hat, A[:,k]))  # @view A[:,k] would help a bit
       end;

julia> second(b, n_hat, A) = sum(transpose(cis.( sum(repeat(n_hat, 1, size(A,2)).*A, dims=1) )).*b);

julia> cast(b, n_hat, A) = sum(transpose(b) .* cis.(n_hat'*A));

julia> @btime orig($b, $n_hat, $A)
144.500 μs (4950 allocations: 541.41 KiB)
2351.495637869573 + 780.3363178842503im

julia> @btime second($b, $n_hat, $A)
92.542 μs (12 allocations: 426.05 KiB)
2351.4956378695733 + 780.3363178842504im

julia> @btime cast($b, $n_hat, $A)  # still more memory than ideal
35.708 μs (4 allocations: 116.22 KiB)
2351.4956378695733 + 780.3363178842505im

julia> @btime $n_hat' * $A;  # all the time is in broadcast and reduction
7.219 μs (2 allocations: 38.77 KiB)

julia> third(b, n_hat, A) = mapreduce((x,y)->x*cis(y), +, transpose(b), (n_hat'*A));

julia> @btime third($b, $n_hat, $A)  # multi-arg mapreduce just calls map, #38558
37.625 μs (4 allocations: 116.22 KiB)
2351.4956378695733 + 780.3363178842505im

julia> using MappedArrays

julia> mapped(b, n_hat, A) = sum(mappedarray((x,y)->x*cis(y), transpose(b), (n_hat'*A)));

julia> @btime mapped($b, $n_hat, $A)  # saves some memory, but little time?
32.875 μs (2 allocations: 38.77 KiB)
2351.495637869573 + 780.3363178842503im
``````

Thank you! I ported your broadcast-and-reduction version to CUDA and it works. I will manage the memory to make sure that everything fits on the card.
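Concretely, the GPU port is a near-verbatim translation of the broadcast form (a sketch with toy sizes; the `d_*` names are the device copies from my first post):

```julia
using CUDA, LinearAlgebra
CUDA.allowscalar(false)

N = 4950
d_n_hat = CuArray(rand(3))
d_A = CuArray(rand(3, N))
d_b = CuArray(rand(N))

# Same expression as the CPU `cast`, but every input is a CuArray, so the
# matvec, the cis broadcast, and the sum all run on the device:
cast_gpu(b, n_hat, A) = sum(transpose(b) .* cis.(n_hat' * A))

cast_gpu(d_b, d_n_hat, d_A)
```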

However, while browsing Discourse I found a topic using `Folds.jl`. Using it in my code gave me a very good speedup.

For anyone interested, here are all the benchmarks. The first result uses `Folds.jl` and the last uses CUDA.

``````julia
using Libdl
Libdl.dlopen("/home/pc2/.julia/artifacts/4670dd02df5210bd53199f14ec9f8cc027d889e0/lib/libcublasLt.so.11")
using CUDA
CUDA.allowscalar(false)

using LinearAlgebra
using Folds
using MappedArrays

orig(b, n_hat, A) = mapreduce(+, axes(A,2)) do k
    b[k]*cis(dot(n_hat, A[:,k]))  # @view A[:,k] would help a bit
end;

orig_folds(b, n_hat, A) = Folds.mapreduce(+, axes(A,2)) do k
    b[k]*cis( n_hat[1]*A[1, k] + n_hat[2]*A[2, k] + n_hat[3]*A[3, k] )
end;
second(b, n_hat, A) = sum(transpose(cis.( sum(repeat(n_hat, 1, size(A,2)).*A, dims=1) )).*b);
cast(b, n_hat, A) = sum(transpose(b) .* cis.(n_hat'*A));
third(b, n_hat, A) = mapreduce((x,y)->x*cis(y), +, transpose(b), (n_hat'*A));
mapped(b, n_hat, A) = sum(mappedarray((x,y)->x*cis(y), transpose(b), (n_hat'*A)));

using BenchmarkTools
function foo(x)
    N = ((x^2)÷2 - x÷2)

    A = rand(3, N)
    b = rand(ComplexF64, N)
    n_hat = rand(3)

    @btime orig_folds($b, $n_hat, $A)
    @btime orig($b, $n_hat, $A)
    @btime second($b, $n_hat, $A)
    @btime cast($b, $n_hat, $A)
    @btime sum(cis.(transpose(($n_hat)' * $A)) .* $b)  ## I converted this expression to CUDA
    @btime third($b, $n_hat, $A)
    @btime mapped($b, $n_hat, $A)

    d_A = CuArray(transpose(A))
    d_b = CuArray(b)
    d_n_hat = CuArray(n_hat)
    @btime sum( cis.($d_A * $d_n_hat) .* $d_b )

    return 0
end

julia> foo(6000)
55.285 ms (65 allocations: 4.22 KiB)
1.701 s (17997000 allocations: 1.88 GiB)
979.636 ms (12 allocations: 1.47 GiB)
374.404 ms (4 allocations: 411.92 MiB)
364.774 ms (4 allocations: 411.92 MiB)
382.585 ms (4 allocations: 411.92 MiB)
314.088 ms (2 allocations: 137.31 MiB)
8.632 ms (113 allocations: 2.88 KiB)
``````

FWIW, `Folds.mapreduce` can run on CUDA with FoldsCUDA.jl, even without `CuArray` input, if you pass the executor object to it:

``````julia
julia> orig_folds(b, n_hat, A, ex = ThreadedEx()) = Folds.mapreduce(+, axes(A,2), ex) do k
           b[k]*cis( n_hat[1]*A[1, k] + n_hat[2]*A[2, k] + n_hat[3]*A[3, k] )
       end;

julia> Threads.nthreads()
1

julia> @btime orig_folds(b, n_hat, A);
532.295 ms (2 allocations: 64 bytes)

julia> @btime CUDA.@sync orig_folds(d_b, d_n_hat, transpose(d_A), CUDAEx());
17.620 ms (12728 allocations: 204.77 KiB)

julia> @btime CUDA.@sync sum( cis.($d_A * $d_n_hat) .* $d_b );
2.843 ms (1089 allocations: 18.00 KiB)
``````

(Side note: adding `@inbounds` does not improve the performance)

But the FoldsCUDA version does not perform well, maybe because it does not implement the reduction in a memory-coalesced form (I’m fixing it).
