# How do I transform a mapreduce function to work well with CUDA?

Hello,

I’m playing with CUDA.jl, and I just want to make a mapreduce run faster. Here is a minimal example:

``````julia
using CUDA
using LinearAlgebra

x =  100
N = ((x^2)÷2 - x÷2)
n_hat = rand(3)
A = rand(3,N)
b = rand(N)

for p = 1:10
    @time mapreduce(+, 1:N) do k
        b[k] * cis(dot(n_hat, A[:, k]))
    end
end

d_n_hat = CuArray(n_hat)
d_A = CuArray(A)
d_b = CuArray(b)
for p = 1:10
    @time mapreduce(+, 1:N) do k
        d_b[k] * cis(dot(d_n_hat, d_A[:, k]))
    end
end
``````

I know I can’t just copy and paste my CPU code, because I’m slicing a matrix, but I don’t know how to change my code properly, and all my attempts gave me worse results.
Any suggestions?

https://juliagpu.github.io/CUDA.jl/stable/usage/workflow/#UsageWorkflowScalar

You are not actually running this code on the GPU, and you should not have ignored the warning that CUDA.jl gave you when running it:

``````
Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
``````

It’s impossible to dispatch this `mapreduce` to the GPU implementation, because you’re only giving it a `UnitRange`. You should use `CuArray` inputs instead.
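To make the dispatch issue concrete, here is a minimal sketch (a toy 10-element array, not the original problem), contrasting a reduction written over a plain range with one written over the `CuArray` itself:

```julia
using CUDA
CUDA.allowscalar(false)  # turn accidental scalar indexing into an error

d_b = CuArray(rand(10))

# CPU fallback: 1:10 is a plain host range, so Base's mapreduce runs on the
# CPU and indexes d_b one scalar at a time (errors under allowscalar(false)):
# mapreduce(k -> d_b[k]^2, +, 1:10)

# GPU path: the CuArray itself is the input, so CUDA.jl's mapreduce
# kernel handles the whole reduction on the device:
s = mapreduce(abs2, +, d_b)
```

The rule of thumb: the *iterated* argument decides the dispatch, not the arrays captured in the closure.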

Hmmm… indeed, this time the warning was relevant.

If I have to use 100% array operations, the best I have come up with so far is this:

`sum(transpose(cis.( sum(repeat(d_n_hat, 1, N).*d_A, dims=1) )).*d_b)`

But it uses a lot of memory, which I would like to avoid.
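(A small aside on that one-liner: the `repeat` can be dropped entirely, since broadcasting already virtually extends the length-3 vector along the second dimension. A sketch with toy sizes:)

```julia
using LinearAlgebra

N = 4950
n_hat = rand(3); A = rand(3, N)

# repeat materializes a full 3×N copy of n_hat before the multiply...
v1 = sum(repeat(n_hat, 1, N) .* A, dims=1)

# ...while plain broadcasting extends the 3-vector along dim 2 for free,
# with no intermediate 3×N repeat allocated:
v2 = sum(n_hat .* A, dims=1)

v1 ≈ v2  # same result
```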

You can do better by writing some of it as matrix multiplication:

``````julia
julia> n_hat = rand(3);  A = rand(3, 4950); b = rand(4950);

julia> orig(b, n_hat, A) = mapreduce(+, axes(A,2)) do k
           b[k]*cis(dot(n_hat, A[:,k]))  # @view A[:,k] would help a bit
       end;

julia> second(b, n_hat, A) = sum(transpose(cis.( sum(repeat(n_hat, 1, size(A,2)).*A, dims=1) )).*b);

julia> cast(b, n_hat, A) = sum(transpose(b) .* cis.(n_hat'*A));

julia> @btime orig($b, $n_hat, $A)
144.500 μs (4950 allocations: 541.41 KiB)
2351.495637869573 + 780.3363178842503im

julia> @btime second($b, $n_hat, $A)
92.542 μs (12 allocations: 426.05 KiB)
2351.4956378695733 + 780.3363178842504im

julia> @btime cast($b, $n_hat, $A)  # still more memory than ideal
35.708 μs (4 allocations: 116.22 KiB)
2351.4956378695733 + 780.3363178842505im

julia> @btime $n_hat' * $A;  # all the time is in broadcast and reduction
7.219 μs (2 allocations: 38.77 KiB)

julia> third(b, n_hat, A) = mapreduce((x,y)->x*cis(y), +, transpose(b), (n_hat'*A));

julia> @btime third($b, $n_hat, $A)  # multi-arg mapreduce just calls map, #38558
37.625 μs (4 allocations: 116.22 KiB)
2351.4956378695733 + 780.3363178842505im

julia> using MappedArrays

julia> mapped(b, n_hat, A) = sum(mappedarray((x,y)->x*cis(y), transpose(b), (n_hat'*A)));

julia> @btime mapped($b, $n_hat, $A)  # saves some memory, but little time?
32.875 μs (2 allocations: 38.77 KiB)
2351.495637869573 + 780.3363178842503im
``````

Thank you! I ported your broadcast-and-reduction version to CUDA and it works. I will manage the memory to make sure that everything fits on the card.
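Concretely, the GPU port is a near-verbatim translation of the broadcast form (a sketch with toy sizes; the `d_*` names are the device copies from my first post):

```julia
using CUDA, LinearAlgebra
CUDA.allowscalar(false)

N = 4950
d_n_hat = CuArray(rand(3))
d_A = CuArray(rand(3, N))
d_b = CuArray(rand(N))

# Same expression as the CPU `cast`, but every input is a CuArray, so the
# matvec, the cis broadcast, and the sum all run on the device:
cast_gpu(b, n_hat, A) = sum(transpose(b) .* cis.(n_hat' * A))

cast_gpu(d_b, d_n_hat, d_A)
```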

However, while browsing Discourse I found a topic using `Folds.jl`. Using it in my code gave me a very good speedup.

For anyone interested, here are all the benchmarks. The first result uses `Folds.jl` and the last uses CUDA.

``````julia
using Libdl
Libdl.dlopen("/home/pc2/.julia/artifacts/4670dd02df5210bd53199f14ec9f8cc027d889e0/lib/libcublasLt.so.11")
using CUDA
CUDA.allowscalar(false)

using LinearAlgebra
using Folds
using MappedArrays

orig(b, n_hat, A) = mapreduce(+, axes(A,2)) do k
    b[k]*cis(dot(n_hat, A[:,k]))  # @view A[:,k] would help a bit
end;

orig_folds(b, n_hat, A) = Folds.mapreduce(+, axes(A,2)) do k
    b[k]*cis( n_hat[1]*A[1, k] + n_hat[2]*A[2, k] + n_hat[3]*A[3, k] )
end;
second(b, n_hat, A) = sum(transpose(cis.( sum(repeat(n_hat, 1, size(A,2)).*A, dims=1) )).*b);
cast(b, n_hat, A) = sum(transpose(b) .* cis.(n_hat'*A));
third(b, n_hat, A) = mapreduce((x,y)->x*cis(y), +, transpose(b), (n_hat'*A));
mapped(b, n_hat, A) = sum(mappedarray((x,y)->x*cis(y), transpose(b), (n_hat'*A)));

using BenchmarkTools
function foo(x)
    N = ((x^2)÷2 - x÷2)

    A = rand(3, N)
    b = rand(ComplexF64, N)
    n_hat = rand(3)

    @btime orig_folds($b, $n_hat, $A)
    @btime orig($b, $n_hat, $A)
    @btime second($b, $n_hat, $A)
    @btime cast($b, $n_hat, $A)
    @btime sum(cis.(transpose(($n_hat)' * $A)) .* $b)  ## I converted this expression to CUDA
    @btime third($b, $n_hat, $A)
    @btime mapped($b, $n_hat, $A)

    d_A = CuArray(transpose(A))
    d_b = CuArray(b)
    d_n_hat = CuArray(n_hat)
    @btime sum( cis.($d_A * $d_n_hat) .* $d_b )

    return 0
end

julia> foo(6000)
55.285 ms (65 allocations: 4.22 KiB)
1.701 s (17997000 allocations: 1.88 GiB)
979.636 ms (12 allocations: 1.47 GiB)
374.404 ms (4 allocations: 411.92 MiB)
364.774 ms (4 allocations: 411.92 MiB)
382.585 ms (4 allocations: 411.92 MiB)
314.088 ms (2 allocations: 137.31 MiB)
8.632 ms (113 allocations: 2.88 KiB)
``````

FWIW, `Folds.mapreduce` can run on CUDA with FoldsCUDA.jl, even without `CuArray` input, if you pass the executor object to it:

``````julia
julia> orig_folds(b, n_hat, A, ex = ThreadedEx()) = Folds.mapreduce(+, axes(A,2), ex) do k
           b[k]*cis( n_hat[1]*A[1, k] + n_hat[2]*A[2, k] + n_hat[3]*A[3, k] )
       end;

julia> Threads.nthreads()
1

julia> @btime orig_folds(b, n_hat, A);
532.295 ms (2 allocations: 64 bytes)

julia> @btime CUDA.@sync orig_folds(d_b, d_n_hat, transpose(d_A), CUDAEx());
17.620 ms (12728 allocations: 204.77 KiB)

julia> @btime CUDA.@sync sum( cis.($d_A * $d_n_hat) .* $d_b );
2.843 ms (1089 allocations: 18.00 KiB)
``````

(Side note: adding `@inbounds` does not improve the performance)

But the FoldsCUDA version does not perform well, maybe because it does not implement the reduction in a memory-coalesced form (I’m fixing it).
