Thank you, I used your broadcast and reduction into CUDA and it works. I will manage the memory to make sure that everything fits on the card.

However, I was looking in discourse, and I found a topic using `Folds.jl`

. Then, using into my code, it gave me a very good speed up.

For anyone interested, here all benchmarks. The first result is has `Folds.jl`

and the last with CUDA.

```
using Libdl
Libdl.dlopen("/home/pc2/.julia/artifacts/4670dd02df5210bd53199f14ec9f8cc027d889e0/lib/libcublasLt.so.11")
using CUDA
CUDA.allowscalar(false)
using LinearAlgebra
using Folds
using MappedArrays
orig(b, n_hat, A) = mapreduce(+, axes(A,2)) do k
b[k]*cis(dot(n_hat, A[:,k])) # @view A[:,k] would help a bit
end;
orig_folds(b, n_hat, A) = Folds.mapreduce(+, axes(A,2)) do k
b[k]*cis( n_hat[1]*A[1, k] + n_hat[2]*A[2, k] + n_hat[3]*A[3, k] )
end;
second(b, n_hat, A) = sum(transpose(cis.( sum(repeat(n_hat, 1, size(A,2)).*A, dims=1) )).*b);
cast(b, n_hat, A) = sum(transpose(b) .* cis.(n_hat'*A));
third(b, n_hat, A) = mapreduce((x,y)->x*cis(y), +, transpose(b), (n_hat'*A));
mapped(b, n_hat, A) = sum(mappedarray((x,y)->x*cis(y), transpose(b), (n_hat'*A)));
using BenchmarkTools
function foo(x)
N = ((x^2)÷2 - x÷2)
A = rand(3,N)
b = rand(ComplexF64, N)
n_hat = rand(3)
@btime orig_folds($b, $n_hat, $A)
@btime orig($b, $n_hat, $A)
@btime second($b, $n_hat, $A)
@btime cast($b, $n_hat, $A)
@btime sum(cis.(transpose(($n_hat)' * $A)).*$b) ## I converted this expression to CUDA
@btime third($b, $n_hat, $A)
@btime mapped($b, $n_hat, $A)
d_A = CuArray(transpose(A))
d_b = CuArray(b)
d_n_hat = CuArray(n_hat)
@btime sum( cis.($d_A *$d_n_hat).*$d_b )
return 0
end
julia> foo(6000)
55.285 ms (65 allocations: 4.22 KiB)
1.701 s (17997000 allocations: 1.88 GiB)
979.636 ms (12 allocations: 1.47 GiB)
374.404 ms (4 allocations: 411.92 MiB)
364.774 ms (4 allocations: 411.92 MiB)
382.585 ms (4 allocations: 411.92 MiB)
314.088 ms (2 allocations: 137.31 MiB)
8.632 ms (113 allocations: 2.88 KiB
```