I have a fairly simple function that iterates over the columns of a 2-dimensional `CuArray`, takes inner products of these columns, and writes them to another matrix. I noticed that running this code on the GPU gave no speed-up, even though simply taking the inner product of two `CuArray`s was much faster on my machine than with `Array`s. I assume that the problem is the views.

See the following MWE:

```
using CUDA
using BenchmarkTools
using LinearAlgebra
function view_multiplication(ψ::AbstractMatrix)
    d1, d2 = size(ψ)
    out = Array{eltype(ψ)}(undef, d2, d2)
    for i in 1:d2
        for j in 1:i
            @inbounds @views out[i,j] = ψ[:,i] ⋅ ψ[:,j]
            @inbounds out[j,i] = out[i,j]'
        end
    end
    return out
end
```

and running this on a `(2^18 x 2^4)` matrix I get on the CPU (the second run, after compilation)

```
ψ = rand(ComplexF32, 2^18, 2^4)
@time view_multiplication(ψ)
0.048713 seconds (1.41 k allocations: 35.547 KiB)
```

and on the GPU (also the second run, after compilation)

```
ψ = CUDA.rand(ComplexF32, 2^18, 2^4)
@time view_multiplication(ψ)
0.226695 seconds (3.94 k allocations: 544.086 MiB, 12.08% gc time)
```

which is much slower than I hoped for and also allocates much more memory. Is there a way to take inner products of views of `CuArray`s without needing to allocate memory for the views?
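
For reference, since `out[i,j] = dot(ψ[:,i], ψ[:,j])`, the function is computing the Gram matrix of the columns, so on the CPU its output should agree with the single product `ψ' * ψ`. A minimal sanity check of that equivalence (CPU only, small sizes; the GPU timings above are not reproduced here):

```
using LinearAlgebra

# Same function as in the MWE above
function view_multiplication(ψ::AbstractMatrix)
    d1, d2 = size(ψ)
    out = Array{eltype(ψ)}(undef, d2, d2)
    for i in 1:d2
        for j in 1:i
            @inbounds @views out[i,j] = ψ[:,i] ⋅ ψ[:,j]
            @inbounds out[j,i] = out[i,j]'
        end
    end
    return out
end

# dot(x, y) conjugates its first argument, so out[i,j] == (ψ' * ψ)[i,j]
ψ = rand(ComplexF32, 64, 8)
out = view_multiplication(ψ)
@assert out ≈ ψ' * ψ
```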