I am trying to use CUDA to speed up the process of finding the exponential of a matrix.

For some reason, all the steps are fast except the code written inside a loop

```
using LinearAlgebra
using SparseArrays
using CUDA
using CUDA.CUSPARSE
n = 15_000;
A = sprand(n,n,6/n);
A_gpu = CuArray(A)
function expCU(A_gpu::CuArray{Float64,2};threshold=1e-6)
rows = LinearAlgebra.checksquare(A_gpu);
P = spzeros(rows,rows);
P[diagind(P)] = ones(rows);
mat_norm=norm(A_gpu,Inf);
# Scaling to ensure matrix doesn't blow up ensuring significant speed up
scaling_factor = nextpow(2,mat_norm);
A_gpu = A_gpu./scaling_factor;
P_gpu = CuArray{Float64,2}(P);
next_term = copy(P_gpu);
n = 1;
delta=norm(next_term,Inf);
nonzero_tol = 1e-12;
while delta>threshold
next_term=(1/n)*A_gpu*next_term;
# timing this step
@time delta=norm(next_term,Inf);
P_gpu = P_gpu.+ next_term; n=n+1;
end
# Squaring of P to generate correct P
for n=1:log2(scaling_factor)
P_gpu=P_gpu*P_gpu;
end
return P_gpu
end
expCU(A_gpu);
```

I have found that the bottleneck is the step timed in the above code. The output for a specific run is

```
0.375602 seconds (78 allocations: 3.281 KiB)
0.359730 seconds (78 allocations: 3.281 KiB)
0.359744 seconds (78 allocations: 3.281 KiB)
0.359743 seconds (78 allocations: 3.281 KiB)
0.359767 seconds (78 allocations: 3.281 KiB)
0.363671 seconds (78 allocations: 3.281 KiB)
0.363688 seconds (78 allocations: 3.281 KiB)
0.363686 seconds (78 allocations: 3.281 KiB)
0.359806 seconds (78 allocations: 3.281 KiB)
0.363658 seconds (78 allocations: 3.281 KiB)
0.363692 seconds (78 allocations: 3.281 KiB)
0.363676 seconds (78 allocations: 3.281 KiB)
0.359748 seconds (78 allocations: 3.281 KiB)
```

Now when I try running the step as is on my terminal (REPL) I get the following time

```
0.029465 seconds (79 allocations: 3.297 KiB)
```

This means that this step is significantly slower when inside the loop for whatever reason. When I try timing the steps without the loop (manually) it is pretty fast. I am pretty new to GPU computing so I’d be really grateful if someone can explain what I am doing wrong. Thanks!