Hello!

I have recently ported a MATLAB package to Julia (and found the process surprisingly straightforward). However, I now wish to run it on the GPU, as I want to use it as part of a loss function in an ML project. Currently it runs on the GPU very slowly, with considerable allocations. Since the code was ported quite directly from MATLAB, I'm sure there are better, more GPU-friendly ways to structure it, and I was hoping for some guidance. I have reduced my code to what I think are the key issues below:

```julia
using LinearAlgebra, Statistics

function calc_mi(k, X, Y)
    input = [X[1:end-1, k] Y[2:end, :]]   # hcat: column k of X alongside shifted Y
    Dx = size(X, 2)
    rho = cor(input)
    mi = 0.5 * (logdet(rho[1:Dx, 1:Dx]) + logdet(rho[Dx+1:end, Dx+1:end]) - logdet(rho))
    return mi
end

function calc_ψ(X, Y)
    mapreduce(k -> calc_mi(k, X, Y), +, 1:size(X, 2))
end
```

Obviously scalar indexing is a key issue, but I'm not sure how to convert it to a broadcast call, since I'm trying to extract submatrices rather than individual values.
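To illustrate the direction I had in mind, here is a sketch that at least avoids copying the `rho` sub-blocks out by using views (`calc_mi_views` is my own naming, and I have only tried this on the CPU; I'm not sure how it behaves on a `CuArray`):

```julia
using LinearAlgebra, Statistics

# Sketch: take the blocks of rho as lazy views instead of slices, so the
# submatrices are never copied out (logdet accepts any AbstractMatrix).
function calc_mi_views(k, X, Y)
    input = [X[1:end-1, k] Y[2:end, :]]   # the hcat still allocates here
    Dx = size(X, 2)
    rho = cor(input)
    A = @view rho[1:Dx, 1:Dx]
    B = @view rho[Dx+1:end, Dx+1:end]
    return 0.5 * (logdet(A) + logdet(B) - logdet(rho))
end
```

This doesn't touch the `hcat` allocation, though, only the four slices taken from `rho`.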

One other issue that I'm not sure how to deal with is the first line of the `calc_mi` function (which is essentially a call to `hcat`, I think), which will allocate on each call. I looked at how I might use views to avoid the allocation, found a few almost-helpful packages (such as [CatViews.jl](https://github.com/ahwillia/CatViews.jl)), and tried to roll my own struct to handle indexing into two views, but ultimately felt I might be going down the wrong path.
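For context, this is roughly the in-place variant I was circling around (hypothetical: `calc_mi!` and the preallocated `buf` are my own naming): refill one buffer every call instead of `hcat`-ing a fresh matrix.

```julia
using LinearAlgebra, Statistics

# Sketch: reuse a preallocated buffer of size (size(X,1)-1, 1 + size(Y,2))
# instead of allocating a new hcat result on every call.
function calc_mi!(buf, k, X, Y)
    n = size(X, 1) - 1
    copyto!(view(buf, :, 1), view(X, 1:n, k))                  # column k of X
    copyto!(view(buf, :, 2:size(buf, 2)), view(Y, 2:n+1, :))   # shifted Y block
    Dx = size(X, 2)
    rho = cor(buf)
    return 0.5 * (logdet(rho[1:Dx, 1:Dx]) + logdet(rho[Dx+1:end, Dx+1:end]) - logdet(rho))
end
```

It removes the per-call `hcat`, but `cor` and the slices of `rho` still allocate, so I wasn't sure it gains much.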

I also found the Folds and FoldsCUDA libraries, but when trying to use the `CUDAEx()` executor I got lots of errors about dynamic method calls (I'm guessing it tries to compile the body of my mapreduce function into a CUDA kernel, or something along those lines?).
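One restructuring I've been considering instead (hypothetical naming, and it assumes the X block for each `k` really is the single 1×1 correlation entry, whose logdet is 0): since correlation is pairwise, I can compute one correlation matrix for all columns up front and take per-`k` sub-blocks of it, so `cor` runs once rather than once per column — which seems like it should suit the GPU better than many small kernel launches:

```julia
using LinearAlgebra, Statistics

# Sketch: one big cor() call instead of size(X, 2) small ones.
# cor is pairwise, so any sub-block of R equals cor of those columns.
function calc_ψ_batched(X, Y)
    Dx = size(X, 2)
    R = cor([X[1:end-1, :] Y[2:end, :]])
    yidx = Dx+1:size(R, 1)
    ldY = logdet(R[yidx, yidx])            # Y-block term, shared across all k
    return sum(1:Dx) do k
        idx = [k; yidx]
        0.5 * (ldY - logdet(R[idx, idx]))  # the 1×1 X block contributes logdet 0
    end
end
```

I haven't verified this matches my original code on the GPU, only on small CPU examples.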

Any guidance greatly appreciated