Hi everyone,

I am trying to get my head around writing GPU kernels in Julia, would be great if someone could help me with my example. Without going into too many details, I am trying to run a simple EM (Expectation-Maximization) algorithm for a mixture of normal distributions, that essentially loops over many individuals, calculates some values using simple linear algebra, and spits out some other matrices (which are used later on to calculate likelihood). The problem with running this on CPU is that it takes quite a bit of time, as I have to loop over about 20000 individuals, and for each of them calculate some values, so doing this on GPU seems like a good idea. Here is a MWE of what I am trying to do

```
using LinearAlgebra
Nobs = 100
nM = 2
nZ = 5
y = rand(Nobs, nZ)
y[1, 3] = 0
y[2, 4] = 0
y[2, 5] = 0
y[4, 2] = 0
comp = [findall(x -> x .!= 0, y[i,:]) for i = 1:Nobs]
sigma0 = [Matrix{Float64}(I,nZ,nZ),Matrix{Float64}(I,nZ,nZ)]
mean0 = zeros(2,nZ)
probi = zeros(Nobs, nM)
for i = 1:Nobs, m = 1:nM
Q = ((1/(2*pi))^(length(comp[i])/2))*((1/det(sigma0[m][comp[i],comp[i]]))^.5)
invsigma = inv(sigma0[m][comp[i],comp[i]])
probi[i,m] = Q*exp(-.5*transpose(y[i,comp[i]] - mean0[m,comp[i]])*invsigma*(y[i,comp[i]] - mean0[m,comp[i]]))
end
sigmai1 = [zeros(Nobs, nZ, nZ) for m = 1:nM]
prob = 0.5*ones(Nobs, nM)
for i = 1:Nobs, m = 1:nM
temp = zeros(nZ,nZ)
temp[comp[i],comp[i]] = prob[i,m]*(y[i,comp[i]]-mean0[m,comp[i]])*transpose(y[i,comp[i]]-mean0[m,comp[i]])
sigmai1[m][i,:,:] = temp
end
```

In the example below `y`

is the data I load, `Nobs`

is the number of people in the data, that gets very large (around 20000), `nM`

is always either 2 (at most or 3). As some data might have missing observations (or 0), `comp`

is an array that keeps a track of non-missing values. What takes plenty of time are the two loops, one to calculate `probi`

and one to calculate `sigmai1`

. So I was thinking about recasting those to the GPU, using `CuArrays, CUDAnative, CUDAdrv`

, etc.

Could someone help on setting up the kernel functions for those two loops? I mostly confused on how to work with arrays of arrays (such as `comp`

).

Thanks in advance!

Myroslav