 Writing simple kernel functions (EM algorithm)

Hi everyone,

I am trying to get my head around writing GPU kernels in Julia, and it would be great if someone could help me with my example. Without going into too many details, I am running a simple EM (Expectation-Maximization) algorithm for a mixture of normal distributions: it loops over many individuals, calculates some values using simple linear algebra, and produces some matrices (which are used later to compute the likelihood). Running this on the CPU takes quite a bit of time, as I have to loop over about 20000 individuals and calculate these values for each of them, so doing this on the GPU seems like a good idea. Here is an MWE of what I am trying to do:

```julia
using LinearAlgebra

Nobs = 100
nM = 2
nZ = 5

y = rand(Nobs, nZ)
y[1, 3] = 0
y[2, 4] = 0
y[2, 5] = 0
y[4, 2] = 0

# indices of the non-missing (non-zero) components for each individual
comp = [findall(x -> x != 0, y[i, :]) for i = 1:Nobs]
sigma0 = [Matrix{Float64}(I, nZ, nZ), Matrix{Float64}(I, nZ, nZ)]
mean0 = zeros(nM, nZ)

probi = zeros(Nobs, nM)

for i = 1:Nobs, m = 1:nM
    Q = (1/(2*pi))^(length(comp[i])/2) * (1/det(sigma0[m][comp[i], comp[i]]))^0.5
    invsigma = inv(sigma0[m][comp[i], comp[i]])
    d = y[i, comp[i]] - mean0[m, comp[i]]
    probi[i, m] = Q * exp(-0.5 * transpose(d) * invsigma * d)
end

sigmai1 = [zeros(Nobs, nZ, nZ) for m = 1:nM]
prob = 0.5 * ones(Nobs, nM)

for i = 1:Nobs, m = 1:nM
    temp = zeros(nZ, nZ)
    d = y[i, comp[i]] - mean0[m, comp[i]]
    temp[comp[i], comp[i]] = prob[i, m] * d * transpose(d)
    sigmai1[m][i, :, :] = temp
end
```

In the example above, `y` is the data I load, `Nobs` is the number of people in the data, which gets very large (around 20000), and `nM` is always either 2 or at most 3. As some data might have missing observations (or 0), `comp` is an array that keeps track of the non-missing values. What takes plenty of time are the two loops, one to calculate `probi` and one to calculate `sigmai1`. So I was thinking about recasting those to the GPU, using `CuArrays`, `CUDAnative`, `CUDAdrv`, etc.

Could someone help with setting up the kernel functions for those two loops? I am mostly confused about how to work with arrays of arrays (such as `comp`).

Myroslav

Change the array of arrays to a ragged array structure, such as: https://github.com/mohamed82008/TopOpt.jl/blob/master/src/Utilities/utils.jl#L37
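For illustration, here is a minimal toy sketch of one such ragged layout (my own version, not the linked TopOpt.jl code): flatten all the index vectors into one vector plus an offsets vector, which uses only plain arrays and is therefore GPU-friendly.

```julia
# Toy ragged layout: one flat index vector plus row offsets,
# instead of a Vector{Vector{Int}} like `comp`.
comp = [[1, 2, 4, 5], [1, 2, 3], [1, 2, 3, 4, 5]]

flat    = reduce(vcat, comp)            # all indices, concatenated
offsets = cumsum([1; length.(comp)])    # offsets[i] = start of row i in `flat`

# recover the indices of individual i without nested arrays:
row(i) = view(flat, offsets[i]:offsets[i+1]-1)
```

Inside a kernel, thread `i` would read `flat[offsets[i]]` through `flat[offsets[i+1]-1]` directly.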


OK, I see, thanks. That way I can avoid using arrays of arrays.
Assuming `comp` is now a rectangular array, `y` will still have some missing observations, so I need to select `y[comp]` anyway. How do I write that in a kernel?

Hmm, on second look your code is a bit involved. I think your biggest challenge is computing `invsigma` and finding the determinant: `inv` and `det` are not defined properly for `CuArray`s. There is, however, a batched inverse function in `CuArrays` which you can use to pre-compute all the inverses. You may also be able to find the determinants indirectly via some batched decomposition (I haven't looked into it).

Now, if your matrix is not too big (`nZ` not much bigger than 10), you can use StaticArrays.jl, which supports `inv` and `det`. And since all you are doing with the inverse is multiplying it by a vector, you can also just use the `\` function. The multiplication will also just work with `StaticArrays`. All of these operations are kernel-friendly because `SVector` and `SMatrix` are immutable types, and these operations were implemented for them in a way that doesn't allocate for small matrices.
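To make that concrete, here is a rough sketch of the per-`(i, m)` computation with static arrays (assuming StaticArrays.jl is installed; the 2×2 values are stand-ins for `sigma0[m][comp[i],comp[i]]` and `y[i,comp[i]] - mean0[m,comp[i]]`):

```julia
using LinearAlgebra, StaticArrays

# Stand-ins for the covariance block and the centered data vector
S = @SMatrix [2.0 0.5; 0.5 1.0]
d = @SVector [0.3, -0.2]

Q = (1/(2*pi))^(length(d)/2) * (1/det(S))^0.5
p = Q * exp(-0.5 * dot(d, S \ d))   # `\` avoids forming inv(S) explicitly
```

Because `S` and `d` are immutable and their sizes are in the type, `det`, `\`, and `dot` compile down to straight-line arithmetic, which is exactly what a GPU kernel wants.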

Note that you will need to make `y` and `mean0` of type `CuVector{<:SVector}`, such that all the `SVector`s are the same size. So do your pre-processing to pad with zeros as needed. Also, all the matrices should be of the same size and invertible, so pad the diagonals with ones and the off-diagonals with zeros as needed.
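A sketch of that padding step on the CPU side, for a single individual (names like `mask` are mine; the real code would loop over all `i` and `m`):

```julia
using LinearAlgebra, StaticArrays

nZ   = 5
mask = [true, true, false, false, false]   # observed components for person i
yi   = [1.0, 2.0]                          # the corresponding data values
Si   = [2.0 0.5; 0.5 1.0]                  # covariance block on those components

# pad the data vector with zeros to the fixed length nZ
ypad = zeros(nZ)
ypad[mask] .= yi
ysv = SVector{nZ}(ypad)

# pad the covariance: ones on the missing diagonal entries, zeros off-diagonal,
# so the padded matrix stays invertible and the determinant is unchanged
Spad = Matrix{Float64}(I, nZ, nZ)
Spad[mask, mask] .= Si
Ssm = SMatrix{nZ, nZ}(Spad)
```

Since the padded block is an identity, `det(Ssm) == det(Si)`, so the padded matrices can be used directly in the `probi` formula for fully observed rows.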
