Writing simple kernel functions (EM algorithm)

Hi everyone,

I am trying to get my head around writing GPU kernels in Julia, and it would be great if someone could help me with my example. Without going into too many details, I am trying to run a simple EM (Expectation-Maximization) algorithm for a mixture of normal distributions. It essentially loops over many individuals, calculates some values using simple linear algebra, and spits out some other matrices (which are used later on to calculate the likelihood). The problem with running this on the CPU is that it takes quite a bit of time, as I have to loop over about 20000 individuals and calculate some values for each of them, so doing this on the GPU seems like a good idea. Here is an MWE of what I am trying to do:

using LinearAlgebra

Nobs = 100
nM = 2
nZ = 5

y = rand(Nobs, nZ)
# zero out a few entries to mimic missing observations
y[1, 3] = 0
y[2, 4] = 0
y[2, 5] = 0
y[4, 2] = 0

comp = [findall(x -> x != 0, y[i,:]) for i = 1:Nobs]   # indices of the non-missing entries per person
sigma0 = [Matrix{Float64}(I,nZ,nZ),Matrix{Float64}(I,nZ,nZ)]
mean0 = zeros(2,nZ)

probi = zeros(Nobs, nM)

# per-person, per-component multivariate normal density over the observed entries
for i = 1:Nobs, m = 1:nM
        Q = ((1/(2*pi))^(length(comp[i])/2))*((1/det(sigma0[m][comp[i],comp[i]]))^0.5)
        invsigma = inv(sigma0[m][comp[i],comp[i]])
        probi[i,m] = Q*exp(-0.5*transpose(y[i,comp[i]] - mean0[m,comp[i]])*invsigma*(y[i,comp[i]] - mean0[m,comp[i]]))
end

sigmai1 = [zeros(Nobs, nZ, nZ) for m = 1:nM]
prob = 0.5*ones(Nobs, nM)

# weighted outer product of the residuals, padded back to a full nZ×nZ matrix
for i = 1:Nobs, m = 1:nM
        temp = zeros(nZ,nZ)
        temp[comp[i],comp[i]] = prob[i,m]*(y[i,comp[i]]-mean0[m,comp[i]])*transpose(y[i,comp[i]]-mean0[m,comp[i]])
        sigmai1[m][i,:,:] = temp
end

In the example above, y is the data I load, Nobs is the number of people in the data, which gets very large (around 20000), and nM is always 2 (or at most 3). As some observations may be missing (stored as 0), comp is an array that keeps track of the non-missing values. What takes most of the time are the two loops, one to calculate probi and one to calculate sigmai1. So I was thinking about moving those to the GPU, using CuArrays, CUDAnative, CUDAdrv, etc.

Could someone help me set up the kernel functions for those two loops? I am mostly confused about how to work with arrays of arrays (such as comp).

Thanks in advance!

Myroslav

Change the array of arrays to some ragged array structure such as: TopOpt.jl/utils.jl at master · JuliaTopOpt/TopOpt.jl · GitHub
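For example, something along these lines (just a minimal sketch of the idea, not the exact structure used in that file):

comp_flat = reduce(vcat, comp)          # all observed indices, concatenated
offsets   = cumsum([1; length.(comp)])  # person i owns offsets[i]:offsets[i+1]-1

# components of person i, without any array-of-arrays
comp_i(i) = view(comp_flat, offsets[i]:offsets[i+1]-1)

Both pieces are plain integer vectors, so they can be moved to the GPU as they are.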


Ok, I see, thanks. That way I can get around using arrays of arrays.
Assuming comp is now a rectangular array, y will still have some missing observations, so I need to select y[comp] anyway. How do I write that in a kernel?

Hmm, on second look your code is a bit involved. I think your biggest challenge is getting invsigma and finding the determinant. inv and det are not defined properly for CuArrays. There is, however, a batched inverse function in CuArrays which you can use to pre-compute all the inverses. You may also be able to find the determinants indirectly via some batched decomposition (I haven't looked into it).

Now if your matrices are not too big (nZ not much bigger than 10), you can use StaticArrays.jl, which supports inv and det. And since all you are doing with the inverse is multiplying it by a vector, you can also just use the \ function. The multiplication will also just work with StaticArrays. All of these operations are kernel-friendly because SVector and SMatrix are immutable types, and these operations were implemented for them in a way that doesn't allocate for small matrices.

Note that you will need to make y and mean0 of type CuVector{<:SVector}, such that all the SVectors are the same size. So do your pre-processing to pad with zeros as needed. Also, all the matrices should be of the same size and invertible, so pad the diagonals with ones and the off-diagonals with zeros as needed.
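Putting that together, here is a rough, untested sketch of what the probi loop could look like with padded static arrays and one broadcast per mixture component. All the names here (pad_sigma, gauss, y_s, …) are mine, not an existing API, and whether inv/det/\ actually compile inside the broadcast kernel for your nZ and package versions is something you would have to verify:

using LinearAlgebra, StaticArrays, CuArrays

# Pad the covariance for person i: keep the observed block, put ones on the
# missing diagonal entries and zeros elsewhere, so the matrix stays invertible
# and its determinant equals that of the observed block.
function pad_sigma(S, obs, nZ)
    P = zeros(nZ, nZ)
    P[obs, obs] = S[obs, obs]
    for k in 1:nZ
        k in obs || (P[k, k] = 1.0)
    end
    return SMatrix{nZ,nZ,Float64}(P)
end

# In the MWE mean0 is zero and the missing entries of y are already zero, so the
# residual vanishes on the padded positions; in general pad the residual as well.
y_s  = CuArray([SVector{nZ,Float64}(y[i, :]) for i in 1:Nobs])
np_s = CuArray(Float64.(length.(comp)))        # number of observed entries per person

gauss(yi, mu, S, k) = (1 / (2pi))^(k / 2) / sqrt(det(S)) *
                      exp(-0.5 * dot(yi - mu, S \ (yi - mu)))

probi2 = zeros(Nobs, nM)
for m in 1:nM
    mu_m  = SVector{nZ,Float64}(mean0[m, :])
    sig_m = CuArray([pad_sigma(sigma0[m], comp[i], nZ) for i in 1:Nobs])
    probi2[:, m] = Array(gauss.(y_s, Ref(mu_m), sig_m, np_s))
end

The sigmai1 loop can be written the same way, broadcasting a small function that returns an SMatrix per person.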


I’m curious to know whether you managed to run findall on the GPU? I always get an error, as it returns an array whose size is variable.

You should be able to do this using filter.
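For example, something like this should go through as long as the predicate is pure (untested, and I haven’t checked whether the final selection stays on the device):

A  = zeros(Int, 40); A[10:20] .= 1
cA = cu(A)
filter(x -> x == 1, cA)   # pure predicate, no side effects and no in-kernel rand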

I tried this:

A=zeros(Int,40)
A[10:20].=1
cA=cu(A)
filter(x->x==1||x!=1&&CuArrays.rand(1)<0.5,cA)

It gives the following error:

GPU compilation of #25(CuArrays.CuKernelState, CuDeviceArray{Bool,1,CUDAnative.AS.Global}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},var"#10#11",Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}) failed
KernelError: recursion is currently not supported

Try inspecting the generated code with any of the @device_code_... macros.

Stacktrace:
 [1] _nextind_str at strings/string.jl:140
 [2] nextind at strings/string.jl:136
 [3] _nextind_str at strings/string.jl:140
 [4] _split at strings/util.jl:325
 [5] env_override_minlevel at logging.jl:419
 [6] current_logger_for_env at logging.jl:383
 [7] maybe_initialize at /root/.julia/packages/CUDAnative/2WQzk/src/init.jl:30
 [8] #25 at /root/.julia/packages/GPUArrays/0lvhc/src/broadcast.jl:49

Stacktrace:
 [1] (::CUDAnative.var"#hook_emit_function#100"{CUDAnative.CompilerJob,Array{Core.MethodInstance,1}})(::Core.MethodInstance, ::Core.CodeInfo, ::UInt64) at /root/.julia/packages/CUDAnative/2WQzk/src/compiler/irgen.jl:102
 [2] compile_method_instance(::CUDAnative.CompilerJob, ::Core.MethodInstance, ::UInt64) at /root/.julia/packages/CUDAnative/2WQzk/src/compiler/irgen.jl:149
 [3] macro expansion at /root/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:214 [inlined]
 [4] irgen(::CUDAnative.CompilerJob, ::Core.MethodInstance, ::UInt64) at /root/.julia/packages/CUDAnative/2WQzk/src/compiler/irgen.jl:163
 [5] #codegen#152(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at /root/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:214
 [6] #codegen at ./none:0 [inlined]
 [7] #compile#151(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at /root/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:47
 [8] #compile at ./none:0 [inlined]
 [9] #compile#150 at /root/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:28 [inlined]
 [10] #compile at ./none:0 [inlined] (repeats 2 times)
 [11] macro expansion at /root/.julia/packages/CUDAnative/2WQzk/src/execution.jl:403 [inlined]
 [12] #cufunction#194(::Nothing, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(cufunction), ::GPUArrays.var"#25#26", ::Type{Tuple{CuArrays.CuKernelState,CuDeviceArray{Bool,1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},var"#10#11",Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}) at /root/.julia/packages/CUDAnative/2WQzk/src/execution.jl:368
 [13] cufunction(::Function, ::Type) at /root/.julia/packages/CUDAnative/2WQzk/src/execution.jl:368
 [14] macro expansion at /root/.julia/packages/CUDAnative/2WQzk/src/execution.jl:176 [inlined]
 [15] macro expansion at ./gcutils.jl:91 [inlined]
 [16] macro expansion at /root/.julia/packages/CUDAnative/2WQzk/src/execution.jl:173 [inlined]
 [17] _gpu_call(::CuArrays.CuArrayBackend, ::Function, ::CuArray{Bool,1,Nothing}, ::Tuple{CuArray{Bool,1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},var"#10#11",Tuple{Base.Broadcast.Extruded{CuArray{Float32,1,Nothing},Tuple{Bool},Tuple{Int64}}}}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at /root/.julia/packages/CuArrays/7z7MV/src/gpuarray_interface.jl:62
 [18] gpu_call(::Function, ::CuArray{Bool,1,Nothing}, ::Tuple{CuArray{Bool,1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},var"#10#11",Tuple{Base.Broadcast.Extruded{CuArray{Float32,1,Nothing},Tuple{Bool},Tuple{Int64}}}}}, ::Int64) at /root/.julia/packages/GPUArrays/0lvhc/src/abstract_gpu_interface.jl:151
 [19] gpu_call at /root/.julia/packages/GPUArrays/0lvhc/src/abstract_gpu_interface.jl:128 [inlined]
 [20] copyto! at /root/.julia/packages/GPUArrays/0lvhc/src/broadcast.jl:48 [inlined]
 [21] copyto! at ./broadcast.jl:863 [inlined]
 [22] copy at ./broadcast.jl:839 [inlined]
 [23] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,var"#10#11",Tuple{CuArray{Float32,1,Nothing}}}) at ./broadcast.jl:819
 [24] map(::Function, ::CuArray{Float32,1,Nothing}) at /root/.julia/packages/GPUArrays/0lvhc/src/base.jl:9
 [25] filter(::Function, ::CuArray{Float32,1,Nothing}) at /root/.julia/packages/GPUArrays/0lvhc/src/abstractarray.jl:263
 [26] top-level scope at In[8]:1

Maybe try generating the random numbers into a separate array first.
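Something like this, for example (untested; whether the final selection can stay on the GPU depends on your GPUArrays/CuArrays version, so the last line falls back to the host):

r    = CuArrays.rand(Float32, length(cA))   # one uniform draw per element, generated up front
mask = (cA .== 1) .| (r .< 0.5f0)           # pure broadcast, no rand() call inside the kernel
kept = cu(Array(cA)[Array(mask)])           # select on the host as a safe fallback, then move back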