Writing simple kernel functions (EM algorithm)

Hi everyone,

I am trying to get my head around writing GPU kernels in Julia, and it would be great if someone could help me with my example. Without going into too many details, I am trying to run a simple EM (Expectation-Maximization) algorithm for a mixture of normal distributions. It essentially loops over many individuals, calculates some values using simple linear algebra, and spits out some other matrices (which are used later on to calculate the likelihood). The problem with running this on the CPU is that it takes quite a bit of time, as I have to loop over about 20000 individuals and calculate some values for each of them, so doing this on the GPU seems like a good idea. Here is an MWE of what I am trying to do:

using LinearAlgebra

Nobs = 100
nM = 2
nZ = 5

y = rand(Nobs, nZ)
# zero out a few entries to mimic missing observations
y[1, 3] = 0
y[2, 4] = 0
y[2, 5] = 0
y[4, 2] = 0

comp = [findall(x -> x != 0, y[i,:]) for i = 1:Nobs]   # indices of the non-missing entries per person
sigma0 = [Matrix{Float64}(I,nZ,nZ),Matrix{Float64}(I,nZ,nZ)]
mean0 = zeros(2,nZ)

probi = zeros(Nobs, nM)

# per-person, per-component multivariate normal density over the observed entries
for i = 1:Nobs, m = 1:nM
        Q = ((1/(2*pi))^(length(comp[i])/2))*((1/det(sigma0[m][comp[i],comp[i]]))^0.5)
        invsigma = inv(sigma0[m][comp[i],comp[i]])
        probi[i,m] = Q*exp(-0.5*transpose(y[i,comp[i]] - mean0[m,comp[i]])*invsigma*(y[i,comp[i]] - mean0[m,comp[i]]))
end

sigmai1 = [zeros(Nobs, nZ, nZ) for m = 1:nM]
prob = 0.5*ones(Nobs, nM)

# weighted outer product of the residuals, padded back to a full nZ×nZ matrix
for i = 1:Nobs, m = 1:nM
        temp = zeros(nZ,nZ)
        temp[comp[i],comp[i]] = prob[i,m]*(y[i,comp[i]]-mean0[m,comp[i]])*transpose(y[i,comp[i]]-mean0[m,comp[i]])
        sigmai1[m][i,:,:] = temp
end

In the example above, y is the data I load, Nobs is the number of people in the data, which gets very large (around 20000), and nM is always 2 (or at most 3). As some observations may be missing (stored as 0), comp is an array that keeps track of the non-missing values. What takes most of the time are the two loops, one to calculate probi and one to calculate sigmai1. So I was thinking about moving those to the GPU, using CuArrays, CUDAnative, CUDAdrv, etc.

Could someone help me set up the kernel functions for those two loops? I am mostly confused about how to work with arrays of arrays (such as comp).

Thanks in advance!

Myroslav

Change the array of arrays to some ragged array structure such as: TopOpt.jl/utils.jl at master · JuliaTopOpt/TopOpt.jl · GitHub
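For example, something along these lines (just a minimal sketch of the idea, not the exact structure used in that file):

comp_flat = reduce(vcat, comp)          # all observed indices, concatenated
offsets   = cumsum([1; length.(comp)])  # person i owns offsets[i]:offsets[i+1]-1

# components of person i, without any array-of-arrays
comp_i(i) = view(comp_flat, offsets[i]:offsets[i+1]-1)

Both pieces are plain integer vectors, so they can be moved to the GPU as they are.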


Ok, I see, thanks. That way I can get around using arrays of arrays.
Assuming comp is now a rectangular array, y will still have some missing observations, so I need to select y[comp] anyway. How do I write that in a kernel?

Hmm, on second look your code is a bit involved. I think your biggest challenge is getting invsigma and finding the determinant. inv and det are not defined properly for CuArrays. There is, however, a batched inverse function in CuArrays which you can use to pre-compute all the inverses. You may also be able to find the determinants indirectly via some batched decomposition (I haven't looked into it).

Now if your matrices are not too big (nZ not much bigger than 10), you can use StaticArrays.jl, which supports inv and det. And since all you are doing with the inverse is multiplying it by a vector, you can also just use the \ function. The multiplication will also just work with StaticArrays. All of these operations are kernel-friendly because SVector and SMatrix are immutable types, and these operations were implemented for them in a way that doesn't allocate for small matrices.

Note that you will need to make y and mean0 of type CuVector{<:SVector}, such that all the SVectors are the same size. So do your pre-processing to pad with zeros as needed. Also, all the matrices should be of the same size and invertible, so pad the diagonals with ones and the off-diagonals with zeros as needed.
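Putting that together, here is a rough, untested sketch of what the probi loop could look like with padded static arrays and one broadcast per mixture component. All the names here (pad_sigma, gauss, y_s, …) are mine, not an existing API, and whether inv/det/\ actually compile inside the broadcast kernel for your nZ and package versions is something you would have to verify:

using LinearAlgebra, StaticArrays, CuArrays

# Pad the covariance for person i: keep the observed block, put ones on the
# missing diagonal entries and zeros elsewhere, so the matrix stays invertible
# and its determinant equals that of the observed block.
function pad_sigma(S, obs, nZ)
    P = zeros(nZ, nZ)
    P[obs, obs] = S[obs, obs]
    for k in 1:nZ
        k in obs || (P[k, k] = 1.0)
    end
    return SMatrix{nZ,nZ,Float64}(P)
end

# In the MWE mean0 is zero and the missing entries of y are already zero, so the
# residual vanishes on the padded positions; in general pad the residual as well.
y_s  = CuArray([SVector{nZ,Float64}(y[i, :]) for i in 1:Nobs])
np_s = CuArray(Float64.(length.(comp)))        # number of observed entries per person

gauss(yi, mu, S, k) = (1 / (2pi))^(k / 2) / sqrt(det(S)) *
                      exp(-0.5 * dot(yi - mu, S \ (yi - mu)))

probi2 = zeros(Nobs, nM)
for m in 1:nM
    mu_m  = SVector{nZ,Float64}(mean0[m, :])
    sig_m = CuArray([pad_sigma(sigma0[m], comp[i], nZ) for i in 1:Nobs])
    probi2[:, m] = Array(gauss.(y_s, Ref(mu_m), sig_m, np_s))
end

The sigmai1 loop can be written the same way, broadcasting a small function that returns an SMatrix per person.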


I’m curious to know whether you managed to run findall on the GPU? I always get an error, as it returns an array whose size is variable.

You should be able to do this using filter.
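For example, something like this should go through as long as the predicate is pure (untested, and I haven’t checked whether the final selection stays on the device):

A  = zeros(Int, 40); A[10:20] .= 1
cA = cu(A)
filter(x -> x == 1, cA)   # pure predicate, no side effects and no in-kernel rand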

I tried this:

A=zeros(Int,40)
A[10:20].=1
cA=cu(A)
filter(x->x==1||x!=1&&CuArrays.rand(1)<0.5,cA)

It gives the following error:

GPU compilation of #25(CuArrays.CuKernelState, CuDeviceArray{Bool,1,CUDAnative.AS.Global}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},var"#10#11",Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}) failed
KernelError: recursion is currently not supported

Try inspecting the generated code with any of the @device_code_... macros.

Stacktrace:
 [1] _nextind_str at strings/string.jl:140
 [2] nextind at strings/string.jl:136
 [3] _nextind_str at strings/string.jl:140
 [4] _split at strings/util.jl:325
 [5] env_override_minlevel at logging.jl:419
 [6] current_logger_for_env at logging.jl:383
 [7] maybe_initialize at /root/.julia/packages/CUDAnative/2WQzk/src/init.jl:30
 [8] #25 at /root/.julia/packages/GPUArrays/0lvhc/src/broadcast.jl:49

Stacktrace:
 [1] (::CUDAnative.var"#hook_emit_function#100"{CUDAnative.CompilerJob,Array{Core.MethodInstance,1}})(::Core.MethodInstance, ::Core.CodeInfo, ::UInt64) at /root/.julia/packages/CUDAnative/2WQzk/src/compiler/irgen.jl:102
 [2] compile_method_instance(::CUDAnative.CompilerJob, ::Core.MethodInstance, ::UInt64) at /root/.julia/packages/CUDAnative/2WQzk/src/compiler/irgen.jl:149
 [3] macro expansion at /root/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:214 [inlined]
 [4] irgen(::CUDAnative.CompilerJob, ::Core.MethodInstance, ::UInt64) at /root/.julia/packages/CUDAnative/2WQzk/src/compiler/irgen.jl:163
 [5] #codegen#152(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at /root/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:214
 [6] #codegen at ./none:0 [inlined]
 [7] #compile#151(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at /root/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:47
 [8] #compile at ./none:0 [inlined]
 [9] #compile#150 at /root/.julia/packages/CUDAnative/2WQzk/src/compiler/driver.jl:28 [inlined]
 [10] #compile at ./none:0 [inlined] (repeats 2 times)
 [11] macro expansion at /root/.julia/packages/CUDAnative/2WQzk/src/execution.jl:403 [inlined]
 [12] #cufunction#194(::Nothing, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(cufunction), ::GPUArrays.var"#25#26", ::Type{Tuple{CuArrays.CuKernelState,CuDeviceArray{Bool,1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},var"#10#11",Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}) at /root/.julia/packages/CUDAnative/2WQzk/src/execution.jl:368
 [13] cufunction(::Function, ::Type) at /root/.julia/packages/CUDAnative/2WQzk/src/execution.jl:368
 [14] macro expansion at /root/.julia/packages/CUDAnative/2WQzk/src/execution.jl:176 [inlined]
 [15] macro expansion at ./gcutils.jl:91 [inlined]
 [16] macro expansion at /root/.julia/packages/CUDAnative/2WQzk/src/execution.jl:173 [inlined]
 [17] _gpu_call(::CuArrays.CuArrayBackend, ::Function, ::CuArray{Bool,1,Nothing}, ::Tuple{CuArray{Bool,1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},var"#10#11",Tuple{Base.Broadcast.Extruded{CuArray{Float32,1,Nothing},Tuple{Bool},Tuple{Int64}}}}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at /root/.julia/packages/CuArrays/7z7MV/src/gpuarray_interface.jl:62
 [18] gpu_call(::Function, ::CuArray{Bool,1,Nothing}, ::Tuple{CuArray{Bool,1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},var"#10#11",Tuple{Base.Broadcast.Extruded{CuArray{Float32,1,Nothing},Tuple{Bool},Tuple{Int64}}}}}, ::Int64) at /root/.julia/packages/GPUArrays/0lvhc/src/abstract_gpu_interface.jl:151
 [19] gpu_call at /root/.julia/packages/GPUArrays/0lvhc/src/abstract_gpu_interface.jl:128 [inlined]
 [20] copyto! at /root/.julia/packages/GPUArrays/0lvhc/src/broadcast.jl:48 [inlined]
 [21] copyto! at ./broadcast.jl:863 [inlined]
 [22] copy at ./broadcast.jl:839 [inlined]
 [23] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,var"#10#11",Tuple{CuArray{Float32,1,Nothing}}}) at ./broadcast.jl:819
 [24] map(::Function, ::CuArray{Float32,1,Nothing}) at /root/.julia/packages/GPUArrays/0lvhc/src/base.jl:9
 [25] filter(::Function, ::CuArray{Float32,1,Nothing}) at /root/.julia/packages/GPUArrays/0lvhc/src/abstractarray.jl:263
 [26] top-level scope at In[8]:1

Maybe try generating the random numbers into a separate array first.
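Something like this, for example (untested; whether the final selection can stay on the GPU depends on your GPUArrays/CuArrays version, so the last line falls back to the host):

r    = CuArrays.rand(Float32, length(cA))   # one uniform draw per element, generated up front
mask = (cA .== 1) .| (r .< 0.5f0)           # pure broadcast, no rand() call inside the kernel
kept = cu(Array(cA)[Array(mask)])           # select on the host as a safe fallback, then move back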