Advice for optimising code for GPU

Hello!

I have recently ported a MATLAB package to Julia (and found the process surprisingly straightforward), but I now wish to be able to execute it on the GPU, as I want to use it as part of a loss function in an ML project of mine. Currently it runs on the GPU very slowly, with considerable allocations. Since the code was ported quite directly from MATLAB, I'm sure there are better and more GPU-friendly ways to structure it, and I was hoping for some guidance. I have reduced my code to what I think are the key issues below:

function calc_mi(k,X, Y)
        input = [X[1:end-1,k] Y[2:end,1:end]]
        Dx    = size(X, 2)
        rho   = cor(input)
        mi    = 0.5 * logdet(rho[1:Dx, 1:Dx] + logdet(rho[Dx+1:end, Dx+1:end]) - logdet(rho))
        return mi
end

function calc_ψ(X,Y)
        mapreduce(k -> calc_mi(k,X,Y),+, 1:size(X,2))
end

Obviously scalar indexing is a key issue, but I’m not sure how to convert it to a broadcast call, since I’m trying to extract submatrices rather than individual values.

One other issue that I’m not sure how to deal with is the first line of the calc_mi function (which is essentially a call to hcat, I think), which will end up allocating on each call. I looked at how I might use views to avoid the allocations, found a few almost-helpful packages (such as GitHub - ahwillia/CatViews.jl: Concatenated Array views in Julia.), and tried to roll my own struct to handle indexing into two views, but ultimately felt I might be going down the wrong path.

I also found the Folds and FoldsCUDA libraries, but when trying to use the CUDAEx() executor, I got lots of errors about dynamic method calls (I’m guessing it tries to construct a CUDA kernel from the contents of my mapreduce function, or something?).
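
The attempt was along these lines (paraphrased, not the exact call I used):

using Folds, FoldsCUDA

# calc_ψ, but asking Folds to drive the mapreduce with the CUDA executor.
# My guess is that calc_mi's internals (cor, logdet, the array allocations)
# can't be compiled into a single GPU kernel, hence the dynamic dispatch errors.
calc_ψ_cuda(X, Y) = Folds.mapreduce(k -> calc_mi(k, X, Y), +, 1:size(X, 2), CUDAEx())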

Any guidance greatly appreciated :pray:

Nothing specific to GPUs, but instead of stacking X and Y you could define separate correlation matrices rhoXX, rhoXY, and rhoYY, and then you would not get additional copies in the logdet calls.
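
Roughly what I have in mind is something like this (untested sketch, assuming cor is the one from Statistics; I keep the X column as a one-column view so cor returns matrices, and split the blocks at that single column rather than at Dx):

using Statistics, LinearAlgebra

function calc_mi_blocks(k, X, Y)
    x = @view X[1:end-1, k:k]             # one-column view, so cor(x) is a 1×1 matrix
    y = @view Y[2:end, :]
    rhoXX = cor(x)                        # X–X block
    rhoYY = cor(y)                        # Y–Y block
    rhoXY = cor(x, y)                     # X–Y cross block
    rho   = [rhoXX rhoXY; rhoXY' rhoYY]   # joint correlation, assembled once
    return 0.5 * (logdet(rhoXX) + logdet(rhoYY) - logdet(rho))
end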

As for the GPU, what types are you calling your function on? Have you disabled scalar indexing to see if and where it happens?
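
If not, something like this (with CUDA.jl) will make any accidental scalar indexing throw, pointing at the offending line:

using CUDA

# Disallow element-by-element access on CuArrays; any code path that falls
# back to scalar indexing will now raise an error instead of running slowly.
CUDA.allowscalar(false)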

Thanks for your reply!
I will investigate the allocations saved if I do as you suggest - but my initial tests show that doing cor(x), cor(y), and cor(x, y) will likely lead to more allocations than a single cor([x y]), unless I’ve misunderstood your suggestion?

As for scalar indexing, lines 1 and 4 of calc_mi are both culprits as you can see. Perhaps there is some way to restructure the data such that a broadcast would work? I can’t quite see it…

This would be easier if you gave a minimal working example - I’m struggling to guess the dimensions of the inputs properly. Is cor from Statistics or somewhere else?
I can only get the mi line to run if I change + and - to .+ and .-. Or is that a problem with the input geometry?

In any case, my guess is that one of the biggest initial problems is that
input = [X[1:end-1,k] Y[2:end,1:end]] materialises the submatrices before concatenating them.

You probably need something like input = [(@view X[1:end-1,k]) (@view Y[2:end,1:end])]. Although that seems to increase the number of allocations slightly, the total memory allocated goes down. Doing the same for the mi line does seem to decrease the run time and total memory further.

julia> function calc_mi(k,X, Y)
               input = [X[1:end-1,k] Y[2:end,1:end]]
               Dx    = size(X, 2)
               rho   = cor(input)
               mi    = 0.5 * logdet(rho[1:Dx, 1:Dx] .+ logdet(rho[Dx+1:end, Dx+1:end]) .- logdet(rho))
               return mi
       end
calc_mi (generic function with 1 method)

julia> calc_mi(10, rand(10,10), rand(10,10))
-14.853146497748485

julia> X=  rand(1000,1000); Y = rand(1000,1000);


julia> function calc_mi2(k,X, Y)
               input = [(@view X[1:end-1,k]) (@view Y[2:end,1:end])]
               Dx    = size(X, 2)
               rho   = cor(input)
               mi    = 0.5 * logdet(rho[1:Dx, 1:Dx] .+ logdet(rho[Dx+1:end, Dx+1:end]) .- logdet(rho))
               return mi
       end
calc_mi2 (generic function with 1 method)

julia> function calc_mi3(k,X, Y)
               input = [(@view X[1:end-1,k]) (@view Y[2:end,1:end])]
               Dx    = size(X, 2)
               rho   = cor(input)
               mi    = 0.5 * logdet((@view rho[1:Dx, 1:Dx]) .+ logdet(@view rho[Dx+1:end, Dx+1:end]) .- logdet(rho))
               return mi
       end
calc_mi3 (generic function with 1 method)

julia> @time calc_mi(10, X, Y);@time calc_mi(10, X, Y)
  0.053458 seconds (31 allocations: 61.098 MiB, 14.52% gc time)
  0.050697 seconds (31 allocations: 61.098 MiB, 11.57% gc time)
-502.3481229661156

julia> @time calc_mi2(10, X, Y);@time calc_mi2(10, X, Y)
  0.048822 seconds (36 allocations: 53.469 MiB, 8.85% gc time)
  0.048568 seconds (36 allocations: 53.469 MiB, 6.76% gc time)
-502.3481229661156

julia> @time calc_mi3(10, X, Y);@time calc_mi3(10, X, Y)
  0.048139 seconds (33 allocations: 45.840 MiB, 6.76% gc time)
  0.045365 seconds (33 allocations: 45.840 MiB, 8.37% gc time)
-502.3481229661156