Advice for optimising code for GPU

Hello!

I have recently ported a MATLAB package to Julia (and found the process surprisingly straightforward), but I now wish to be able to execute it on the GPU, as I want to use it as part of a loss function in an ML project of mine. Currently it runs on the GPU very slowly, with considerable allocations. Since the code was ported quite directly from MATLAB, I'm sure there are better and more GPU-friendly ways to structure it, and I was hoping for some guidance. I have reduced my code to what I think are the key issues below:

function calc_mi(k,X, Y)
        input = [X[1:end-1,k] Y[2:end,1:end]]
        Dx    = size(X, 2)
        rho   = cor(input)
        mi    = 0.5 * logdet(rho[1:Dx, 1:Dx] + logdet(rho[Dx+1:end, Dx+1:end]) - logdet(rho))
        return mi
end

function calc_ψ(X,Y)
        mapreduce(k -> calc_mi(k,X,Y),+, 1:size(X,2))
end

Obviously scalar indexing is a key issue, but I’m not sure how to convert it to a broadcast call, since I’m trying to extract submatrices rather than individual values.

One other issue that I’m not sure how to deal with is the first line of the calc_mi function (which is essentially a call to hcat, I think), which will end up allocating on each call. I looked at how I might use views to avoid the allocations, found a few almost-helpful packages (such as GitHub - ahwillia/CatViews.jl: Concatenated Array views in Julia.), and tried to roll my own struct to handle indexing into two views, but ultimately felt I might be going down the wrong path.

I also found the Folds and FoldsCUDA libraries, but when trying to use the CUDAEx() executor, I got lots of errors about dynamic method calls (I’m guessing it tries to construct a CUDA kernel from the contents of my mapreduce function, or something?).
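
The attempt was along these lines (paraphrased, not the exact call I used):

using Folds, FoldsCUDA

# calc_ψ, but asking Folds to drive the mapreduce with the CUDA executor.
# My guess is that calc_mi's internals (cor, logdet, the array allocations)
# can't be compiled into a single GPU kernel, hence the dynamic dispatch errors.
calc_ψ_cuda(X, Y) = Folds.mapreduce(k -> calc_mi(k, X, Y), +, 1:size(X, 2), CUDAEx())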

Any guidance greatly appreciated :pray:

Nothing specific to GPUs, but instead of stacking X and Y you could define separate correlation matrices rhoXX, rhoXY, and rhoYY, and then you would not get additional copies in the logdet calls.
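
Roughly what I have in mind is something like this (untested sketch, assuming cor is the one from Statistics; I keep the X column as a one-column view so cor returns matrices, and split the blocks at that single column rather than at Dx):

using Statistics, LinearAlgebra

function calc_mi_blocks(k, X, Y)
    x = @view X[1:end-1, k:k]             # one-column view, so cor(x) is a 1×1 matrix
    y = @view Y[2:end, :]
    rhoXX = cor(x)                        # X–X block
    rhoYY = cor(y)                        # Y–Y block
    rhoXY = cor(x, y)                     # X–Y cross block
    rho   = [rhoXX rhoXY; rhoXY' rhoYY]   # joint correlation, assembled once
    return 0.5 * (logdet(rhoXX) + logdet(rhoYY) - logdet(rho))
end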

As for the GPU, what types are you calling your function on? Have you disabled scalar indexing to see if and where it happens?
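
If not, something like this (with CUDA.jl) will make any accidental scalar indexing throw, pointing at the offending line:

using CUDA

# Disallow element-by-element access on CuArrays; any code path that falls
# back to scalar indexing will now raise an error instead of running slowly.
CUDA.allowscalar(false)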

Thanks for your reply!
I will investigate the allocations saved if I do as you suggest - but my initial tests show that doing cor(x), cor(y), and cor(x, y) will likely lead to more allocations than a single cor([x y]), unless I’ve misunderstood your suggestion?

As for scalar indexing, lines 1 and 4 of calc_mi are both culprits as you can see. Perhaps there is some way to restructure the data such that a broadcast would work? I can’t quite see it…

This would be easier if you gave a minimal working example - I’m struggling to guess the dimensions of the inputs properly. Is cor from Statistics or somewhere else?
I can only get the mi line to run if I change + and - to .+ and .-. Or is that a problem with the input geometry?

In any case, my guess is that one of the biggest initial problems is that
input = [X[1:end-1,k] Y[2:end,1:end]] materialises the submatrices before concatenating them.

You probably need something like input = [(@view X[1:end-1,k]) (@view Y[2:end,1:end])]. Although that seems to increase the number of allocations slightly, the total memory allocated goes down. Doing the same for the mi line does seem to decrease the run time and total memory further.

julia> function calc_mi(k,X, Y)
               input = [X[1:end-1,k] Y[2:end,1:end]]
               Dx    = size(X, 2)
               rho   = cor(input)
               mi    = 0.5 * logdet(rho[1:Dx, 1:Dx] .+ logdet(rho[Dx+1:end, Dx+1:end]) .- logdet(rho))
               return mi
       end
calc_mi (generic function with 1 method)

julia> calc_mi(10, rand(10,10), rand(10,10))
-14.853146497748485

julia> X=  rand(1000,1000); Y = rand(1000,1000);


julia> function calc_mi2(k,X, Y)
               input = [(@view X[1:end-1,k]) (@view Y[2:end,1:end])]
               Dx    = size(X, 2)
               rho   = cor(input)
               mi    = 0.5 * logdet(rho[1:Dx, 1:Dx] .+ logdet(rho[Dx+1:end, Dx+1:end]) .- logdet(rho))
               return mi
       end
calc_mi2 (generic function with 1 method)

julia> function calc_mi3(k,X, Y)
               input = [(@view X[1:end-1,k]) (@view Y[2:end,1:end])]
               Dx    = size(X, 2)
               rho   = cor(input)
               mi    = 0.5 * logdet((@view rho[1:Dx, 1:Dx]) .+ logdet(@view rho[Dx+1:end, Dx+1:end]) .- logdet(rho))
               return mi
       end
calc_mi3 (generic function with 1 method)

julia> @time calc_mi(10, X, Y);@time calc_mi(10, X, Y)
  0.053458 seconds (31 allocations: 61.098 MiB, 14.52% gc time)
  0.050697 seconds (31 allocations: 61.098 MiB, 11.57% gc time)
-502.3481229661156

julia> @time calc_mi2(10, X, Y);@time calc_mi2(10, X, Y)
  0.048822 seconds (36 allocations: 53.469 MiB, 8.85% gc time)
  0.048568 seconds (36 allocations: 53.469 MiB, 6.76% gc time)
-502.3481229661156

julia> @time calc_mi3(10, X, Y);@time calc_mi3(10, X, Y)
  0.048139 seconds (33 allocations: 45.840 MiB, 6.76% gc time)
  0.045365 seconds (33 allocations: 45.840 MiB, 8.37% gc time)
-502.3481229661156