Taking the gradient of an einsum

Hello, I would really appreciate some help optimizing my code. The quantity I compute makes the most sense (to me) as an einsum, but I cannot seem to take its gradient properly. What I managed to do instead is iterate with `eachcol` (and, for the weights, an `eachcol` of a row matrix), but naturally it is slow.

```julia
using Flux, CUDA, Tullio, KernelAbstractions, CUDAKernels
using SpecialFunctions: loggamma
using BenchmarkTools

struct mymodel
    weights
end
Flux.@functor mymodel

# Poisson log-likelihood
ll(x::Real, rate::Real) = x*log(rate) - rate - loggamma(x+1)

function loss(loglikelihood)
    y = Flux.onehotbatch(axes(loglikelihood)...) |> gpu
    Flux.logitcrossentropy(loglikelihood, y')
end
```

```julia
# generate some data
n_i = 300
n_j = 250
n_k = 200

rates = CUDA.rand(n_i, n_k)
xs = CUDA.rand(n_j, n_k)
weights = CUDA.rand(1, n_k)

model = mymodel(weights)

# several implementations of the loglikelihood computation

"naive version"
function (model::mymodel)(rates, xs)
    loglikelihood = zeros(size(xs, 1), size(rates, 1)) |> gpu # using `CUDA.zeros` fails for some reason when taking the gradient. Any idea why?
    for (w, xs_col, rates_col) in zip(eachcol(model.weights), eachcol(xs), eachcol(rates))
        loglikelihood += abs.(w) .* ll.(xs_col, rates_col')
    end
    loglikelihood
end

@btime model($rates, $xs); # 1.575 ms (19422 allocations: 2.01 MiB)
@btime Flux.withgradient(model -> loss(model($rates, $xs)), model); # 43.186 ms (451031 allocations: 20.93 MiB)
```

```julia
"somewhat faster"
function (model::mymodel)(rates, xs)
    mapreduce(
        (xs_col, rates_col, w) -> abs.(w) .* ll.(xs_col, rates_col'), +,
        eachcol(xs), eachcol(rates), eachcol(model.weights))
end

@btime model($rates, $xs); # 1.191 ms (16988 allocations: 1.06 MiB)
@btime Flux.withgradient(model -> loss(model($rates, $xs)), model); # 29.738 ms (172123 allocations: 8.56 MiB)
```

```julia
"fastest, but taking the gradient doesn't work"
function (model::mymodel)(rates, xs)
    @tullio loglikelihood[i, j] := ll(xs[i, k], rates[j, k]) * abs(model.weights[k]) grad=Dual
end

@btime model($rates, $xs); # 18.067 μs (107 allocations: 4.75 KiB)
@btime Flux.withgradient(model -> loss(model($rates, $xs)), model) # fails; the (shortened) error message is:
# ERROR: InvalidIRError: compiling kernel (...) resulted in invalid LLVM IR
#   Reason: unsupported dynamic function invocation (call to +)
```

I also tried some other einsum macros (`@einsum`, `@tensor`).
Thank you

Are you using CUDA v4? If so, I suspect the problem is changes there and in KernelAbstractions. It is perhaps fixed in this PR; sorry, I haven't had the bandwidth to look properly.

(This won't solve your problem, but I believe this would be a nice use for `nograd=ll` if that worked, as it would allow a symbolic gradient instead of Dual numbers.)

Hi, I am using CUDA v4. Is there an efficient way to implement this without Tullio?

Regarding `nograd=ll`, I didn't quite understand what it would do here. I tried both with `grad=Dual` and without, but naively, since this is just a Poisson log-likelihood, I would guess that taking the symbolic gradient is possible, isn't it?
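For what it's worth, the derivative of `ll` with respect to `rate` is just `x/rate - 1`, since the `loggamma` term doesn't depend on `rate`, so a symbolic gradient certainly exists. A quick CPU-only sanity check (a sketch, not part of the code above; `dll_drate` is a name I made up):

```julia
using SpecialFunctions: loggamma

ll(x, rate) = x*log(rate) - rate - loggamma(x+1)
dll_drate(x, rate) = x/rate - 1  # symbolic ∂ll/∂rate

# central finite-difference check at an arbitrary point
h = 1e-6
fd = (ll(3.0, 2.0 + h) - ll(3.0, 2.0 - h)) / (2h)
@assert isapprox(fd, dll_drate(3.0, 2.0); atol=1e-6)
```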

You could try the linked PR for CUDA v4.

An alternative is to use broadcasting, e.g. `using TensorCast; @reduce loglikelihood[i, j] := sum(k) ll(xs[i, k], rates[j, k]) * abs(model.weights[k])`. This is a horrible algorithm, since it allocates the whole N^3-sized intermediate, but for expressions like this one with matmul-like access patterns, Tullio also isn't very smart about generating efficient GPU code.
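Concretely, the `@reduce` expression lowers to something like the following hand-written broadcast (a sketch; `loglik_broadcast` is a hypothetical name, and you can see the full `(i, j, k)` cube being materialized before the sum):

```julia
# Hand-written equivalent of the @reduce expression above (sketch).
# Allocates the full (n_i, n_j, n_k) intermediate, as noted.
function loglik_broadcast(weights, rates, xs)
    xs3    = reshape(xs, size(xs, 1), 1, :)          # (i, 1, k)
    rates3 = reshape(rates, 1, size(rates, 1), :)    # (1, j, k)
    w3     = reshape(abs.(weights), 1, 1, :)         # (1, 1, k)
    dropdims(sum(ll.(xs3, rates3) .* w3; dims=3); dims=3)
end
```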

Re `nograd`, perhaps this is deep in the weeds, but my point is that it's not necessary to know the derivative of `ll` to compute the gradient with respect to `model.weights`; you only need `*`.
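In other words, because the result is linear in `abs.(weights)`, the pullback for `model.weights` only needs the already-computed values of `ll`, never its derivative. A hypothetical hand-written gradient might look like this (a sketch; `Δ` stands for the upstream gradient with respect to `loglikelihood[i, j]`):

```julia
# Sketch: gradient of sum(Δ .* loglikelihood) w.r.t. weights,
# using only ll's *values*, not its derivative.
function weights_grad(Δ, weights, rates, xs)
    L = ll.(reshape(xs, size(xs, 1), 1, :),        # (i, 1, k)
            reshape(rates, 1, size(rates, 1), :))  # broadcasts to (i, j, k)
    g = sum(Δ .* L; dims=(1, 2))                   # contract over i and j
    reshape(g, 1, :) .* sign.(weights)             # chain rule through abs
end
```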

The PR didn't help; I got this error:

```
ERROR: Compiling Tuple{CUDA.var"#new_unique#2", Ptr{CUDA.CUctx_st}}: try/catch is not supported.
Refer to the Zygote documentation for fixes.
```

Using TensorCast was helpful, and even the forward computation (without taking the gradient) was actually faster than Tullio.

```julia
function (model::mymodel)(rates, xs)
    @reduce loglikelihood[i, j] := sum(k) ll(xs[i, k], rates[j, k]) * abs(model.weights[1, k])
end

@btime model($rates, $xs); # 6.600 μs (84 allocations: 7.88 KiB)
@btime Flux.withgradient(model -> loss(model($rates, $xs)), $model) # 12.276 ms (1189 allocations: 72.70 KiB)
```

I was surprised that although the forward computation is roughly 200x faster than my naive implementation, taking the gradient is only 2-3 times faster.
Thanks for the help.