GPU performance problem of locally connected layer with Tullio

I am using Tullio.jl to implement a locally connected layer, which is like a convolutional layer except without weight sharing. I found Tullio very useful for this task, but unfortunately there is a large performance gap between CPU and GPU. On the CPU my convlocal function is nearly as fast as NNlib.conv. On the GPU the forward pass is fast, but the backward pass is very slow. I could use tips on how to improve the performance of the gradient computation on the GPU.

Here is a MWE:

using Tullio, NNlib, Flux, BenchmarkTools, CUDA, CUDAKernels, KernelAbstractions

function convlocal(x, W)
    @tullio fastmath=false c[s, t, c2, b] := x[s+i-1, t+j-1, c1, b] * W[s+i-1, t+j-1, i, j, c1, c2]
    return c
end

device = gpu
kernel_width, kernel_height, ch_in, ch_out = 3, 3, 16, 16
img_width, img_height = 28, 28
batchsize = 100

x = device(rand(Float32, img_width, img_height, ch_in, batchsize))
Wconv = device(rand(Float32, kernel_width, kernel_height, ch_in, ch_out))
Wlocal = device(rand(Float32, img_width, img_height, kernel_width, kernel_height, ch_in, ch_out))
ps_conv = Flux.params(Wconv)
ps_local = Flux.params(Wlocal)

@info "Benchmarking convlocal"
@info "forward"
@btime convlocal(x, Wlocal)
@info "backward"
@btime gradient(() -> sum(abs2, convlocal(x, Wlocal)), ps_local)
println("=====================")
@info "Benchmarking NNlib.conv"
@info "forward"
@btime conv(x, Wconv)
@info "backward"
@btime gradient(() -> sum(abs2, NNlib.conv(x, Wconv)), ps_conv)
println("=====================")

Which on my machine gives the output:

[ Info: Benchmarking convlocal
[ Info: forward
  24.326 μs (151 allocations: 7.19 KiB)
[ Info: backward
  3.036 s (629 allocations: 31.00 KiB)
=====================
[ Info: Benchmarking NNlib.conv
[ Info: forward
  17.402 μs (75 allocations: 2.75 KiB)
[ Info: backward
  325.009 μs (401 allocations: 23.81 KiB)
=====================

Have you considered @ein? In a brief test, I found that @ein was faster than @tullio on GPU with a 1×1 filter, please see here. It would be great to have a fast locally connected layer in Flux.
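
For reference, a 1×1-filter locally connected layer in @ein notation could look roughly like the sketch below (the function name convlocal_1x1 and the weight array Wloc1 are just illustrative, not taken from the linked test):

using OMEinsum

# Sketch: 1×1-filter locally connected layer with OMEinsum's @ein.
# Wloc1 holds one weight per spatial position and channel pair (no weight sharing).
function convlocal_1x1(x, Wloc1)
    # x     :: (width, height, ch_in, batch)
    # Wloc1 :: (width, height, ch_in, ch_out)
    @ein c[s, t, c2, b] := x[s, t, c1, b] * Wloc1[s, t, c1, c2]
    return c
end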

Thanks,
I had a look at @ein (from OMEinsum, right?), but I could not get it to work with 3×3 filters. It also didn't seem to work with autodiff, so I would need to define a custom rrule anyway.

So for now, since Tullio was only slow on the GPU in the backward pass, I am trying out wrapping the symbolic gradients (printed by @tullio verbose=1) in their own functions. This solves the GPU runtime issue.
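
In case it helps, this is roughly the shape of the wrapper I have in mind: a custom rrule that calls the explicit gradient functions (a sketch assuming the convlocal, ∇x and ∇W functions from the MWE below; not tested beyond that):

using ChainRulesCore

# Sketch: route AD through the hand-wrapped gradient kernels.
function ChainRulesCore.rrule(::typeof(convlocal), x, W)
    c = convlocal(x, W)
    function convlocal_pullback(Δy)
        Δȳ = unthunk(Δy)
        Δx = zero(x)        # accumulators, same shape as the inputs
        ΔW = zero(W)
        ∇x(W, Δȳ, Δx)       # fill Δx in place
        ∇W(x, Δȳ, ΔW)       # fill ΔW in place
        return NoTangent(), Δx, ΔW
    end
    return c, convlocal_pullback
end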

But when comparing the gradients produced by the explicitly defined functions to those produced by Tullio's auto-generated gradients, there is some difference. Gradients with respect to W always match, but gradients with respect to x only match when ch_out = 1.

Here is a small CPU-based MWE:

using Tullio, Zygote

function convlocal(x, W)
    @tullio verbose=0 fastmath=false c[s, t, c2, b] := x[s+i-1, t+j-1, c1, b] * W[s+i-1, t+j-1, i, j, c1, c2]
    return c
end

function ∇W(x, Δy, ΔW)
    @tullio verbose=0 fastmath=false grad=false (ΔW[(s + i) - 1, (t + j) - 1, i, j, c1, c2] = ΔW[(s + i) - 1, (t + j) - 1, i, j, c1, c2] + Δy[s, t, c2, b] * conj(x[(s + i) - 1, (t + j) - 1, c1, b]))
end

function ∇x(W, Δy, Δx)
    @tullio verbose=0 fastmath=false grad=false (Δx[(s + i) - 1, (t + j) - 1, c1, b] = Δx[(s + i) - 1, (t + j) - 1, c1, b] + Δy[s, t, c2, b] * conj(W[(s + i) - 1, (t + j) - 1, i, j, c1, c2]))
end

kernel_width, kernel_height, ch_in, ch_out = 3, 3, 2, 3
img_width, img_height = 10, 10
batchsize = 3
x = rand(Float32, img_width, img_height, ch_in, batchsize)
W = rand(Float32, img_width, img_height, kernel_width, kernel_height, ch_in, ch_out)
Δy = rand(Float32, img_width-2, img_height-2, ch_out, batchsize)
ΔW = zeros(Float32, size(W))   
Δx = zeros(Float32, size(x))
       
# Compute grads wrt x and W using the explicitly defined functions
∇x(W, Δy, Δx)
∇W(x, Δy, ΔW)

# Compute grads wrt x and W using Tullio's auto-generated gradients
G = Zygote._pullback(convlocal, x, W)
Δx2 = G[2](Δy)[2]
ΔW2 = G[2](Δy)[3]

# compare the two approaches
@show Δx ≈ Δx2
@show ΔW ≈ ΔW2

Producing the output

Δx ≈ Δx2 = false
ΔW ≈ ΔW2 = true

I got some useful feedback from the Tullio dev regarding the issue with ∇x here.
I also discovered that timing on the GPU requires CUDA.@sync, so the GPU benchmarks look a bit different, but convlocal is still slow compared to NNlib.conv.
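
For reference, the synced timings were taken along these lines (reusing the globals from the first MWE; without CUDA.@sync, @btime only measures the asynchronous kernel launch):

@btime CUDA.@sync convlocal(x, Wlocal)
@btime CUDA.@sync gradient(() -> sum(abs2, convlocal(x, Wlocal)), ps_local)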

Long story short: Tullio does an extremely good job for locally connected layers on the CPU, but is quite slow on the GPU (both forward and backward). An efficient GPU implementation probably needs to be written directly in either KernelAbstractions.jl or CUDA.jl. However, the KernelAbstractions code produced by Tullio (printed when setting @tullio verbose=2) is a useful starting point.
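
To give an idea of that direction, here is a minimal, untested sketch of a hand-written forward kernel with KernelAbstractions, following the index expression from convlocal above (the kernel name, the workgroup size of 256, and the CUDAKernels launch API are my choices, not taken from Tullio's generated code):

using KernelAbstractions, CUDAKernels, CUDA

# One work item per output element; the inner loops reduce over the
# kernel window and input channels, mirroring the @tullio expression.
@kernel function convlocal_forward!(c, @Const(x), @Const(W))
    s, t, c2, b = @index(Global, NTuple)
    acc = zero(eltype(c))
    @inbounds for c1 in 1:size(W, 5), j in 1:size(W, 4), i in 1:size(W, 3)
        acc += x[s+i-1, t+j-1, c1, b] * W[s+i-1, t+j-1, i, j, c1, c2]
    end
    @inbounds c[s, t, c2, b] = acc
end

# Launch, reusing x, Wlocal and the sizes from the first MWE:
c = CUDA.zeros(Float32, img_width - 2, img_height - 2, ch_out, batchsize)
kernel! = convlocal_forward!(CUDADevice(), 256)   # workgroup size is a guess
event = kernel!(c, x, Wlocal; ndrange = size(c))
wait(event)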

Sidenote: an alternative is the fold/unfold-based approach proposed for PyTorch: link1, link2 (though this has remained unreleased for several years now). The problem with this approach is that locally connected layers already require much more memory than the corresponding convolutional layer, and using unfold could make the memory issue worse. On top of that, memory is often the bottleneck on the GPU.