GPU performance problem of locally connected layer with Tullio

I am using Tullio.jl to implement a locally connected layer, which is like a convolutional layer except without weight sharing. I found Tullio very useful for this task, but unfortunately there is a large performance gap between CPU and GPU. On the CPU my convlocal function is nearly as fast as NNlib.conv. On the GPU the forward pass is fast, but the backward pass is very slow. I could use tips on how to improve the performance of the gradient computation on the GPU.

Here is a MWE:

using Tullio, NNlib, Flux, BenchmarkTools, CUDA, CUDAKernels, KernelAbstractions

function convlocal(x, W)
    @tullio fastmath=false c[s, t, c2, b] := x[s+i-1, t+j-1, c1, b] * W[s+i-1, t+j-1, i, j, c1, c2]
    return c
end

device = gpu
kernel_width, kernel_height, ch_in, ch_out = 3, 3, 16, 16
img_width, img_height = 28, 28
batchsize = 100

x = device(rand(Float32, img_width, img_height, ch_in, batchsize))
Wconv = device(rand(Float32, kernel_width, kernel_height, ch_in, ch_out))
Wlocal = device(rand(Float32, img_width, img_height, kernel_width, kernel_height, ch_in, ch_out))
ps_conv = Flux.params(Wconv)
ps_local = Flux.params(Wlocal)

@info "Benchmarking convlocal"
@info "forward"
@btime convlocal(x, Wlocal)
@info "backward"
@btime gradient(() -> sum(abs2, convlocal(x, Wlocal)), ps_local)
println("=====================")
@info "Benchmarking NNlib.conv"
@info "forward"
@btime conv(x, Wconv)
@info "backward"
@btime gradient(() -> sum(abs2, NNlib.conv(x, Wconv)), ps_conv)
println("=====================")

Which on my machine gives the output:

[ Info: Benchmarking convlocal
[ Info: forward
  24.326 μs (151 allocations: 7.19 KiB)
[ Info: backward
  3.036 s (629 allocations: 31.00 KiB)
=====================
[ Info: Benchmarking NNlib.conv
[ Info: forward
  17.402 μs (75 allocations: 2.75 KiB)
[ Info: backward
  325.009 μs (401 allocations: 23.81 KiB)
=====================

Have you considered @ein? In a brief test, I found that @ein was faster than @tullio on GPU with a 1×1 filter, please see here. It would be great to have a fast locally connected layer in Flux.
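
For reference, a 1×1-filter locally connected layer in @ein notation could look roughly like the sketch below (the function name convlocal_1x1 and the weight array Wloc1 are just illustrative, not taken from the linked test):

using OMEinsum

# Sketch: 1×1-filter locally connected layer with OMEinsum's @ein.
# Wloc1 holds one weight per spatial position and channel pair (no weight sharing).
function convlocal_1x1(x, Wloc1)
    # x     :: (width, height, ch_in, batch)
    # Wloc1 :: (width, height, ch_in, ch_out)
    @ein c[s, t, c2, b] := x[s, t, c1, b] * Wloc1[s, t, c1, c2]
    return c
end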

Thanks,
I had a look at @ein (from OMEinsum, right?), but I could not get it to work with 3×3 filters. It also didn't seem to work with autodiff, so I would need to define a custom rrule anyway.

So for now, since Tullio was only slow on the GPU in the backward pass, I am trying out wrapping the symbolic gradients (printed by @tullio verbose=1) in their own functions. This solves the GPU runtime issue.
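
In case it helps, this is roughly the shape of the wrapper I have in mind: a custom rrule that calls the explicit gradient functions (a sketch assuming the convlocal, ∇x and ∇W functions from the MWE below; not tested beyond that):

using ChainRulesCore

# Sketch: route AD through the hand-wrapped gradient kernels.
function ChainRulesCore.rrule(::typeof(convlocal), x, W)
    c = convlocal(x, W)
    function convlocal_pullback(Δy)
        Δȳ = unthunk(Δy)
        Δx = zero(x)        # accumulators, same shape as the inputs
        ΔW = zero(W)
        ∇x(W, Δȳ, Δx)       # fill Δx in place
        ∇W(x, Δȳ, ΔW)       # fill ΔW in place
        return NoTangent(), Δx, ΔW
    end
    return c, convlocal_pullback
end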

But when comparing the gradients produced by the explicitly defined functions to those produced by Tullio's auto-generated gradients, there is some difference. Gradients with respect to W always match, but gradients with respect to x only match when ch_out = 1.

Here is a small CPU-based MWE:

using Tullio, Zygote

function convlocal(x, W)
    @tullio verbose=0 fastmath=false c[s, t, c2, b] := x[s+i-1, t+j-1, c1, b] * W[s+i-1, t+j-1, i, j, c1, c2]
    return c
end

function ∇W(x, Δy, ΔW)
    @tullio verbose=0 fastmath=false grad=false (ΔW[(s + i) - 1, (t + j) - 1, i, j, c1, c2] = ΔW[(s + i) - 1, (t + j) - 1, i, j, c1, c2] + Δy[s, t, c2, b] * conj(x[(s + i) - 1, (t + j) - 1, c1, b]))
end

function ∇x(W, Δy, Δx)
    @tullio verbose=0 fastmath=false grad=false (Δx[(s + i) - 1, (t + j) - 1, c1, b] = Δx[(s + i) - 1, (t + j) - 1, c1, b] + Δy[s, t, c2, b] * conj(W[(s + i) - 1, (t + j) - 1, i, j, c1, c2]))
end

kernel_width, kernel_height, ch_in, ch_out = 3, 3, 2, 3
img_width, img_height = 10, 10
batchsize = 3
x = rand(Float32, img_width, img_height, ch_in, batchsize)
W = rand(Float32, img_width, img_height, kernel_width, kernel_height, ch_in, ch_out)
Δy = rand(Float32, img_width-2, img_height-2, ch_out, batchsize)
ΔW = zeros(Float32, size(W))   
Δx = zeros(Float32, size(x))
       
# Compute grads wrt x and W using the explicitly defined functions
∇x(W, Δy, Δx)
∇W(x, Δy, ΔW)

# Compute grads wrt x and W using Tullio's auto-generated gradients
G = Zygote._pullback(convlocal, x, W)
Δx2 = G[2](Δy)[2]
ΔW2 = G[2](Δy)[3]

# compare the two approaches
@show Δx ≈ Δx2
@show ΔW ≈ ΔW2

Producing the output

Δx ≈ Δx2 = false
ΔW ≈ ΔW2 = true

I got some useful feedback from the Tullio dev regarding the issue with ∇x here.
I also discovered that timing on the GPU requires CUDA.@sync, so the GPU benchmarks look a bit different, but convlocal is still slow compared to NNlib.conv.
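
For reference, the synced timings were taken along these lines (reusing the globals from the first MWE; without CUDA.@sync, @btime only measures the asynchronous kernel launch):

@btime CUDA.@sync convlocal(x, Wlocal)
@btime CUDA.@sync gradient(() -> sum(abs2, convlocal(x, Wlocal)), ps_local)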

Long story short: Tullio does an extremely good job for locally connected layers on the CPU, but is quite slow on the GPU (both forward and backward). An efficient GPU implementation probably needs to be written directly in either KernelAbstractions.jl or CUDA.jl. However, the KernelAbstractions code produced by Tullio (printed when setting @tullio verbose=2) is a useful starting point.
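
To give an idea of that direction, here is a minimal, untested sketch of a hand-written forward kernel with KernelAbstractions, following the index expression from convlocal above (the kernel name, the workgroup size of 256, and the CUDAKernels launch API are my choices, not taken from Tullio's generated code):

using KernelAbstractions, CUDAKernels, CUDA

# One work item per output element; the inner loops reduce over the
# kernel window and input channels, mirroring the @tullio expression.
@kernel function convlocal_forward!(c, @Const(x), @Const(W))
    s, t, c2, b = @index(Global, NTuple)
    acc = zero(eltype(c))
    @inbounds for c1 in 1:size(W, 5), j in 1:size(W, 4), i in 1:size(W, 3)
        acc += x[s+i-1, t+j-1, c1, b] * W[s+i-1, t+j-1, i, j, c1, c2]
    end
    @inbounds c[s, t, c2, b] = acc
end

# Launch, reusing x, Wlocal and the sizes from the first MWE:
c = CUDA.zeros(Float32, img_width - 2, img_height - 2, ch_out, batchsize)
kernel! = convlocal_forward!(CUDADevice(), 256)   # workgroup size is a guess
event = kernel!(c, x, Wlocal; ndrange = size(c))
wait(event)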

Sidenote: an alternative is the fold/unfold-based approach proposed for PyTorch: link1, link2 (though this has remained unreleased for several years now). The problem with this approach is that locally connected layers already require much more memory than the corresponding convolutional layer, and using unfold could make the memory issue worse. On top of that, memory is often the bottleneck on the GPU.