I am using Tullio.jl to implement a locally connected layer, which is like a convolutional layer except that there is no weight sharing. Tullio has been very useful for this, but unfortunately there is a large performance gap between CPU and GPU. On CPU my convlocal function is nearly as fast as NNlib.conv. On GPU the forward pass is fast, but the backward pass is very slow. I could use tips on how to improve the performance of the gradient computation on GPU.
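In index notation (ignoring kernel-flipping conventions), a standard convolution computes

c[s, t, c2, b] = Σ_{i, j, c1} x[s+i-1, t+j-1, c1, b] * W[i, j, c1, c2]

whereas the locally connected layer gives every spatial position its own kernel, so W carries two extra spatial indices:

c[s, t, c2, b] = Σ_{i, j, c1} x[s+i-1, t+j-1, c1, b] * W[s+i-1, t+j-1, i, j, c1, c2]

This is exactly the contraction the @tullio line below expresses.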
Here is an MWE:
using Tullio, NNlib, Flux, BenchmarkTools, CUDA, CUDAKernels, KernelAbstractions
function convlocal(x, W)
    # Tullio sums over i, j and c1; unlike conv, W varies with spatial position (no weight sharing)
    @tullio fastmath=false c[s, t, c2, b] := x[s+i-1, t+j-1, c1, b] * W[s+i-1, t+j-1, i, j, c1, c2]
    return c
end
device = gpu
kernel_width, kernel_height, ch_in, ch_out = 3, 3, 16, 16
img_width, img_height = 28, 28
batchsize = 100
x = device(rand(Float32, img_width, img_height, ch_in, batchsize))
Wconv = device(rand(Float32, kernel_width, kernel_height, ch_in, ch_out))
Wlocal = device(rand(Float32, img_width, img_height, kernel_width, kernel_height, ch_in, ch_out))
ps_conv = Flux.params(Wconv)
ps_local = Flux.params(Wlocal)
@info "Benchmarking convlocal"
@info "forward"
@btime convlocal(x, Wlocal)
@info "backward"
@btime gradient(() -> sum(abs2, convlocal(x, Wlocal)), ps_local)
println("=====================")
@info "Benchmarking NNlib.conv"
@info "forward"
@btime conv(x, Wconv)
@info "backward"
@btime gradient(() -> sum(abs2, NNlib.conv(x, Wconv)), ps_conv)
println("=====================")
On my machine this gives the output:
[ Info: Benchmarking convlocal
[ Info: forward
24.326 μs (151 allocations: 7.19 KiB)
[ Info: backward
3.036 s (629 allocations: 31.00 KiB)
=====================
[ Info: Benchmarking NNlib.conv
[ Info: forward
17.402 μs (75 allocations: 2.75 KiB)
[ Info: backward
325.009 μs (401 allocations: 23.81 KiB)
=====================
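In case it helps with diagnosing this: my understanding is that Tullio derives the backward kernel symbolically from the expression, and (per the keyword options described in its README, assuming I am reading them correctly) it can be asked to print what it generates. A minimal sketch of what I mean:

# Sketch only: per the Tullio README, verbose=true should print the inferred
# index ranges and the symbolic gradient expressions that Tullio derives.
@tullio verbose=true fastmath=false c[s, t, c2, b] := x[s+i-1, t+j-1, c1, b] * W[s+i-1, t+j-1, i, j, c1, c2]

If someone can point out what in that derived gradient kernel is pathological on GPU (or a better way to write the contraction), that would be much appreciated.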