I am using Tullio.jl to implement a locally connected layer, which is like a convolutional layer except without any weight sharing. Tullio has been very useful for this, but there is a large performance gap between CPU and GPU. On CPU my `convlocal` function is nearly as fast as `NNlib.conv`. On GPU the forward pass is fast, but the backward pass is very slow. I could use tips on how to improve the performance of the gradient computation on GPU.
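To make the "no weight sharing" point concrete, here is a quick parameter count for the layer sizes used in the MWE below (plain Julia arithmetic, purely illustrative):

```julia
# Sizes matching the MWE: 3×3 kernel, 16 → 16 channels, 28×28 image.
kw, kh, cin, cout = 3, 3, 16, 16
w, h = 28, 28

# A conv layer shares one kernel across all spatial positions,
# while a locally connected layer has a separate kernel per position.
conv_params  = kw * kh * cin * cout       # 2304
local_params = w * h * kw * kh * cin * cout  # 1_806_336

println((conv = conv_params, local = local_params))
```

So the locally connected weight array is a factor of `w * h` larger, which is why `Wlocal` below carries two extra spatial dimensions compared to `Wconv`.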
Here is a MWE:
```julia
using Tullio, NNlib, Flux, BenchmarkTools, CUDA, CUDAKernels, KernelAbstractions

function convlocal(x, W)
    @tullio fastmath=false c[s, t, c2, b] := x[s+i-1, t+j-1, c1, b] * W[s+i-1, t+j-1, i, j, c1, c2]
    return c
end

device = gpu

kernel_width, kernel_height, ch_in, ch_out = 3, 3, 16, 16
img_width, img_height = 28, 28
batchsize = 100

x = device(rand(Float32, img_width, img_height, ch_in, batchsize))
Wconv = device(rand(Float32, kernel_width, kernel_height, ch_in, ch_out))
Wlocal = device(rand(Float32, img_width, img_height, kernel_width, kernel_height, ch_in, ch_out))

ps_conv = Flux.params(Wconv)
ps_local = Flux.params(Wlocal)

@info "Benchmarking convlocal"
@info "forward"
@btime convlocal(x, Wlocal)
@info "backward"
@btime gradient(() -> sum(abs2, convlocal(x, Wlocal)), ps_local)
println("=====================")

@info "Benchmarking NNlib.conv"
@info "forward"
@btime conv(x, Wconv)
@info "backward"
@btime gradient(() -> sum(abs2, NNlib.conv(x, Wconv)), ps_conv)
println("=====================")
```
Which on my machine gives the output:
```
[ Info: Benchmarking convlocal
[ Info: forward
  24.326 μs (151 allocations: 7.19 KiB)
[ Info: backward
  3.036 s (629 allocations: 31.00 KiB)
=====================
[ Info: Benchmarking NNlib.conv
[ Info: forward
  17.402 μs (75 allocations: 2.75 KiB)
[ Info: backward
  325.009 μs (401 allocations: 23.81 KiB)
=====================
```