We have a rich collection of AD, GPU, and ML tools, and it may be useful to have common benchmarks to find performance bottlenecks, share optimization tricks, and push these tools forward. I started fiddling around in Klutz.jl; so far it has only an MLP and a CNN notebook, each implemented using both Flux and Knet (an RNN notebook and a GPUArray benchmark are planned next):
        MLP     CNN    MLPcpu   CNNcpu
Knet   0.58s   2.77s    3.67s   176.9s
Flux   2.40s   9.77s    8.41s    67.3s
My Flux is rusty, so I may not have the most efficient implementations. I took the examples from the Flux model-zoo and tried to define equivalent Knet models, keeping them as similar as practical, e.g.:
m = Chain(Dense(28^2, 32, relu),Dense(32, 10),softmax) |> gpu # Flux
km = kChain(kDense(28^2, 32, Knet.relu),kDense(32, 10)) # Knet
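kChain and kDense are not part of Knet itself; here is a minimal sketch of how they can be defined (following the style of the Knet tutorials; the actual definitions used for the numbers above are in the Klutz.jl notebooks):

```julia
using Knet  # param/param0 create trainable arrays (KnetArrays when a GPU is available)

struct kDense; w; b; f; end
kDense(i::Int, o::Int, f=identity) = kDense(param(o, i), param0(o), f)
(d::kDense)(x) = d.f.(d.w * x .+ d.b)   # affine layer with elementwise activation

struct kChain; layers; end
kChain(layers...) = kChain(layers)
(c::kChain)(x) = (for l in c.layers; x = l(x); end; x)   # apply layers in sequence
```

Note that the Knet chain omits the final softmax because Knet's loss functions (e.g. `nll`) work on unnormalized scores.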
To help with profiling, AutoGrad and Knet use TimerOutputs.jl if the environment variables AUTOGRAD_TIMER and KNET_TIMER are defined at compile time (the Julia profiler crashes with GPU code for some reason I still don't understand). It may be helpful to do something similar in Flux to see where the extra time is spent. The AutoGrad timer gives information about tape operations (entries with [n] indicate the backward step with respect to the n'th argument).
julia> AutoGrad.to
──────────────────────────────────────────────────────────────────────────────────────
                                                Time                   Allocations
                                        ──────────────────────   ───────────────────────
       Tot / % measured:                     907ms / 73.0%           240MiB / 59.3%

 Section                       ncalls     time   %tot      avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────────────────────────
 *[1]                             402    158ms  23.8%    393μs    283KiB  0.19%        -
 Knet.A_mul_Bt                    201   36.0ms  5.44%    179μs    104KiB  0.07%        -
 *                                402    113ms  17.0%    281μs    208KiB  0.14%        -
 +.[2]                            402   74.1ms  11.2%    184μs    302KiB  0.21%        -
 Knet.cudnnSoftmaxForward         201   55.7ms  8.41%    277μs    219KiB  0.15%  1.09KiB
 getindex                         201   54.6ms  8.24%    272μs   92.4MiB  64.9%   471KiB
 sum_outgrads                   3.22k   51.0ms  7.70%   15.9μs   46.3MiB  32.5%  14.7KiB
 Knet.cudnnSoftmaxForward[1]      201   44.8ms  6.77%    223μs    295KiB  0.20%  1.47KiB
 ...
The Knet timer gives information about which GPU operations take the most time:
julia> Knet.to
────────────────────────────────────────────────────────────────────────────────────────
                                                  Time                   Allocations
                                          ──────────────────────   ───────────────────────
       Tot / % measured:                       909ms / 69.2%           240MiB / 0.59%

 Section                         ncalls     time   %tot      avg     alloc   %tot      avg
────────────────────────────────────────────────────────────────────────────────────────
 cublasSgemm_v2                   1.00k    271ms  43.0%    269μs    126KiB  8.62%        -
 sum_32_21                          402   66.9ms  10.6%    166μs   25.1KiB  1.72%        -
 cudnnSoftmaxForward                201   53.6ms  8.53%    267μs    171KiB  11.7%        -
   cudnnCreateTensorDescriptor      402   1.55ms  0.25%   3.86μs        -  0.00%        -
   cudnnSetTensorNdDescriptor       402   1.29ms  0.20%   3.21μs        -  0.00%        -
 cudnnSoftmaxBackward               201   42.2ms  6.71%    210μs    228KiB  15.6%  1.13KiB
   cudnnCreateTensorDescriptor      603   2.31ms  0.37%   3.83μs        -  0.00%        -
   cudnnSetTensorNdDescriptor       603   1.91ms  0.30%   3.17μs        -  0.00%        -
 cublasSaxpy_v2                   2.40k   30.8ms  4.89%   12.8μs    225KiB  15.4%        -
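Both tables are ordinary TimerOutputs.jl objects, so the usual TimerOutputs API applies if you want to separate warm-up from the measured run; a sketch (`reset_timer!` and `print_timer` are from TimerOutputs.jl):

```julia
using TimerOutputs, AutoGrad, Knet

reset_timer!(AutoGrad.to)   # clear timings accumulated during warm-up/compilation
reset_timer!(Knet.to)
# ... run the model for the measured epochs ...
print_timer(AutoGrad.to)    # prints the tape-operation table shown above
print_timer(Knet.to)        # prints the GPU-operation table
```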
Please feel free to contribute improvements or new notebooks that compare AD, GPU and/or ML tools.
best,
deniz