Knet vs Flux etc

denizyuret · November 1, 2018, 5:26pm

We have a rich collection of AD, GPU and ML tools and it may be useful to have common benchmarks to find performance bottlenecks, share optimization tricks and push them forward. I started fiddling around in Klutz.jl, so far with only an MLP and a CNN notebook implemented using Flux and Knet (planning an RNN notebook and a GPUArray benchmark next):

      MLP    CNN    MLPcpu CNNcpu
Knet  0.58s  2.77s  3.67s  176.9s
Flux  2.40s  9.77s  8.41s  67.3s

My Flux is rusty so I may not have the most efficient implementation. I got the examples from the Flux model-zoo and tried to define equivalent Knet models as similar as practical, e.g.:

m = Chain(Dense(28^2, 32, relu),Dense(32, 10),softmax) |> gpu  # Flux
km = kChain(kDense(28^2, 32, Knet.relu),kDense(32, 10))  # Knet
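
The post does not spell out `kChain` and `kDense`; presumably they are small helper types defined in the notebook. Here is a minimal sketch of what such helpers could look like, in plain Julia on CPU arrays with no Knet dependency. The `Sketch` names, the field names and the weight initialization are assumptions for illustration, not the notebook's actual code.

```julia
# Hypothetical stand-ins for the kDense/kChain helpers; not the notebook's code.
struct SketchDense
    w::Matrix{Float32}
    b::Vector{Float32}
    f::Function          # activation, applied elementwise
end

# Constructor taking (input size, output size, activation), like Dense(28^2, 32, relu).
SketchDense(nin::Int, nout::Int, f=identity) =
    SketchDense(0.1f0 .* randn(Float32, nout, nin), zeros(Float32, nout), f)

# Forward pass: affine map followed by the activation.
(d::SketchDense)(x) = d.f.(d.w * x .+ d.b)

struct SketchChain
    layers::Tuple
end
SketchChain(layers...) = SketchChain(layers)

# Apply the layers left to right.
(c::SketchChain)(x) = foldl((h, layer) -> layer(h), c.layers; init=x)

sketch_relu(x) = max(zero(x), x)

km = SketchChain(SketchDense(28^2, 32, sketch_relu), SketchDense(32, 10))
scores = km(rand(Float32, 28^2, 4))   # one column of 10 class scores per input
```

Both layer types are callable structs, which is the same pattern as the three-line custom layer later in the thread.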

To help with profiling, AutoGrad and Knet use TimerOutputs.jl if the environment variables AUTOGRAD_TIMER and KNET_TIMER are defined at compile time (the Julia profiler crashes with GPU code for some reason I still don’t understand). It may be helpful to do something similar in Flux to see where the extra time is spent. The AutoGrad timer gives information about tape operations (entries with [n] indicate the backward step with respect to the n’th argument).
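
The compile-time switch can be sketched without TimerOutputs.jl. The macro below checks an environment variable when it is expanded and either emits the bare expression or wraps it in timing code. The names `SKETCH_TIMER`, `@maybe_time` and `timings` are hypothetical, and this is not how AutoGrad or Knet implement their timers.

```julia
# Hypothetical switch, read once when this file is loaded.
const SKETCH_TIMER = haskey(ENV, "SKETCH_TIMER")

# label => (number of calls, accumulated seconds)
const timings = Dict{String,Tuple{Int,Float64}}()

macro maybe_time(label, ex)
    # When the switch is off, expand to the bare expression: no runtime overhead.
    SKETCH_TIMER || return esc(ex)
    quote
        local t0 = time_ns()
        local val = $(esc(ex))
        local old = get(timings, $(esc(label)), (0, 0.0))
        timings[$(esc(label))] = (old[1] + 1, old[2] + (time_ns() - t0) / 1e9)
        val
    end
end

# The decision is baked in when this method is defined.
timed_matmul(w, x) = @maybe_time "matmul" w * x

y = timed_matmul([1 2; 3 4], [1, 1])
```

Because the check happens at expansion time, changing the environment variable later has no effect until the code is reloaded. That matches the "defined at compile time" requirement described above.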

julia> AutoGrad.to
──────────────────────────────────────────────────────────────────────────────────────
                                               Time                   Allocations      
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                907ms / 73.0%            240MiB / 59.3%    

 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 *[1]                             402    158ms  23.8%   393μs    283KiB  0.19%        -
   Knet.A_mul_Bt                  201   36.0ms  5.44%   179μs    104KiB  0.07%        -
 *                                402    113ms  17.0%   281μs    208KiB  0.14%        -
 +.[2]                            402   74.1ms  11.2%   184μs    302KiB  0.21%        -
 Knet.cudnnSoftmaxForward         201   55.7ms  8.41%   277μs    219KiB  0.15%  1.09KiB
 getindex                         201   54.6ms  8.24%   272μs   92.4MiB  64.9%   471KiB
 sum_outgrads                   3.22k   51.0ms  7.70%  15.9μs   46.3MiB  32.5%  14.7KiB
 Knet.cudnnSoftmaxForward[1]      201   44.8ms  6.77%   223μs    295KiB  0.20%  1.47KiB
...

The Knet timer gives information about which GPU operations take the most time:

julia> Knet.to
────────────────────────────────────────────────────────────────────────────────────────
                                                 Time                   Allocations      
                                         ──────────────────────   ───────────────────────
            Tot / % measured:                 909ms / 69.2%            240MiB / 0.59%    

 Section                         ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────────
 cublasSgemm_v2                   1.00k    271ms  43.0%   269μs    126KiB  8.62%        -
 sum_32_21                          402   66.9ms  10.6%   166μs   25.1KiB  1.72%        -
 cudnnSoftmaxForward                201   53.6ms  8.53%   267μs    171KiB  11.7%        -
   cudnnCreateTensorDescriptor      402   1.55ms  0.25%  3.86μs         -  0.00%        -
   cudnnSetTensorNdDescriptor       402   1.29ms  0.20%  3.21μs         -  0.00%        -
 cudnnSoftmaxBackward               201   42.2ms  6.71%   210μs    228KiB  15.6%  1.13KiB
   cudnnCreateTensorDescriptor      603   2.31ms  0.37%  3.83μs         -  0.00%        -
   cudnnSetTensorNdDescriptor       603   1.91ms  0.30%  3.17μs         -  0.00%        -
 cublasSaxpy_v2                   2.40k   30.8ms  4.89%  12.8μs    225KiB  15.4%        -

Please feel free to contribute improvements or new notebooks that compare AD, GPU and/or ML tools.

best,
deniz

rickhg12hs · November 1, 2018, 6:02pm

This is a great start! It would also be nice to compare CPU-only performance.

maleadt · November 1, 2018, 7:20pm

Which OS & version of Julia? Should be fixed on Linux: see "Ignore SEGV during profiler unwind on Unix" (Pull Request #28291, JuliaLang/julia).

denizyuret · November 1, 2018, 7:43pm

I still get segfaults on Linux with Julia 1.0.1, although not consistently and possibly for a different reason: see profile-segfault.log on GitHub.

kristoffer.carlsson · November 1, 2018, 7:47pm

https://github.com/JuliaLang/julia/issues/28648

denizyuret · November 1, 2018, 8:56pm

Thanks for the suggestion! It would be nice to distinguish factors related to GPU usage from general code optimization. Just added cpu tests to the repo and updated the table in the original post.

xiaodai · November 2, 2018, 1:54am

Would it be fair to say that it’s harder to define custom layers in Knet vs Flux? For example, I looked at the code for Knet once, and max pooling is selected by setting a parameter to 1 and mean pooling by setting it to 0 (or something like that).

denizyuret · November 2, 2018, 2:23am

The low-level functions conv4 and pool in Knet provide access to even lower-level cuDNN library functions, making the call signatures a bit more palatable while keeping all the low-level options available with reasonable defaults (and also providing CPU replacements). Your observation is accurate: the pool function takes a mode keyword argument. I suspect in a complex model you would start by defining a higher-level interface, maybe something like this:

struct Conv; w; b; end
(f::Conv)(x) = pool(conv4(f.w,x) .+ f.b)
Conv(w1,w2,cx,cy) = Conv(param(w1,w2,cx,cy), param0(1,1,cy,1))

Defining a custom layer is basically three lines of code in both Knet and Flux.
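
To illustrate the `mode` keyword idea without Knet or a GPU, here is a plain-Julia sketch of non-overlapping pooling where a keyword selects the reduction. The function name and the `:max`/`:mean` symbols are hypothetical; Knet's `pool` uses integer codes, which should be checked in its documentation.

```julia
# Plain-Julia sketch of non-overlapping pooling with a mode switch.
# `sketch_pool` and its symbol-valued `mode` are illustrative assumptions.
function sketch_pool(x::AbstractMatrix; window::Int=2, mode::Symbol=:max)
    reduce_fn = mode == :max  ? maximum :
                mode == :mean ? (v -> sum(v) / length(v)) :
                throw(ArgumentError("unknown mode: $mode"))
    rows, cols = size(x) .÷ window          # any ragged border is dropped
    out = Matrix{Float64}(undef, rows, cols)
    for j in 1:cols, i in 1:rows
        r = (i - 1) * window + 1 : i * window
        c = (j - 1) * window + 1 : j * window
        out[i, j] = reduce_fn(view(x, r, c))
    end
    return out
end

x = [1 2 5 6;
     3 4 7 8]
pooled_max  = sketch_pool(x)                # default mode is :max
pooled_mean = sketch_pool(x; mode=:mean)
```

A higher-level layer like the `Conv` struct above can then fix the mode once, so model code never touches the low-level option.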

MikeInnes · November 9, 2018, 1:01pm

This is great @denizyuret, thanks for putting it together. I’m starting to put more time into performance work generally and this will help narrow down priorities; there are definitely a couple of things that are easy to fix but will give a nice speedup on this kind of thing (e.g.).

One suggestion I have, in terms of sharing more code in the ecosystem, is to try moving Knet to CuArrays and benchmarking that. With the whole Flux stack there are a lot of moving parts, but you could just e.g. move to CuArrays as an allocator and keep the Knet GPU kernels, which would make it easy to see where performance changes are coming from. Ping me if I can help with this, and I’ll try to have a go at it over the next couple of weeks.

denizyuret · November 9, 2018, 2:51pm

Sounds good. I think there are three sources of potential speed-up:

AD: AutoGrad vs Flux vs Zygote vs Capstan etc.
Alloc: KnetArray vs CuArray vs CPU etc.
Kernels: Knet kernels vs CUDAnative/Flux vs CPU etc.

My GPU experiments vary all 3 components, which makes it difficult to pinpoint causes. My CPU experiments only vary the AD, so that can give us some clues right away. I think I can easily run Knet with CuArray alloc/kernels which should give another AD comparison. Your suggestion of using CuArray allocator with Knet kernels should highlight allocator differences. This is a bit more difficult to implement (the kernels dispatch based on the KnetArray type) but doable. We can probably figure out other combinations of the above three components that will inform the optimization work.
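
The point about kernels dispatching on the array type can be sketched in plain Julia. The two wrapper types below are hypothetical stand-ins for arrays from different allocators, and `sketch_axpy!` stands in for a kernel. Nothing here is Knet or CuArrays code.

```julia
# Hypothetical stand-ins for two array types backed by different allocators.
struct ArrayA          # plays the role of KnetArray
    data::Vector{Float32}
end
struct ArrayB          # plays the role of CuArray
    data::Vector{Float32}
end

# A "kernel" that only has a method for ArrayA: y = a*x + y, in place.
function sketch_axpy!(a, x::ArrayA, y::ArrayA)
    y.data .+= a .* x.data
    return y
end

# To run the same kernel on ArrayB storage, one option is a forwarding method
# that rewraps the underlying buffers without copying them.
function sketch_axpy!(a, x::ArrayB, y::ArrayB)
    sketch_axpy!(a, ArrayA(x.data), ArrayA(y.data))
    return y
end

x = ArrayB(Float32[1, 2, 3])
y = ArrayB(Float32[10, 20, 30])
sketch_axpy!(2, x, y)   # updates y.data in place
```

Without the second method, the call on `ArrayB` values would raise a `MethodError`. That is the extra work implied by keeping one library's kernels while switching to another library's array type.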

MikeInnes · November 9, 2018, 3:33pm

They also vary quite a bit in e.g. the implementations we have for broadcast (Flux’s mixed-mode vs Knet’s fission) and some kernels (e.g. NNlib’s pure-Julia convolutions vs Knet’s threaded C++ ones, although NNlib is soon to move to NNPACK). Broadcasting is pretty sub-optimal right now which is probably the main reason for the MLP benchmark difference.
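
One reading of the broadcast difference is fused versus split elementwise operations. The CPU-only sketch below shows the two forms giving the same result for a dense layer's `relu.(w*x .+ b)`. The names are hypothetical, and it does not model mixed-mode AD or either library's actual GPU broadcast code.

```julia
w = Float32[1 -2; 3 4]
x = Float32[1, 1]
b = Float32[0.5, -0.5]
sketch_relu(v) = max(zero(v), v)

# Fused: the dotted operations are combined into a single loop over the output.
fused = sketch_relu.(w * x .+ b)

# Split: each elementwise step is a separate pass with its own temporary,
# standing in for one kernel launch per operation.
tmp = w * x
tmp = tmp .+ b
unfused = sketch_relu.(tmp)
```

The results are identical. The difference is in the number of passes over memory and intermediate allocations, which is where the cost shows up in a small model like the MLP.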

I think you have all the main issues laid out though. My expectation here is that whereas a lot of the infrastructure on the Flux side is well optimised (e.g. AD, especially with Zygote), Knet does a better job with the “long tail” of well-optimised kernels for all the key operations (a bad conv can crush you even with the best AD in the world). But the combination of the two will be unstoppable.
