Knet vs Flux etc

We have a rich collection of AD, GPU and ML tools and it may be useful to have common benchmarks to find performance bottlenecks, share optimization tricks and push them forward. I started fiddling around in Klutz.jl, so far with only an MLP and a CNN notebook implemented using Flux and Knet (planning an RNN notebook and a GPUArray benchmark next):

      MLP    CNN    MLPcpu CNNcpu
Knet  0.58s  2.77s  3.67s  176.9s
Flux  2.40s  9.77s  8.41s  67.3s
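
To make the relative differences easier to read, the timings above can be turned into Flux/Knet ratios. This is only a small arithmetic sketch over the numbers in the table; the variable names are made up for illustration.

```julia
# Timings in seconds, copied from the table above.
benchmarks = ["MLP", "CNN", "MLPcpu", "CNNcpu"]
knet = [0.58, 2.77, 3.67, 176.9]
flux = [2.40, 9.77, 8.41, 67.3]

# Ratio > 1 means Knet was faster, ratio < 1 means Flux was faster.
ratio = flux ./ knet

for (name, r) in zip(benchmarks, ratio)
    println(rpad(name, 8), round(r, digits=2))
end
```

The ratios come out at roughly 4.1, 3.5 and 2.3 for the first three columns and about 0.38 for CNNcpu, the one case where Flux is ahead.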

My Flux is rusty so I may not have the most efficient implementation. I got the examples from the Flux model-zoo and tried to define equivalent Knet models as similar as practical, e.g.:

m = Chain(Dense(28^2, 32, relu),Dense(32, 10),softmax) |> gpu  # Flux
km = kChain(kDense(28^2, 32, Knet.relu),kDense(32, 10))  # Knet

To help with profiling, AutoGrad and Knet use TimerOutputs.jl if the environment variables AUTOGRAD_TIMER and KNET_TIMER are defined at compile time (the Julia profiler crashes with GPU code for some reason I still don’t understand). It may be helpful to do something similar in Flux to see where the extra time is spent. The AutoGrad timer gives information about tape operations (entries with [n] indicate the backward step with respect to the n’th argument).

                                               Time                   Allocations      
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                907ms / 73.0%            240MiB / 59.3%    

 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 *[1]                             402    158ms  23.8%   393μs    283KiB  0.19%        -
   Knet.A_mul_Bt                  201   36.0ms  5.44%   179μs    104KiB  0.07%        -
 *                                402    113ms  17.0%   281μs    208KiB  0.14%        -
 +.[2]                            402   74.1ms  11.2%   184μs    302KiB  0.21%        -
 Knet.cudnnSoftmaxForward         201   55.7ms  8.41%   277μs    219KiB  0.15%  1.09KiB
 getindex                         201   54.6ms  8.24%   272μs   92.4MiB  64.9%   471KiB
 sum_outgrads                   3.22k   51.0ms  7.70%  15.9μs   46.3MiB  32.5%  14.7KiB
 Knet.cudnnSoftmaxForward[1]      201   44.8ms  6.77%   223μs    295KiB  0.20%  1.47KiB

The Knet timer gives information about which GPU operations take the most time:

                                                 Time                   Allocations      
                                         ──────────────────────   ───────────────────────
            Tot / % measured:                 909ms / 69.2%            240MiB / 0.59%    

 Section                         ncalls     time   %tot     avg     alloc   %tot      avg
 cublasSgemm_v2                   1.00k    271ms  43.0%   269μs    126KiB  8.62%        -
 sum_32_21                          402   66.9ms  10.6%   166μs   25.1KiB  1.72%        -
 cudnnSoftmaxForward                201   53.6ms  8.53%   267μs    171KiB  11.7%        -
   cudnnCreateTensorDescriptor      402   1.55ms  0.25%  3.86μs         -  0.00%        -
   cudnnSetTensorNdDescriptor       402   1.29ms  0.20%  3.21μs         -  0.00%        -
 cudnnSoftmaxBackward               201   42.2ms  6.71%   210μs    228KiB  15.6%  1.13KiB
   cudnnCreateTensorDescriptor      603   2.31ms  0.37%  3.83μs         -  0.00%        -
   cudnnSetTensorNdDescriptor       603   1.91ms  0.30%  3.17μs         -  0.00%        -
 cublasSaxpy_v2                   2.40k   30.8ms  4.89%  12.8μs    225KiB  15.4%        -
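
For anyone who wants to try the same trick elsewhere, here is a rough sketch of a timer that is switched on by an environment variable when the code is loaded and compiles away otherwise. It uses only Base and is not how AutoGrad or Knet actually wire in TimerOutputs.jl; the variable name SKETCH_TIMER, the macro `@sketch_timeit` and the `TIMINGS` table are all hypothetical.

```julia
# Hypothetical switch: read once, when this code is loaded.
const TIMER_ON = haskey(ENV, "SKETCH_TIMER")

# label => (number of calls, total seconds)
const TIMINGS = Dict{String,Tuple{Int,Float64}}()

macro sketch_timeit(label, ex)
    # Decided at macro-expansion time: with the switch off, the
    # expression is returned untouched and costs nothing at run time.
    TIMER_ON || return esc(ex)
    return quote
        t0 = time_ns()
        val = $(esc(ex))
        elapsed = (time_ns() - t0) / 1e9
        n, total = get(TIMINGS, $(esc(label)), (0, 0.0))
        TIMINGS[$(esc(label))] = (n + 1, total + elapsed)
        val
    end
end

# Works the same whether or not the timer is enabled.
y = @sketch_timeit "matmul" rand(64, 64) * rand(64, 64)

for (label, (n, total)) in TIMINGS
    println(rpad(label, 12), n, " calls, ", total, " s")
end
```

Because the switch is read when the macro is expanded, changing the environment variable later has no effect until the code is reloaded, which matches the "defined at compile time" caveat above.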

Please feel free to contribute improvements or new notebooks that compare AD, GPU and/or ML tools.



This is a great start! It would also be nice to compare CPU-only performance.

Which OS and version of Julia? It should be fixed on Linux.

I still get segfaults on Linux with Julia 1.0.1, although not consistently and possibly for a different reason:

Thanks for the suggestion! It would be nice to distinguish factors related to GPU usage from general code optimization. I just added CPU tests to the repo and updated the table in the original post.

Would it be fair to say that it’s harder to define custom layers in Knet than in Flux? For example, I looked at the Knet code once, and max pooling is selected by setting a parameter to 1 and mean pooling by setting it to 0 (or something like that).

The low-level functions conv4 and pool in Knet provide access to even lower-level cuDNN library functions, making the call signatures a bit more palatable while keeping all the low-level options available with reasonable defaults (and providing CPU replacements). Your observation is accurate: the pool function takes a mode keyword argument. I suspect that in a complex model you would start by defining a higher-level interface, maybe something like this:

struct Conv; w; b; end
(f::Conv)(x) = pool(conv4(f.w,x) .+ f.b)
Conv(w1,w2,cx,cy) = Conv(param(w1,w2,cx,cy), param0(1,1,cy,1))

Defining a custom layer is basically three lines of code in both Knet and Flux.
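
The same pattern can hide a mode argument behind named layer types. Below is a minimal sketch in plain Julia, without Knet or Flux: `pool1d`, `MaxPool` and `MeanPool` are hypothetical names, and the toy 1-D pooling function only stands in for a real low-level routine with a mode keyword.

```julia
# Toy stand-in for a low-level pooling routine with a mode keyword.
function pool1d(x::AbstractVector; window::Int=2, mode::Symbol=:max)
    reducer = if mode === :max
        maximum
    elseif mode === :mean
        v -> sum(v) / length(v)
    else
        throw(ArgumentError("unknown mode: $mode"))
    end
    nwin = length(x) ÷ window   # trailing elements that do not fill a window are dropped
    return [reducer(view(x, (i - 1) * window + 1 : i * window)) for i in 1:nwin]
end

# Higher-level layers: the mode is fixed by the type, not by the caller.
struct MaxPool; window::Int; end
struct MeanPool; window::Int; end
(p::MaxPool)(x) = pool1d(x; window=p.window, mode=:max)
(p::MeanPool)(x) = pool1d(x; window=p.window, mode=:mean)

x = [1.0, 3.0, 2.0, 5.0]
println(MaxPool(2)(x))    # maximum of each window of two
println(MeanPool(2)(x))   # mean of each window of two
```

The low-level function keeps all its options, while model code only ever sees the two named layers.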


This is great @denizyuret, thanks for putting it together. I’m starting to put more time into performance work generally and this will help narrow down priorities – there are definitely a couple of things that are easy to fix but will give a nice speedup on this kind of thing (e.g.).

One suggestion I have, in terms of sharing more code in the ecosystem, is to try moving Knet to CuArrays and benchmarking that. With the whole Flux stack there are a lot of moving parts, but you could just e.g. move to CuArrays as an allocator and keep the Knet GPU kernels, which would make it easy to see where performance changes are coming from. Ping me if I can help with this, and I’ll try to have a go at it over the next couple of weeks.

Sounds good. I think there are three sources of potential speed-up:

  • AD: AutoGrad vs Flux vs Zygote vs Capstan etc.
  • Alloc: KnetArray vs CuArray vs CPU etc.
  • Kernels: Knet kernels vs CUDANative/Flux vs CPU etc.

My GPU experiments vary all 3 components, which makes it difficult to pinpoint causes. My CPU experiments only vary the AD, so that can give us some clues right away. I think I can easily run Knet with CuArray alloc/kernels which should give another AD comparison. Your suggestion of using CuArray allocator with Knet kernels should highlight allocator differences. This is a bit more difficult to implement (the kernels dispatch based on the KnetArray type) but doable. We can probably figure out other combinations of the above three components that will inform the optimization work.
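
One way to keep track of which comparison isolates which factor is to write the configurations down and diff them. This is only a bookkeeping sketch: the labels reuse the three factors listed above, the CPU rows take the "only the AD differs" reading at face value, and `configs` and `differing` are names made up for the example.

```julia
# Each experiment described by its three components.
configs = [
    "Knet GPU" => (ad="AutoGrad", alloc="KnetArray", kernels="Knet"),
    "Flux GPU" => (ad="Flux", alloc="CuArray", kernels="CUDANative/Flux"),
    "Knet CPU" => (ad="AutoGrad", alloc="CPU", kernels="CPU"),
    "Flux CPU" => (ad="Flux", alloc="CPU", kernels="CPU"),
    "Knet + CuArray alloc/kernels" => (ad="AutoGrad", alloc="CuArray", kernels="CUDANative/Flux"),
    "Knet kernels + CuArray alloc" => (ad="AutoGrad", alloc="CuArray", kernels="Knet"),
]

# Names of the components on which two configurations disagree.
differing(a, b) = [k for k in keys(a) if a[k] != b[k]]

# Print every pair that differs in exactly one component.
for i in 1:length(configs), j in i+1:length(configs)
    (na, a), (nb, b) = configs[i], configs[j]
    d = differing(a, b)
    length(d) == 1 && println(na, " vs ", nb, " isolates ", d[1])
end
```

Pairs that differ in a single component are the ones worth benchmarking first, since any timing gap can be attributed to that component.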

They also vary quite a bit in e.g. the implementations we have for broadcast (Flux’s mixed-mode vs Knet’s fission) and some kernels (e.g. NNlib’s pure-Julia convolutions vs Knet’s threaded C++ ones, although NNlib is soon to move to NNPACK). Broadcasting is pretty sub-optimal right now which is probably the main reason for the MLP benchmark difference.
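
As a rough CPU-only illustration of why the broadcast strategy matters for an MLP, compare a fused broadcast with the same computation split into separate passes. This is plain Julia on ordinary arrays and is not how either Flux or Knet implements GPU broadcasting; `fused` and `split_steps` are hypothetical names.

```julia
relu(x) = max(x, zero(x))

# Fused: the dotted calls combine into one loop that writes one output array.
fused(z, b) = relu.(z .+ b)

# Split: each step is its own pass and materializes its own array.
function split_steps(z, b)
    t = z .+ b         # first pass, temporary array
    return relu.(t)    # second pass, output array
end

z = randn(32, 100)
b = randn(32)

# Warm up so compilation is not counted, then compare allocations.
fused(z, b); split_steps(z, b)
println("fused bytes: ", @allocated fused(z, b))
println("split bytes: ", @allocated split_steps(z, b))
```

Both versions return identical results; the split version simply touches memory more often, which is the kind of overhead that shows up when every layer does a bias add followed by an activation.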

I think you have all the main issues laid out though. My expectation here is that whereas a lot of the infrastructure on the Flux side is well optimised (e.g. AD, especially with Zygote), Knet does a better job with the “long tail” of well-optimised kernels for all the key operations (a bad conv can crush you even with the best AD in the world). But the combination of the two will be unstoppable :slight_smile: