Knet 1.1.1 is out: performance improvements, new monitoring tools, new GPU memory manager

Knet 1.1.0 introduced a new interface that allows the use of structs in models and callable objects for model and layer definitions.

Knet 1.1.1 focuses on performance. There are a number of performance improvements, the most important of which is a new GPU memory manager. GPU memory use is reduced by up to 50%, which should allow larger models and larger batch sizes.

While working on performance improvements I also developed some monitoring tools. Julia frequently crashes when profiling GPU code, so I decided to use TimerOutputs instead. If the KNET_TIMER environment variable is set when Knet is built, the timing code is compiled in and a timer variable in Knet holds timing information for all GPU calls. Similarly, the AUTOGRAD_TIMER environment variable controls whether AutoGrad records timing information for the forward and backward passes over the tape in its own timer variable. Here is what sample outputs look like, first for AutoGrad and then for Knet:



 ───────────────────────────────────────────────────────────────────────
                                Time                   Allocations
                        ──────────────────────   ───────────────────────
    Tot / % measured:        4.62s / 30.4%            546MiB / 25.0%

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────
 +.[2]               1    328ms  23.3%   328ms   46.4MiB  34.1%  46.4MiB
 sum[2]              1    288ms  20.5%   288ms   40.0MiB  29.4%  40.0MiB
   *                 1   38.8ms  2.76%  38.8ms    595KiB  0.43%   595KiB
 *                   1    269ms  19.2%   269ms    955KiB  0.68%   955KiB
 +.                  1    139ms  9.92%   139ms   20.4MiB  15.0%  20.4MiB
 *[1]                1    117ms  8.33%   117ms   9.41MiB  6.90%  9.41MiB
 record              4   88.7ms  6.31%  22.2ms   3.49MiB  2.56%   894KiB
 -[1]                1   65.9ms  4.69%  65.9ms   10.0MiB  7.32%  10.0MiB
 -                   1   55.8ms  3.97%  55.8ms    929KiB  0.67%   929KiB
 sum                 1   50.0ms  3.56%  50.0ms   4.68MiB  3.44%  4.68MiB
 +.[1]               1   1.78ms  0.13%  1.78ms   37.7KiB  0.03%  37.7KiB
 sum_outgrads        5   1.41ms  0.10%   282μs   28.2KiB  0.02%  5.64KiB
 ───────────────────────────────────────────────────────────────────────




 ──────────────────────────────────────────────────────────────────────────────────────
                                               Time                   Allocations
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                76.3s / 8.89%           4.10GiB / 0.02%

 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 sum_32_20                        206    4.96s  73.2%  24.1ms   3.22KiB  0.35%        -
 cudaRuntimeGetVersion              1    736ms  10.9%   736ms         -  0.00%        -
 cudaSetDevice                      1    563ms  8.29%   563ms         -  0.00%        -
 cublasSgemm_v2                    96    211ms  3.11%  2.20ms    663KiB  72.1%  6.91KiB
   cublasCreate_v2                  1    166ms  2.44%   166ms         -  0.00%        -
   cublasGetVersion_v2              1   2.95μs  0.00%  2.95μs         -  0.00%        -
 nvmlInit                           1    161ms  2.37%   161ms         -  0.00%        -
 cudaMemcpy                     5.17k   72.0ms  1.06%  13.9μs    191KiB  20.8%        -
 curandCreateGenerator              1   20.0ms  0.29%  20.0ms         -  0.00%        -
 sum_64_20                        602   17.4ms  0.26%  28.9μs   9.41KiB  1.02%        -
 cudaMalloc                       456   9.93ms  0.15%  21.8μs         -  0.00%        -
 ──────────────────────────────────────────────────────────────────────────────────────

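The idea of compiling timing code in only when an environment variable is set can be sketched with TimerOutputs directly. This is a minimal illustration under stated assumptions, not Knet's or AutoGrad's actual implementation: the `DEMO_TIMER` variable, the `@maybe_timeit` macro, the `to` timer, and the `work` function are all hypothetical names made up for this example.

```julia
using TimerOutputs

# Set here only so the sketch is self-contained; normally the variable
# would be exported in the shell before the package is built or loaded.
ENV["DEMO_TIMER"] = "1"

# Hypothetical global timer that collects all measurements.
const to = TimerOutput()

# Decided once, when this code is loaded (analogous to build time).
const TIMER_ENABLED = haskey(ENV, "DEMO_TIMER")

# Hypothetical macro: wrap the expression in @timeit only when timing is
# enabled, so the disabled case contains no timing code at all.
macro maybe_timeit(label, ex)
    if TIMER_ENABLED
        esc(:(TimerOutputs.@timeit to $label $ex))
    else
        esc(ex)
    end
end

# Stand-in for a sequence of GPU calls; each call gets its own section.
function work(n)
    a = rand(n, n)
    b = @maybe_timeit "matmul" a * a
    s = @maybe_timeit "sum" sum(b)
    return s
end

work(100)
print_timer(to)   # prints a table in the same layout as the outputs above
```

For Knet itself, the text above suggests the equivalent step is to set KNET_TIMER in the environment and then rebuild the package, presumably with `Pkg.build("Knet")`, before loading it.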