Knet 1.1.1 is out: performance improvements, new monitoring tools, new GPU memory manager



Knet 1.1.0 introduced a new interface that allows the use of structs in models and callable objects for model and layer definitions.

Knet 1.1.1 focuses on performance. There are a number of performance improvements, the most important of which is a new GPU memory manager. GPU memory use is reduced by up to 50%, which should allow larger models and larger batch sizes.

While working on performance improvements I also developed some monitoring tools. Julia frequently crashes when profiling GPU code, so I decided to use TimerOutputs instead. If the KNET_TIMER environment variable is set while Knet is built, the timing code will be compiled in and the Knet.to variable will hold timing information for all GPU calls. Similarly, the AUTOGRAD_TIMER environment variable controls whether AutoGrad records timing information for the forward and backward passes over the tape in the AutoGrad.to variable. Here is what sample output looks like:

julia> AutoGrad.to
───────────────────────────────────────────────────────────────────────
                                Time                   Allocations      
                        ──────────────────────   ───────────────────────
    Tot / % measured:        4.62s / 30.4%            546MiB / 25.0%    
 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────
 +.[2]               1    328ms  23.3%   328ms   46.4MiB  34.1%  46.4MiB
 sum[2]              1    288ms  20.5%   288ms   40.0MiB  29.4%  40.0MiB
   *                 1   38.8ms  2.76%  38.8ms    595KiB  0.43%   595KiB
 *                   1    269ms  19.2%   269ms    955KiB  0.68%   955KiB
 +.                  1    139ms  9.92%   139ms   20.4MiB  15.0%  20.4MiB
 *[1]                1    117ms  8.33%   117ms   9.41MiB  6.90%  9.41MiB
 record              4   88.7ms  6.31%  22.2ms   3.49MiB  2.56%   894KiB
 -[1]                1   65.9ms  4.69%  65.9ms   10.0MiB  7.32%  10.0MiB
 -                   1   55.8ms  3.97%  55.8ms    929KiB  0.67%   929KiB
 sum                 1   50.0ms  3.56%  50.0ms   4.68MiB  3.44%  4.68MiB
 +.[1]               1   1.78ms  0.13%  1.78ms   37.7KiB  0.03%  37.7KiB
 sum_outgrads        5   1.41ms  0.10%   282μs   28.2KiB  0.02%  5.64KiB
 ───────────────────────────────────────────────────────────────────────

julia> Knet.to
 ──────────────────────────────────────────────────────────────────────────────────────
                                               Time                   Allocations
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                76.3s / 8.89%           4.10GiB / 0.02%
 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 sum_32_20                        206    4.96s  73.2%  24.1ms   3.22KiB  0.35%        -
 cudaRuntimeGetVersion              1    736ms  10.9%   736ms         -  0.00%        -
 cudaSetDevice                      1    563ms  8.29%   563ms         -  0.00%        -
 cublasSgemm_v2                    96    211ms  3.11%  2.20ms    663KiB  72.1%  6.91KiB
   cublasCreate_v2                  1    166ms  2.44%   166ms         -  0.00%        -
   cublasGetVersion_v2              1   2.95μs  0.00%  2.95μs         -  0.00%        -
 nvmlInit                           1    161ms  2.37%   161ms         -  0.00%        -
 cudaMemcpy                     5.17k   72.0ms  1.06%  13.9μs    191KiB  20.8%        -
 curandCreateGenerator              1   20.0ms  0.29%  20.0ms         -  0.00%        -
 sum_64_20                        602   17.4ms  0.26%  28.9μs   9.41KiB  1.02%        -
 cudaMalloc                       456   9.93ms  0.15%  21.8μs         -  0.00%        -
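
For anyone who wants to try this, here is a minimal sketch of how the timers might be switched on. The post only says the variables need to be set at build time, so the value "true" is an assumption, as is rebuilding AutoGrad with Pkg.build in the same way as Knet. Run the build step once, then start a fresh Julia session for the measurement.

```julia
using Pkg

# Assumption: only the presence of these variables matters; "true" is arbitrary.
ENV["KNET_TIMER"] = "true"
ENV["AUTOGRAD_TIMER"] = "true"

# Rebuild so the timing code is compiled in (the build inherits ENV from this session).
Pkg.build("AutoGrad")
Pkg.build("Knet")

# --- in a fresh Julia session, after running the model code you want to measure ---
using Knet, AutoGrad

show(AutoGrad.to)   # forward/backward timings recorded over the tape
println()
show(Knet.to)       # timings for GPU calls
```

To go back to the untimed build, the same steps with the two variables unset should apply, since the timing code is only compiled in when they are present at build time.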
