Knet 1.1.0 introduced a new interface that allows structs in models and callable objects in model and layer definitions.
Knet 1.1.1 focuses on performance. There are a number of performance improvements, the most important of which is a new GPU memory manager. GPU memory use is reduced by up to 50%, which should allow larger models and larger batch sizes.
While working on performance improvements I developed some monitoring tools as well. Julia frequently crashes when profiling GPU code, so I decided to use TimerOutputs instead. If the KNET_TIMER environment variable is set while Knet is built, the timing code is compiled in and the Knet.to variable holds timing information for all GPU calls. Similarly, the AUTOGRAD_TIMER environment variable controls whether AutoGrad records timing information for the forward and backward passes over the tape in the AutoGrad.to variable. Here is what sample outputs look like:
julia> AutoGrad.to
 ──────────────────────────────────────────────────────────────────────
                               Time                   Allocations
                       ──────────────────────   ───────────────────────
   Tot / % measured:        4.62s / 30.4%            546MiB / 25.0%
 Section       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────
 +.[2]              1    328ms  23.3%   328ms   46.4MiB  34.1%  46.4MiB
 sum[2]             1    288ms  20.5%   288ms   40.0MiB  29.4%  40.0MiB
   *                1   38.8ms  2.76%  38.8ms    595KiB  0.43%   595KiB
 *                  1    269ms  19.2%   269ms    955KiB  0.68%   955KiB
 +.                 1    139ms  9.92%   139ms   20.4MiB  15.0%  20.4MiB
 *[1]               1    117ms  8.33%   117ms   9.41MiB  6.90%  9.41MiB
 record             4   88.7ms  6.31%  22.2ms   3.49MiB  2.56%   894KiB
 -[1]               1   65.9ms  4.69%  65.9ms   10.0MiB  7.32%  10.0MiB
 -                  1   55.8ms  3.97%  55.8ms    929KiB  0.67%   929KiB
 sum                1   50.0ms  3.56%  50.0ms   4.68MiB  3.44%  4.68MiB
 +.[1]              1   1.78ms  0.13%  1.78ms   37.7KiB  0.03%  37.7KiB
 sum_outgrads       5   1.41ms  0.10%   282μs   28.2KiB  0.02%  5.64KiB
 ──────────────────────────────────────────────────────────────────────
julia> Knet.to
 ──────────────────────────────────────────────────────────────────────────────────────
                                               Time                   Allocations
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                76.3s / 8.89%           4.10GiB / 0.02%
 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 sum_32_20                        206    4.96s  73.2%  24.1ms   3.22KiB  0.35%        -
 cudaRuntimeGetVersion              1    736ms  10.9%   736ms         -  0.00%        -
 cudaSetDevice                      1    563ms  8.29%   563ms         -  0.00%        -
 cublasSgemm_v2                    96    211ms  3.11%  2.20ms    663KiB  72.1%  6.91KiB
 cublasCreate_v2                    1    166ms  2.44%   166ms         -  0.00%        -
   cublasGetVersion_v2              1   2.95μs  0.00%  2.95μs         -  0.00%        -
 nvmlInit                           1    161ms  2.37%   161ms         -  0.00%        -
 cudaMemcpy                     5.17k   72.0ms  1.06%  13.9μs    191KiB  20.8%        -
 curandCreateGenerator              1   20.0ms  0.29%  20.0ms         -  0.00%        -
 sum_64_20                        602   17.4ms  0.26%  28.9μs   9.41KiB  1.02%        -
 cudaMalloc                       456   9.93ms  0.15%  21.8μs         -  0.00%        -
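
The tables above are ordinary TimerOutputs reports. As a minimal sketch of the mechanism, independent of Knet, the following creates a local timer and wraps two operations in `@timeit`. The timer name `to` is only a stand-in for Knet.to / AutoGrad.to, and the section labels "sum" and "matmul" are illustrative. This is not Knet's actual instrumentation code.

```julia
using TimerOutputs

# Stand-in for Knet.to / AutoGrad.to (assumption: those are TimerOutput objects).
const to = TimerOutput()

x = rand(Float32, 256, 256)

for _ in 1:3
    # Each @timeit call adds one measurement to the named section.
    @timeit to "sum" sum(x)
    @timeit to "matmul" x * x
end

# Prints a report with ncalls, time, and allocation columns per section.
show(to)
println()
```

If Knet.to and AutoGrad.to are plain TimerOutput objects, as the report format suggests, the same `show` call should apply to them once the packages have been built with the corresponding environment variables set.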