ANN: TimerOutputs.jl



Dear Julia friends,

I am glad to present TimerOutputs, a Julia package I created quite a while ago and to which I recently made significant updates and improvements. You could consider it the big brother of the @time macro in Base. As you know, the @time macro prints its timings immediately when the timed section finishes. In TimerOutputs, each timed section is given a label, so times and allocations can be accumulated over the course of multiple calls to that section. When desired, the data for the timed sections can be printed in a nicely formatted table, which can be tweaked to show only the things you are interested in.

My aim for this package is to be useful both for people who want to time sections of their own programs (finite element simulations, optimization routines, etc.) and for package authors who want to present information about timings and allocations to the user. As an example, optimization packages could use it to easily measure how much time is spent in the user-supplied function vs. in the actual optimization routine.

A short toy example to show the syntax and output is shown below:

using TimerOutputs

# Create the timer object
to = TimerOutput()

# Time something with an assigned label
@timeit to "sleep" sleep(0.3)

# Data is accumulated for multiple calls
for i in 1:100
    @timeit to "loop" 1+1
end

# Nested sections are possible
@timeit to "nest 1" begin
    @timeit to "nest 2" begin
        @timeit to "nest 3.1" rand(10^3)
        @timeit to "nest 3.2" rand(10^4)
        @timeit to "nest 3.3" rand(10^5)
    end
end

Displaying the to timer now shows:

                                Time                   Allocations      
                        ──────────────────────   ───────────────────────
    Tot / % measured:        6.48s / 5.60%           77.4MiB / 12.0%    

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 sleep               1    338ms  93.2%   338ms    804KiB  8.43%   804KiB
 nest 1              1   24.7ms  6.80%  24.7ms   8.52MiB  91.5%  8.52MiB
   nest 2            1   9.10ms  2.51%  9.10ms    899KiB  9.43%   899KiB
     nest 3.1        1   3.27ms  0.90%  3.27ms   8.67KiB  0.09%  8.67KiB
     nest 3.3        1   3.05ms  0.84%  3.05ms    796KiB  8.34%   796KiB
     nest 3.2        1   2.68ms  0.74%  2.68ms   92.4KiB  0.97%  92.4KiB
 loop              100   6.97μs  0.00%  69.7ns   6.08KiB  0.06%      62B

The README has full documentation of what the data in the table represents, how to customize the way the table is displayed, how to programmatically access the data in the table, and much more.
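As a quick taste of the programmatic access, here is a minimal sketch (the accessor names are taken from my reading of the README, so check there for the authoritative API):

```julia
using TimerOutputs

to = TimerOutput()
@timeit to "sleep" sleep(0.1)

# Sections are indexable by their label; the accessors return raw
# numbers (time in nanoseconds, allocations in bytes).
sec = to["sleep"]
TimerOutputs.ncalls(sec)     # number of times the section was entered
TimerOutputs.time(sec)       # accumulated time in ns
TimerOutputs.allocated(sec)  # accumulated allocations in bytes
```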

As a more realistic example, we can use TimerOutputs to “profile” Optim while it optimizes a function. In the example below, we load an example optimization problem, create timed wrappers for the function to be optimized as well as its gradient and Hessian, and also time the optimize call:

using Optim
using TimerOutputs

const to = TimerOutput()
prob = Optim.UnconstrainedProblems.examples["Rosenbrock"];

f(x    ) =  @timeit to "f"  prob.f(x)
g!(x, g) =  @timeit to "g!" prob.g!(x, g)
h!(x, h) =  @timeit to "h!" prob.h!(x, h)

@timeit to "Trust Region" begin
    res = Optim.optimize(f, g!, h!, prob.initial_x, method=NewtonTrustRegion())
end

show(to; allocations = false)

With these show options, the displayed output is:

    Tot / % measured:        804μs / 95.7%    

 Section        ncalls     time   %tot     avg
 Trust Region        1    769μs   100%   769μs
   g!               16   5.95μs  0.77%   372ns
   f                16   3.82μs  0.50%   239ns
   h!               15   2.99μs  0.39%   199ns

This shows that for this very simple optimization problem, most of the time is spent not in the functions being optimized but in Optim itself. For a non-toy function the results would of course be very different.
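The show call in the Optim example used the allocations keyword to hide the allocation columns; the display accepts a few other keywords as well (sortby is the one I rely on below — the names come from my reading of the README, so treat this as a sketch):

```julia
using TimerOutputs

to = TimerOutput()
@timeit to "f" sum(rand(10))
for i in 1:5
    @timeit to "g" sin(1.0)
end

# Sort the sections by call count and skip the allocation columns.
show(to; allocations = false, sortby = :ncalls)
```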

Hope you find the package useful and if you have any issues or requests, just open an issue on the repo.

Thank you for your attention and happy timing!

// Kristoffer Carlsson


Thank you Kristoffer for this wonderful package!

Profiling Knet.jl had been a pain because of frequent nondeterministic crashes when using CUDA libraries. With TimerOutputs there is no crash; we can finally profile GPU code to our heart’s content. Here is a mini-tutorial on profiling code on the GPU, such as deep learning models:

First, remember to call cudaDeviceSynchronize where appropriate; otherwise the timer will attribute a lot of time to the wrong lines: not the expensive functions, but wherever the asynchronous calls happen to complete. In Knet this is achieved by editing the following macro, replacing if false with if true:

macro gs(); if false; esc(:(ccall(("cudaDeviceSynchronize","libcudart"),UInt32,()))); end; end
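To illustrate why the synchronization matters, here is a sketch of mine with a placeholder gpu_sync function — on a real GPU setup it would be the cudaDeviceSynchronize ccall from the macro above:

```julia
using TimerOutputs
const to = TimerOutput()

# Placeholder for cudaDeviceSynchronize; on a real GPU setup this
# would be ccall(("cudaDeviceSynchronize", "libcudart"), UInt32, ())
gpu_sync() = nothing

# Synchronize before the timed section closes, so the cost of the
# asynchronous operation is charged to this label rather than to
# whichever later line happens to block on the result.
out = @timeit to "kernel" begin
    x = sum(rand(10^6))   # stand-in for an asynchronous GPU call
    gpu_sync()
    x
end
```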

Then sprinkle appropriate @timeit expressions over your code:

using TimerOutputs
const to = TimerOutput()

function predict(model, x, ygold; odrop=(0.25, 0.25, 0.5))
    @timeit to "conv" _conv1 = conv4(model[1], x;padding=1)
    @timeit to "bias" conv1_2 = _conv1 .+ model[2]
    @timeit to "relu" rel1 = relu_dot(conv1_2)
    @timeit to "conv" _conv2 = conv4(model[3], rel1; padding=1)
    @timeit to "bias" conv2_2 = _conv2 .+ model[4]
    @timeit to "pool" pool1 = pool(conv2_2)
    # ... remaining layers elided ...
end

And get beautiful tables after running your code:

julia> to
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      15.4s / 69.1%           26.6MiB / 65.9%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 conv       3.13k    8.39s  79.0%  2.68ms   9.22MiB  52.6%  3.02KiB
 bias       3.13k    1.08s  10.2%   346μs   3.41MiB  19.5%  1.12KiB
 fc         1.56k    607ms  5.72%   388μs   1.31MiB  7.49%        -
 drop       2.35k    319ms  3.00%   136μs    565KiB  3.15%        -
 pool       1.56k    222ms  2.09%   142μs   3.00MiB  17.1%  1.96KiB
 flat         782    395μs  0.00%   505ns   24.4KiB  0.14%        -
 relu       3.13k    143μs  0.00%  45.7ns         -  0.00%        -

This immediately showed me that conv was the most significant cost (as expected), and after a bit of research on CUDNN’s auto-tune facilities I was able to double the speed of our code.



I recently added (and tagged) functionality to annotate function definitions. This is sometimes more convenient than adding the macro at the call site or surrounding a whole function body with it. Annotating a function definition is done with the same @timeit macro, and it records a section with the same name as the function whenever the function is called.

For example:

julia> using TimerOutputs

julia> const to = TimerOutput();

julia> @timeit to f(x) = x
f (generic function with 1 method)

julia> @timeit to function g(x)
           # Doing stuff
           return sin(x)
       end
g (generic function with 1 method)

julia> for i in 1:100 f(2) end;

julia> for j in 1:10 g(2.0) end;

julia> to
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      9.39s / 0.00%           1.08MiB / 0.00%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 f            100   5.15μs  75.2%  51.5ns         -   - %         -
 g             10   1.70μs  24.8%   170ns         -   - %         -

The timer argument to can be left out, in which case the global timer is used:

julia> @timeit h(x) = x
h (generic function with 1 method)

julia> h(2)
2

julia> print_timer()
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      3.88s / 0.00%            289KiB / 0.00%

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 h              1    295ns   100%   295ns         -   - %         -
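One more note: accumulated data can be cleared with reset_timer! when you want a fresh measurement. A minimal sketch:

```julia
using TimerOutputs

const to = TimerOutput()
for i in 1:10
    @timeit to "work" sum(rand(100))
end

# Wipe the accumulated times, allocations, and call counts.
reset_timer!(to)

# Timing again now starts the counts from zero.
@timeit to "work" sum(rand(100))
```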

Happy timing!


This code works wonderfully. Thank you very much, @kristoffer.carlsson.