Compilation time of Flux models

Cheers,

I was struggling to pinpoint why a certain FCN with fewer than 1M parameters trained much more slowly than another one with 30M parameters. Then I realized that, on every epoch, the training loop reported that around 90% of total execution time was spent on compilation!

I then ran the simple model below on several computers. The code is as follows:

using Flux
model = Flux.Conv((1,1), 3=>1)

loss(yhat, y) = Flux.mse(yhat, y)
opt = Flux.Adam()
optstate = Flux.setup(opt, model)

X = rand(Float32, (128,128,3,1))
s = size(model(X))
Y = rand(Bool, s)
data = Flux.DataLoader((X,Y); batchsize=1)

@time Flux.train!(model, data, optstate) do m, x, y
    loss(m(x), y)
end

The results on two different CPUs, both on Ubuntu and with no GPU, were as follows:

WSL, Core i7 CPU

12.385664 seconds (16.22 M allocations: 1.038 GiB, 3.96% gc time, 99.97% compilation time)

0.142981 seconds (331.26 k allocations: 22.813 MiB, 98.65% compilation time)

0.150819 seconds (331.25 k allocations: 22.782 MiB, 4.89% gc time, 98.80% compilation time)

0.142561 seconds (331.26 k allocations: 22.765 MiB, 98.55% compilation time)

0.147581 seconds (331.26 k allocations: 22.772 MiB, 4.23% gc time, 98.46% compilation time)

Ubuntu ARM Ampere CPU

16.378631 seconds (16.22 M allocations: 1.036 GiB, 3.62% gc time, 99.92% compilation time)

0.267535 seconds (331.26 k allocations: 21.639 MiB, 98.50% compilation time)

0.231993 seconds (331.25 k allocations: 21.656 MiB, 5.69% gc time, 98.35% compilation time)

0.216757 seconds (331.26 k allocations: 21.643 MiB, 98.35% compilation time)

0.226803 seconds (331.26 k allocations: 21.648 MiB, 4.76% gc time, 98.53% compilation time)

In both cases, the first execution of train! was slower, as expected. The subsequent ones were much faster, but still with an unexpectedly high compilation percentage.

Could this be a bug in Julia?

Thanks in advance.

Can you show how you included this script in your REPL?

Do you load it again with include every single time?

Or did you just run

@time Flux.train!(model, data, optstate) do m,x,y
    loss(m(x), y)
end

multiple times?

The code is open in VS Code. On the first run, the whole script is executed. On subsequent runs, I just select the train! portion and re-run it. The results are collected in the REPL window inside VS Code.

As you can see in the picture above, there is an extra twist: the code was also run on a GPU. Surprisingly, the figures were similar to those without a GPU.

Evaluating selected code in VS Code re-runs it from scratch every time; you are essentially calling eval("@time Flux.train!(...)") in the REPL each time. From this, it’s clear why recompilation is required: a new anonymous function do m,x,y ... end is created on every evaluation, which means it needs to be recompiled. This also explains why subsequent evaluations only need a fraction of the time to compile, since everything else except that newly-defined anonymous function is already compiled.

To clarify, this is not an issue with using VS Code. This happens when you repeatedly redefine an anonymous function in the REPL. The anonymous function gets recompiled every time:

julia> @time map(1:4) do x
           2x
       end;
  0.025773 seconds (34.32 k allocations: 2.601 MiB, 75.96% compilation time)

julia> @time map(1:4) do x
           2x
       end;
  0.023306 seconds (26.20 k allocations: 1.748 MiB, 99.18% compilation time)

julia> @time map(1:4) do x
           2x
       end;
  0.024590 seconds (26.20 k allocations: 1.748 MiB, 99.30% compilation time)

If the anonymous function only gets defined once, e.g. inside another function, then you don’t get recompilation. With a new Julia session:

julia> function foo()
           map(1:4) do x
               2x
           end
       end;

julia> @time foo();
  0.000003 seconds (1 allocation: 96 bytes)

julia> @time foo();
  0.000003 seconds (1 allocation: 96 bytes)

(In case you’re not aware, the do block syntax creates an anonymous function.)
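To connect this back to the Flux example in the first post, here is a minimal sketch (the train_one_epoch! name is just illustrative) that reuses the model, data, and optstate defined there. Because the do block now lives inside a named function, its anonymous function is defined only once, so repeated calls don’t recompile it:

using Flux

# Reuses `model`, `data`, and `optstate` from the first post.
# The do block is defined once, inside this function, so calling
# train_one_epoch! repeatedly does not recompile the closure.
function train_one_epoch!(model, data, optstate)
    Flux.train!(model, data, optstate) do m, x, y
        Flux.mse(m(x), y)
    end
end

@time train_one_epoch!(model, data, optstate)   # first call: compiles
@time train_one_epoch!(model, data, optstate)   # later calls: no recompilation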

For bigger models this is probably negligible, but essentially all newcomers run into it, since this is the pattern shown in the docs.

Thank you all for the clarification.

Still on this topic, I’ve switched to BenchmarkTools for performance measurements.

I’ve run the simple model below on a Core i7 CPU with 8 GB of RAM and found a surprisingly high allocation figure, in excess of 700 MB. On an Ampere ARM CPU with 24 GB of RAM, allocations surpassed 800 MB.

These figures seem to be a concern for embedded models on devices with a small footprint.

Thanks again.

using Flux
using Random
using BenchmarkTools
Random.seed!(1234)
X = rand(Float32, (256,256,64,1))
model = Flux.Conv((3,3), 64 => 64)
@btime model(X) samples=5 seconds=15 gcsample=true

51.593 ms (45 allocations: 740.30 MiB)

The docs are meant to help with learning, not to present fully performance-optimized example code (that would be too overwhelming). In that context, 100 ms of latency for an operation which takes <1 s and is only run a couple of times is fine.

Moreover, the docs code blocks are either meant to be run once or to have a loop around train!, which amortizes the cost of compiling the closure. Re-evaluating a single train! call in the REPL is a decision someone has to arrive at independently. While we could put a warning about anonymous functions on every method that takes a callback, that would be excessive. It’s easier to expect people to familiarize themselves with the general tips on Julia performance when they run into something like this.
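To illustrate the amortization such a loop provides, here is a minimal sketch that reuses model, data, and optstate from the first post (the epoch count is arbitrary):

# A single top-level loop: the do block's anonymous function comes from one
# expression, so it is compiled on the first iteration and reused for all
# remaining epochs, amortizing the compilation cost.
for epoch in 1:10
    Flux.train!(model, data, optstate) do m, x, y
        Flux.mse(m(x), y)
    end
end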

This is almost certainly a separate issue (i.e. not for this thread) and probably caused by the CPU you’re using. Also, make sure you’re interpolating variables like X.
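For example, here is a minimal sketch of the interpolation being suggested, reusing the benchmark from the earlier post:

using Flux
using BenchmarkTools

X = rand(Float32, (256, 256, 64, 1))
model = Flux.Conv((3, 3), 64 => 64)

# `$` interpolates the globals into the benchmark expression, so @btime
# measures the convolution itself rather than access to untyped globals.
@btime $model($X) samples=5 seconds=15 gcsample=true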