Compilation time of Flux models

Cheers,

I was struggling to pinpoint why a certain FCN with fewer than 1M parameters trained much more slowly than another one with 30M parameters. Then I realized that, on every epoch, the training loop reported that around 90% of total execution time was spent on compilation!

I then ran the simple model below on several computers. The code is as follows:

using Flux
model = Flux.Conv((1,1), 3=>1)

loss(yhat, y) = Flux.mse(yhat, y)
opt = Flux.Adam()
optstate = Flux.setup(opt, model)

X = rand(Float32, (128,128,3,1))
s = size(model(X))
Y = rand(Bool, s)
data = Flux.DataLoader((X,Y); batchsize=1)

@time Flux.train!(model, data, optstate) do m, x, y
    loss(m(x), y)
end

The results on two different CPUs, both on Ubuntu and with no GPU, were as follows:

WSL, Core i7 CPU

12.385664 seconds (16.22 M allocations: 1.038 GiB, 3.96% gc time, 99.97% compilation time)

0.142981 seconds (331.26 k allocations: 22.813 MiB, 98.65% compilation time)

0.150819 seconds (331.25 k allocations: 22.782 MiB, 4.89% gc time, 98.80% compilation time)

0.142561 seconds (331.26 k allocations: 22.765 MiB, 98.55% compilation time)

0.147581 seconds (331.26 k allocations: 22.772 MiB, 4.23% gc time, 98.46% compilation time)

Ubuntu ARM Ampere CPU

16.378631 seconds (16.22 M allocations: 1.036 GiB, 3.62% gc time, 99.92% compilation time)

0.267535 seconds (331.26 k allocations: 21.639 MiB, 98.50% compilation time)

0.231993 seconds (331.25 k allocations: 21.656 MiB, 5.69% gc time, 98.35% compilation time)

0.216757 seconds (331.26 k allocations: 21.643 MiB, 98.35% compilation time)

0.226803 seconds (331.26 k allocations: 21.648 MiB, 4.76% gc time, 98.53% compilation time)

In both cases, the first execution of train! was slower, as expected. The subsequent ones were much faster, but still with an unexpectedly high compilation percentage.

Could this be a bug in Julia?

Thanks in advance.

Can you show how you included this script in your REPL?

Do you load it again with include every single time?

Or did you just run

@time Flux.train!(model, data, optstate) do m,x,y
    loss(m(x), y)
end

multiple times?

The code is open in VS Code. On the first run, the whole script is executed. On subsequent runs, I just select the train! portion and re-run it. The results are collected in the REPL window inside VS Code.

As you can see in the picture above, there is an extra twist: the code was also run on a GPU. Surprisingly, the figures were similar to those without a GPU.

Evaluating selected code in VS Code re-runs it from scratch every time; you are essentially calling eval("@time Flux.train!(...)") in the REPL each time. From this, it’s clear why recompilation is required: a new anonymous function do m,x,y ... end is created on every evaluation, which means it needs to be recompiled. This also explains why subsequent evaluations only need a fraction of the time to compile, since everything else except that newly-defined anonymous function is already compiled.

To clarify, this is not an issue with using VS Code. This happens when you repeatedly redefine an anonymous function in the REPL. The anonymous function gets recompiled every time:

julia> @time map(1:4) do x
           2x
       end;
  0.025773 seconds (34.32 k allocations: 2.601 MiB, 75.96% compilation time)

julia> @time map(1:4) do x
           2x
       end;
  0.023306 seconds (26.20 k allocations: 1.748 MiB, 99.18% compilation time)

julia> @time map(1:4) do x
           2x
       end;
  0.024590 seconds (26.20 k allocations: 1.748 MiB, 99.30% compilation time)

If the anonymous function only gets defined once, e.g. inside another function, then you don’t get recompilation. With a new Julia session:

julia> function foo()
           map(1:4) do x
               2x
           end
       end;

julia> @time foo();
  0.000003 seconds (1 allocation: 96 bytes)

julia> @time foo();
  0.000003 seconds (1 allocation: 96 bytes)

(In case you’re not aware, the do block syntax creates an anonymous function.)
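To connect this back to the Flux example in the first post, here is a minimal sketch (the train_one_epoch! name is just illustrative) that reuses the model, data, and optstate defined there. Because the do block now lives inside a named function, its anonymous function is defined only once, so repeated calls don’t recompile it:

using Flux

# Reuses `model`, `data`, and `optstate` from the first post.
# The do block is defined once, inside this function, so calling
# train_one_epoch! repeatedly does not recompile the closure.
function train_one_epoch!(model, data, optstate)
    Flux.train!(model, data, optstate) do m, x, y
        Flux.mse(m(x), y)
    end
end

@time train_one_epoch!(model, data, optstate)   # first call: compiles
@time train_one_epoch!(model, data, optstate)   # later calls: no recompilation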

For bigger models this is probably negligible, but essentially all newcomers run into it, since this is the pattern shown in the docs.

Thank you all for the clarification.

Still on this topic, I’ve switched to BenchmarkTools for performance measurements.

I’ve run the simple model below on a Core i7 CPU with 8 GB of RAM and found a surprisingly high allocation figure, in excess of 700 MB. On an Ampere ARM CPU with 24 GB of RAM, allocations surpassed 800 MB.

These figures seem to be a concern for embedded models on devices with a small footprint.

Thanks again.

using Flux
using Random
using BenchmarkTools
Random.seed!(1234)
X = rand(Float32, (256,256,64,1))
model = Flux.Conv((3,3), 64 => 64)
@btime model(X) samples=5 seconds=15 gcsample=true

51.593 ms (45 allocations: 740.30 MiB)

The docs are meant to help with learning, not to present fully performance-optimized example code (that would be too overwhelming). In that context, 100 ms of latency for an operation which takes <1 s and is only run a couple of times is fine.

Moreover, the docs code blocks are either meant to be run once or to have a loop around train!, which amortizes the cost of compiling the closure. Re-evaluating a single train! call in the REPL is a decision someone has to arrive at independently. While we could put a warning about anonymous functions on every method that takes a callback, that would be excessive. It’s easier to expect people to familiarize themselves with the general tips on Julia performance when they run into something like this.
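To illustrate the amortization such a loop provides, here is a minimal sketch that reuses model, data, and optstate from the first post (the epoch count is arbitrary):

# A single top-level loop: the do block's anonymous function comes from one
# expression, so it is compiled on the first iteration and reused for all
# remaining epochs, amortizing the compilation cost.
for epoch in 1:10
    Flux.train!(model, data, optstate) do m, x, y
        Flux.mse(m(x), y)
    end
end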

This is almost certainly a separate issue (i.e. not for this thread) and probably caused by the CPU you’re using. Also, make sure you’re interpolating variables like X.
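For example, here is a minimal sketch of the interpolation being suggested, reusing the benchmark from the earlier post:

using Flux
using BenchmarkTools

X = rand(Float32, (256, 256, 64, 1))
model = Flux.Conv((3, 3), 64 => 64)

# `$` interpolates the globals into the benchmark expression, so @btime
# measures the convolution itself rather than access to untyped globals.
@btime $model($X) samples=5 seconds=15 gcsample=true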