Another post on package compilation time

Albert_Zevelev · June 25, 2020, 12:15am

Let me start by saying I’m very new to issues regarding compilation.
(I never heard of compilation when I used Matlab-R-STATA.)

Suppose I load a Julia package AAA.jl with one function f(x,y) .
The first time I call f(x,y) is usually slow for a given set of argument types. But it is usually much faster after that.

using AAA
f(1,2)   # usually slow, compiles f(Int,Int)
f(1,2)   # much faster
f(2,5)   # equally fast 
f(2.0,5) # slow again, must compile f(Float,Int)
f(2.0,5) # faster

When I run the following w/ ARCHModels.jl:

using ARCHModels
@time selectmodel(TGARCH,BG96,minlags=1,maxlags=1)
@time selectmodel(TGARCH,BG96,minlags=1,maxlags=1)
@time selectmodel(TGARCH,BG96,minlags=2,maxlags=2)
@time selectmodel(TGARCH,BG96,minlags=2,maxlags=2)
@time selectmodel(TGARCH,BG96,minlags=3,maxlags=3)
@time selectmodel(TGARCH,BG96,minlags=3,maxlags=3)
@time selectmodel(TGARCH,BG96,minlags=1,maxlags=3)
@time selectmodel(TGARCH,BG96,minlags=1,maxlags=3)

The compilation time becomes slow every time I change the number of lags, even though the types of each argument is the same:

julia> 
  5.247489 seconds (18.78 M allocations: 927.158 MiB, 6.23% gc time)
  0.005477 seconds (2.68 k allocations: 212.594 KiB)
  1.121684 seconds (3.90 M allocations: 183.957 MiB, 4.12% gc time)
  0.012191 seconds (10.07 k allocations: 1.081 MiB)
  1.179797 seconds (4.01 M allocations: 188.687 MiB, 5.05% gc time)
  0.018243 seconds (10.68 k allocations: 1.540 MiB)
 25.694519 seconds (90.85 M allocations: 4.179 GiB, 3.69% gc time)
  0.098000 seconds (282.74 k allocations: 32.622 MiB)

julia>

Can this be improved?

ericphanson · June 25, 2020, 1:00am

It looks like the minlags gets promoted into the type domain here, which means that this is the same compilation time as when you call a function with a new type.

I don’t know anything about the package, but what that probably means is the author is making the tradeoff to have the compiler treat that value as a constant (meaning it has to compile a new specialisation whenever it gets a new value) to gain runtime performance.

I have a blog post about using the same trick for permuting vectors and I do some light benchmarking to show you can speed up runtime at the cost of compile time first by specialising on the length of the permutation, and then even more dramatically by specialising on the whole permutation itself. Hopefully that helps explain the technique.

Edit: So that means speeding that up is no harder or easier than the usual case, and besides the latency work by the core devs, doing things like using SnoopCompile to generate precompiles for the package could help.

Albert_Zevelev · June 25, 2020, 1:22am

Let me see if I understand bc this is all Greek to me.

The function: selectmodel(TGARCH,BG96,minlags=1,maxlags=3)
fits every lag combo of TGARCH{o,p,q} for lags 1:3.

For each combo of lags, it creates a volatility specification which itself is a new type:
VSi = VS{ind.I[1:ndims] .+ minlags .-1...}
then it fits:
res[ind] = fit(VSi, data; dist=dist, meanspec=MSi)

I think that means that selectmodel() is in fact compiled once, however, fit must compile anew for each new set of lags.

For example:
fit(GARCH{1,1}, BG96, dist=StdNormal, meanspec=Intercept)
Is different from:
fit(GARCH{1,2}, BG96, dist=StdNormal, meanspec=Intercept)

Perhaps one solutions is to quickly compile fit for different combos of lags…

ericphanson · June 25, 2020, 1:25am

Yep, exactly!

Albert_Zevelev · June 25, 2020, 1:44am

There must be a better way. I’m thinking out loud & open to suggestions.

The lag parameters p,q in GARCH{p,q} are basically hyper parameters (HP).
In MLJ.jl, models are compiled once and the compilation price doesn’t have to be paid again for a different set of HP.

using MLJ
X, y =  @load_boston;
train, test = partition(eachindex(y), .7, rng=333);
@time load("LGBMRegressor", verbosity=0)
#
@time mdl = LGBMRegressor(num_iterations = 1_000, min_data_in_leaf=10)
@time mach = machine(mdl, X, y)
@time fit!(mach, rows=train, verbosity=0)
#
@time mdl = LGBMRegressor(num_iterations = 1_500, min_data_in_leaf=10)
@time mach = machine(mdl, X, y)
@time fit!(mach, rows=train, verbosity=0)

Maybe that’s the way to do it…

Or maybe there is a way to directly change the fit function?
So it automatically compiles not for a specific volatility spec (subtype) but for the more general supertype.
fit(GARCH{1,1}, BG96, dist=StdNormal, meanspec=Intercept)
there is probably still a tradeoff betw compilation time & run time…

Tamas_Papp · June 25, 2020, 8:51am

Possibly, but it may be best to ask the package maintainer about the trade-offs and the motivation for this design choice. I see that you already did this:

github.com/s-broda/ARCHModels.jl

Compilation speed

opened 12:02AM - 25 Jun 20 UTC

closed 02:22PM - 29 Jun 20 UTC

azev77

I wonder if compilation speed can be improved. Suppose I load a Julia package… `ACME` with one function `f(x,y)`. The first time I use `f(x,y)` is usually slow for a given set of argument types. But it is usually much faster after that. ```julia using ACME f(1,2) # usually slow f(1,2) # much faster f(2,5) # equally fast f(2.0,5) # slow again, bc it prev compiled f(Int,Int), now it must compile f(Float,Int) f(2.0,5) # faster ``` When I run: ```julia using ARCHModels @time selectmodel(TGARCH,BG96,minlags=1,maxlags=1) @time selectmodel(TGARCH,BG96,minlags=1,maxlags=1) @time selectmodel(TGARCH,BG96,minlags=2,maxlags=2) @time selectmodel(TGARCH,BG96,minlags=2,maxlags=2) @time selectmodel(TGARCH,BG96,minlags=3,maxlags=3) @time selectmodel(TGARCH,BG96,minlags=3,maxlags=3) @time selectmodel(TGARCH,BG96,minlags=1,maxlags=3) @time selectmodel(TGARCH,BG96,minlags=1,maxlags=3) ``` The compilation time becomes slow every time I change the number of lags, even though the types of each argument is the same: ```julia julia> 5.247489 seconds (18.78 M allocations: 927.158 MiB, 6.23% gc time) 0.005477 seconds (2.68 k allocations: 212.594 KiB) 1.121684 seconds (3.90 M allocations: 183.957 MiB, 4.12% gc time) 0.012191 seconds (10.07 k allocations: 1.081 MiB) 1.179797 seconds (4.01 M allocations: 188.687 MiB, 5.05% gc time) 0.018243 seconds (10.68 k allocations: 1.540 MiB) 25.694519 seconds (90.85 M allocations: 4.179 GiB, 3.69% gc time) 0.098000 seconds (282.74 k allocations: 32.622 MiB) julia> ``` Can this be improved?

I would recommend waiting for a response there since your question pertains to a particular package.

Incidentally, it is not clear to me what version of Julia you are using, but 1.6/master improved this a lot.

s-broda · June 25, 2020, 9:41am

Package author here. As mentioned by @ericphanson, indeed the lag length is a type parameter, so every time you estimate a model with a new lag length, it is a new type and thus certain methods need to be recompiled (e.g., the likelihood). The advantage of doing this is that it allows the compiler to make certain optimizations, such as unrolling the loop in the likelihood for small values of p and q. The tradeoff is more compilation. I’ve however benchmarked this against not parameterizing the type, and the difference in compile times wasn’t actually all that large. I’m not entirely sure why that is; I think it must have to do with the fact that either way, you wind up passing a new closure into Optim.optimize for every model you’re fitting (because optimize only accepts single argument functions; see here), and that triggers some recompilation anyhow. Perhaps @pkofod can chime in here.

s-broda · June 25, 2020, 9:53am

One more thing: while the compile time issue will improve with time (it is high on the core devs’ priority list), you can always bake the package into your system image with PackageCompiler.jl with a suitable snoop file if you really care about latency. Since usually people only estimate models with moderate lag length, it should be feasible to compile in the most common ones.

The upside is that after compilation, the speed of the package is really state of the art. Quoting from our paper:

We find that ARCHModels.jl is faster than Python’s ARCH package by a factor of about 6, faster than Matlab’s Econometrics Toolbox by a factor of about 12, and faster than R’s rugarch package by a factor of about 50. This is despite the fact that all packages (except ARCHModels.jl) implement the likelihood function in a compiled low-level language.

s-broda · June 25, 2020, 10:59am

You can actually repeat my benchmarks for yourself: the nospec branch on Github contains a TGARCHn type that isn’t parameterized on lag length. Let’s compare the speed of fit (selectmodel doesn’t work on that branch).

Current implementation:

julia> @time fit(TGARCH{0, 1, 1}, BG96);
  1.426989 seconds (4.21 M allocations: 205.227 MiB, 3.95% gc time)
  
julia> @time fit(TGARCH{0, 1, 2}, BG96);
  0.453361 seconds (2.07 M allocations: 96.613 MiB, 4.03% gc time)

Without parameterizing the type (this is in a fresh session):

julia> @time fit(TGARCHn(0, 1, 1), BG96);
  1.455644 seconds (4.28 M allocations: 206.525 MiB, 3.13% gc time)

julia> @time fit(TGARCHn(0, 1, 2), BG96);
  0.477704 seconds (1.96 M allocations: 88.964 MiB, 5.86% gc time)

As you can see, there’s basically no difference. Do note that in both examples, the call with a different lag length is a little bit faster than the first, because not everything is getting recompiled.

After the first call, the implementation with type parameters is faster than without:

julia> @btime fit(TGARCHn(0, 1, 1), BG96);
  2.838 ms (2131 allocations: 161.38 KiB)

julia> @btime fit(TGARCH{0, 1, 1}, BG96);
  2.698 ms (2130 allocations: 161.23 KiB)

That’s the compiler optimizations that this enables. Admittedly, the difference is not very big at around 5% though.

pkofod · June 25, 2020, 1:53pm

You could make a

struct LogLike{T, R, E}
  SD::T 
  data::R 
  x::E 
end
function (ll::LogLike)(x)
  -loglike(ll.SD, ll.data, ll.x
end

type I guess.

s-broda · June 25, 2020, 2:17pm

Thanks for chiming in! Maybe I’m being dense, but isn’t this exactly how closures are implemented under the hood? Wouldn’t that incur the same problem?

Unless maybe if the struct were mutable so that one can reach in and change these parameters after the fact, thus reusing it?

jebej · June 25, 2020, 2:58pm

Presumably the argument types do not change, and therefore passing them within a functor (struct with a function attached) LogLike that you create instead of a closure will always reuse the same method (ll::LogLike{T, R, E})(x) after it has been compiled the first time.

edit: That is, even if you make a new instance of the struct with different parameters. The difference with a closure is that a new closure is created each time, which is a new type for which the method must be recompiled.

s-broda · June 25, 2020, 5:52pm

Now this is interesting: the issue isn’t actually the closure. I’ve tried the functor approach, and it’s as slow. The issue is actually with ForwardDiff. It seems that every time the number of parameters changes, it’s actually the gradients that get recompiled, because they have a different dimension.

I can force this not to happen by zero-padding the coefficient vector to a fixed length. This seems to work, but this feels like a terrible hack. I’m also not sure if that’s safe to do, i.e., will I still get correct derivatives? After all, the meaning of the parameters changes between models.

Tamas_Papp · June 26, 2020, 8:05am

I think this will be more costly at runtime.

To put this question in a bit broader context: Julia allows packages like ForwardDiff to generate efficient and specialized code, at the cost of some compilation time. This trade-off is embraced by a lot of packages in the ecosystem. Specifically for ForwardDiff, a tag to avoid perturbation confusion is also part of the type.

It is conceivable that one could implement a forward AD package that does put the dimension in the type, resulting in different trade-offs, but this has not been done AFAIK. I would speculate that this (“fast time to first AD”) is a niche application given the focus/audience of Julia, but it might be an interesting experiment.

s-broda · June 29, 2020, 2:27pm

@Tamas_Papp Turns out this is not all that costly at runtime. I think I’ve found a good compromise: when doing model selection (and only then), I now estimate each model as a subset-GARCH model nested within a larger model determined by a maxlags parameter. This effectively “lowers” the type parameters into the value domain temporarily.

Taking a GARCH(1, 1) nested in a TGARCH{5, 5, 5} as an example, this increases the runtime from around 3 milliseconds to 12 milliseconds on a dataset with ~2k observations. It does, however, avoid the ~500 milliseconds of compilation time, so this is a good tradeoff when estimating many different models, as occurs in model selection. In a fresh session, we now have

julia> @time selectmodel(TGARCH, BG96; minlags=0, maxlags=3);
  9.977057 seconds (35.13 M allocations: 1.707 GiB, 22.18% gc time)

julia> @time selectmodel(TGARCH, BG96; minlags=0, maxlags=3);
  0.110632 seconds (876.17 k allocations: 108.719 MiB)

compared to

julia> @time selectmodel(TGARCH, BG96; minlags=0, maxlags=3);
 91.165097 seconds (394.59 M allocations: 17.032 GiB, 4.62% gc time)

julia> @time selectmodel(TGARCH, BG96; minlags=0, maxlags=3);
  0.062485 seconds (866.18 k allocations: 82.316 MiB)

before. For estimation that does not happen in a model selection context, I still specialize on the type, because one might want to do this thousands of times (e.g., when backtesting or fitting a large-dimensional DCC model that uses univariate models in the marginals), so it makes sense to squeeze out every bit of performance then.

@Albert_Zevelev Thanks a lot for making me re-evaluate this. This really improves the user experience quite substantially. If you want to give it a spin, it’s part of version 1.2.0 which will be available as soon as https://github.com/JuliaRegistries/General/pull/17176 gets merged.

Topic		Replies	Views
Julia vs R vs Python: simple MLE problem Optimization (Mathematical)	43	5207	April 5, 2019
22 seconds to 3 and now more: Let's fix all of the DifferentialEquations.jl + universe compile times! Performance precompilation	6	5028	August 17, 2021
Potential performance regressions in Julia 1.8 for special un-precompiled type dispatches and how to fix them Performance performance	25	3452	October 4, 2022
Why is rust compilation faster than Julia pre-compilation? Internals & Design question	37	3684	April 26, 2024
Extremely high first call latency Julia 1.6 versus 1.5 with multiphysics PDE solver Performance question	27	2742	December 2, 2021

Another post on package compilation time

Related topics