Completely frustrated with compilation times

I am working with Julia 1.5.3 on the development of a new package, DescriptorSystems, which basically manipulates 5-tuples of matrices (which describe so-called descriptor systems). Recently, I added a new object: a rational function (i.e., a pair of numerator and denominator polynomials) plus a real entity called the sampling time. The definition of a rational transfer function is:

struct RationalTransferFunction{T,X} <: AbstractRationalTransferFunction
    num::Polynomial{T,X}        # numerator polynomial
    den::Polynomial{T,X}        # denominator polynomial
    Ts::Union{Real,Nothing}   # sampling time
end

where the Polynomial object is defined in the latest version (v2.0.0) of the Polynomials package.

One of the basic constructors of descriptor systems (a function called dss) takes a rational matrix whose elements are rational transfer functions and produces a 5-tuple of matrices (called a realization). The generation is quite involved and calls functions for the manipulation of matrix pencils (pairs of matrices) provided by the MatrixPencils package.

A typical sequence used for test purposes is:

julia> @time z = rtf('z');     # define the variable z as a rational function
  0.133723 seconds (212.48 k allocations: 11.060 MiB, 30.90% gc time)

julia> @time Gd = [z^2 z/(z-2); 0 1/z];     # define the 2-by-2 improper Gd(z)
 11.165572 seconds (44.79 M allocations: 2.435 GiB, 9.00% gc time)

julia> @time sysd = dss(Gd,minimal = true,Ts = 1);
647.785513 seconds (1.86 G allocations: 110.919 GiB, 4.23% gc time)

where I included the times for the first execution (thus compilation times are also included). The time for executing dss is by no means acceptable for a user, even if the second execution is very fast:

julia> @time z = rtf('z');     # define the variable z as rational function
  0.000020 seconds (16 allocations: 1.125 KiB)

julia> @time Gd = [z^2 z/(z-2); 0 1/z];     # define the 2-by-2 improper Gd(z)
  0.000089 seconds (134 allocations: 8.453 KiB)

julia> @time sysd = dss(Gd,minimal = true,Ts = 1);
  0.000230 seconds (349 allocations: 499.531 KiB)

The alternative construction using two polynomial matrices (for numerators and denominators) behaves much better at the first execution:

julia> @time z = Polynomial([0, 1],'z') # define z as a monomial
  0.017358 seconds (55.52 k allocations: 2.897 MiB)

julia> @time Nd = [z^2 z; 0 1]; Dd = [1 z-2; 1 z]; # define numerators and denominators
  0.628269 seconds (2.70 M allocations: 140.437 MiB, 2.96% gc time)

julia> @time sysd = dss(Nd,Dd,minimal = true,Ts = 1);
 25.080765 seconds (83.56 M allocations: 3.849 GiB, 5.50% gc time)

I would appreciate any help that would contribute to substantially improving the compilation times.
I have the feeling that I did something fundamentally wrong in the definition of my rational function object, but I simply don’t know how to trace this issue back to the roots of the performance loss.

I would like to add that before switching to v2.0 of Polynomials, the compilation times were more or less acceptable (the whole test suite required about 20-30 minutes). Presently, the execution time of the whole test suite is about 22 minutes with Julia 1.6, 44 minutes with Julia 1.5, and over 1.5 hours with Julia 1.2, 1.3 and 1.4.

Many thanks in advance for your time and interest.


You may obtain some insights from this thread: Understanding and optimizing compiler time (just a bit)


In case you did not find it buried in the thread above, there is a blog post dedicated to profiling these issues. Both 25 seconds and 647 seconds are really long. I wonder what is happening.


Looks like a very nice tool for a linear systems theory course. Would it be possible to somehow make this package work in tandem with the ControlSystems package? Also, any plans to include Smith form/Smith-McMillan form?


It’s hard to know without a minimal working example of all the code (what is rtf?), but Ts in your struct isn’t type-stable; you could use a type parameter instead of Union{Real,Nothing}.
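The difference can be illustrated with a toy struct (the names below are hypothetical, not from DescriptorSystems): an abstractly-typed field stays abstract forever, while a type parameter becomes concrete at each instantiation:

```julia
# Hypothetical toy types, only to illustrate field-type concreteness.
struct UnstableTs
    Ts::Union{Real,Nothing}   # abstract field type: boxed, dispatched at runtime
end

struct StableTs{S<:Union{Real,Nothing}}
    Ts::S                     # concrete for each instantiated S
end

println(isconcretetype(fieldtype(UnstableTs, :Ts)))      # false
println(isconcretetype(fieldtype(StableTs{Int}, :Ts)))   # true
```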

After that, something much too complicated is being compiled: ~111 GiB of allocations is an insane amount. You may need to look at the functions in the packages you mention to understand what they are doing, and how it is failing in your case…

Also, I would go easy on the frustrated tone: this is open source software, not all packages have been written for your use case, and their authors have no responsibility to make them fast. It’s also possible the problem is in your own code. Providing a minimal working example would help everyone help you; people may not read through all those packages.


Lots of abstract types around, and Unions of abstract types. Something is causing 1.86 G allocations (!). I would start with some changes to arrive at concrete types for the fields.


Often with performance issues, I like to start with measurements to identify where the problem is, without changing anything. The SnoopCompile.jl package is designed to help with such measurements.


A bit of context. In Polynomials v2.0.0 the major change was to move the symbol from a field into the type domain (Polynomial{T} became Polynomial{T,X}). The goal of this change was to make it more idiomatic to use matrices of polynomials (though that hasn’t been as successful as anticipated) at the expense of recompilations when the symbol changed (which was anticipated to be limited).
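A quick check of that change (assuming Polynomials v2.0 or later is installed): two polynomials that differ only in their symbol now have different concrete types, so code specialized for one symbol must be recompiled for another:

```julia
using Polynomials  # assumes Polynomials v2.0+

p = Polynomial([0, 1], :x)
q = Polynomial([0, 1], :y)

# The symbol is a type parameter, so the types differ:
typeof(p) == typeof(q)   # false
```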

Many thanks for all your replies. I have already learned a lot about the secrets of the Julia compilation process. And I am aware that whatever is wrong lies entirely in my implementation, probably involving the choice of data structures. Still, my case is so extreme (in my opinion) that perhaps it is worth trying to figure out where things go wrong.

I was asked for a reproducible example. This certainly involves installing the packages DescriptorSystems, MatrixPencils and MatrixEquations (actually, this last one is not involved in the example).
Then, the following sequence leads to the reported results:

### Original example
using DescriptorSystems
@time z = rtf('z');
@time Gd = [z^2 z/(z-2); 0 1/z];
@time sysd = dss(Gd,minimal = true,Ts = 1);

Note: The function rtf just builds the variable z as a rational function with the polynomial z over 1.

Following the recommendations in your replies, I started analysing the compilation process myself using the SnoopCompile package, running on Julia 1.5.3. Here is some output:

### Analysis with @snoopi
julia> using DescriptorSystems
julia> using SnoopCompile
julia> inf_timing = @snoopi tmin=0.01 z = rtf('z')
2-element Array{Tuple{Float64,Core.MethodInstance},1}:
 (0.03800010681152344, MethodInstance for analyze_method!(::Int64, ::Core.Compiler.Signature, ::Any, ::Core.SimpleVector, ::Method, ::Expr, ::Core.Compiler.OptimizationState, ::Bool, ::Nothing, ::Any))
 (0.25499987602233887, MethodInstance for rtf(::Char))

julia> @time z = rtf('z');
  0.000013 seconds (16 allocations: 1.125 KiB)

julia> inf_timing = @snoopi tmin=0.01 Gd = [z^2 z/(z-2); 0 1/z]
4-element Array{Tuple{Float64,Core.MethodInstance},1}:
 (0.016000032424926758, MethodInstance for convn!(::Array{Int64,1}, ::Array{Int64,1}, ::Array{Int64,1}))
 (0.03099989891052246, MethodInstance for _nloops(::Int64, ::Symbol, ::Expr, ::Expr))
 (0.3710000514984131, MethodInstance for literal_pow(::typeof(^), ::RationalTransferFunction{Int64,:z}, ::Val{2}))
 (12.039999961853027, MethodInstance for hvcat(::Tuple{Int64,Int64}, ::RationalTransferFunction{Int64,:z}, ::Vararg{Union{Number, AbstractRationalTransferFunction},N} where N))

julia> @time Gd = [z^2 z/(z-2); 0 1/z];
  0.000025 seconds (27 allocations: 1.858 KiB)

julia> inf_timing = @snoopi tmin=0.01 sysd = dss(Gd,minimal = true,Ts = 1)
9-element Array{Tuple{Float64,Core.MethodInstance},1}:
 (0.017999887466430664, MethodInstance for setindex!(::Array{Float64,2}, ::SubArray{Float64,1,LinearAlgebra.Transpose{Float64,Array{Float64,2}},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},false}, ::Base.OneTo{Int64}, ::Int64))
 (0.031000137329101562, MethodInstance for hvcat(::Tuple{Int64,Int64}, ::LinearAlgebra.UniformScaling{Bool}, ::Vararg{Union{AbstractArray{T1<:RationalTransferFunction,1}, AbstractArray{Float64,2}, AbstractArray{Float64,1}, AbstractArray{T2<:Polynomials.Polynomial,2}, AbstractArray{T2<:Polynomials.Polynomial,1}, AbstractArray{T1<:RationalTransferFunction,2}, RationalTransferFunction{T4<:Number,X} where X, LinearAlgebra.UniformScaling},N} where N))
 (0.03299999237060547, MethodInstance for _hvcat(::Type{Float64}, ::Tuple{Int64,Int64}, ::Array{Bool,2}, ::Vararg{Union{AbstractArray{T,2}, AbstractArray{T,1}} where T,N} where N))
 (0.0690000057220459, MethodInstance for poly2pm(::Array{Polynomials.Polynomial{Int64,:z},2}))
 (1.2840001583099365, MethodInstance for promote_rtf_eltype(::LinearAlgebra.UniformScaling{Bool}, ::Vararg{Any,N} where N))
 (4.610999822616577, MethodInstance for (::DescriptorSystems.var"#dss##kw")(::NamedTuple{(:Ts, :contr, :obs, :noseig, :minimal, :atol, :rtol),Tuple{Int64,Bool,Bool,Bool,Bool,Float64,Float64}}, ::typeof(dss), ::Array{Int64,3}, ::Array{Int64,3}))
 (9.327000141143799, MethodInstance for _eltype(::Array{RationalTransferFunction{Int64,X} where X,2}))
 (288.6180000305176, MethodInstance for #dss#30(::Int64, ::Bool, ::Bool, ::Bool, ::Bool, ::Float64, ::Float64, ::typeof(dss), ::Array{RationalTransferFunction{Int64,X} where X,2}))
 (300.18099999427795, MethodInstance for (::DescriptorSystems.var"#dss##kw")(::NamedTuple{(:minimal, :Ts),Tuple{Bool,Int64}}, ::typeof(dss), ::Array{RationalTransferFunction{Int64,X} where X,2}))

julia> @time sysd = dss(Gd,minimal = true,Ts = 1);
  0.000175 seconds (349 allocations: 499.531 KiB)

From this, it seems that the last call in dss (a call to another instance of dss) is causing the trouble. I reproduce dss just for completeness.

function dss(R::Union{AbstractVecOrMat{<:RationalTransferFunction},RationalTransferFunction}; Ts::Union{Real,Missing} = missing, 
             minimal::Bool = false, contr::Bool = false, obs::Bool = false, noseig::Bool = false, 
             atol::Real = zero(float(real(_eltype(R)))), rtol::Real = 100*eps(one(atol))*iszero(atol)) 
    ismissing(Ts) && (eltype(R) <: RationalTransferFunction ? Ts = R[1].Ts : Ts = R.Ts )
    isnothing(Ts) && (Ts = 0)
    dss(poly2pm(numpoly.(R)), poly2pm(denpoly.(R)); Ts = Ts, contr = contr, obs = obs, 
            noseig = noseig, minimal = minimal, atol = atol, rtol = rtol)
end

Here is the dismantled example reproducing the essential internal calls:

### Dismantled example
using DescriptorSystems
using MatrixPencils
@time z = rtf('z');
@time Gd = [z^2 z/(z-2); 0 1/z];
@time Gdnum = MatrixPencils.poly2pm(numpoly.(Gd)); 
@time Gdden = MatrixPencils.poly2pm(denpoly.(Gd)); 
@time sysd = dss(Gdnum, Gdden; Ts = 1, minimal = true);

and here are results obtained after a new start:

julia> using DescriptorSystems
julia> using MatrixPencils
julia> @time z = rtf('z');
  0.110744 seconds (212.48 k allocations: 11.060 MiB, 17.12% gc time)

julia> @time Gd = [z^2 z/(z-2); 0 1/z];
 12.870268 seconds (51.38 M allocations: 2.879 GiB, 7.67% gc time)

julia> @time Gdnum = MatrixPencils.poly2pm(numpoly.(Gd));
  2.121918 seconds (7.59 M allocations: 468.375 MiB, 5.49% gc time)

julia> @time Gdden = MatrixPencils.poly2pm(denpoly.(Gd));
  0.041375 seconds (128.26 k allocations: 6.539 MiB)

julia> @time sysd = dss(Gdnum, Gdden; Ts = 1, minimal = true);
 20.339689 seconds (57.45 M allocations: 2.530 GiB, 4.41% gc time)

The compilation times are now the usual ones (at least for me) and the code produces the same result.

I look forward to your comments and thank you for your interest.


I am first trying to finish the project, i.e., to arrive at the level of my own MATLAB tools DSTOOLS. And yes, at a certain moment it could be a good idea to couple this tool with the ControlSystems package.

The computation of the Smith form/Smith-McMillan form is not in my plans, because the numerical issues are very delicate to handle. This is generally the case with the computation of canonical forms. However, all structural and spectral information in the Smith form/Smith-McMillan form can be reliably computed using tools available in both the MatrixPencils package and DescriptorSystems (see the functions gpoleinfo and gzeroinfo).


Besides more specific feedback, I think two basic links should be given here:


These are both very helpful for addressing these little traps.


Maybe a hint:

Note that this (the construction of Gd) seems to take quite a lot of resources to compile.

If you isolate that command, you can verify this, but note that the 0 in your matrix makes a huge difference:

julia> using MatrixPencils, DescriptorSystems

julia> @time z = rtf('z');
  0.129571 seconds (212.49 k allocations: 11.066 MiB)

julia> @time hvcat((2,2),z^2,z/(z-2),z,1/z); # without the zero
  1.029204 seconds (4.40 M allocations: 216.753 MiB, 5.21% gc time)

# and now, with the zero:
julia> @time hvcat((2,2),z^2,z/(z-2),0,1/z)
 16.626721 seconds (48.82 M allocations: 2.749 GiB, 7.79% gc time)

That 0 is causing a type instability somewhere there, which you can see with @code_warntype:

julia> @code_warntype hvcat((2,2),z^2,z/(z-2),0,1/z)

│   %26 = Core._apply_iterate(Base.iterate, DescriptorSystems.promote_to_rtfs, %25, A)::Tuple{RationalTransferFunction{Int64,_A} where _A,RationalTransferFunction{Int64,_A} where _A,Any,RationalTransferFunction{Int64,_A} where _A}

Note the Any there.

Without the zero, that one is solved, although the result is still not ideal: in the body of the function you have this marked in red (and some other instabilities at the beginning):

::NTuple{4,Array{_A,1} where _A}

Those instabilities, particularly the way the zero is being propagated, may be part of the problem.

Even if one uses 0*z instead of 0, compilation proceeds much faster:

julia> @time hvcat((2,2),z^2,z/(z-2),0*z,1/z)
  1.051029 seconds (4.42 M allocations: 217.929 MiB, 4.95% gc time)

Thanks. This is a very interesting observation. And I can confirm your finding. I got the following timing results:

julia> @time Gd = [z^2 z/(z-2); zero(z) 1/z];
  0.923385 seconds (4.43 M allocations: 218.465 MiB, 13.81% gc time)

julia> @time Gd = [z^2 z/(z-2); 0 1/z];
 12.847730 seconds (51.68 M allocations: 2.894 GiB, 7.68% gc time)

Everything is more than 10 times slower because of this single zero!

The two constructions use different hvcat calls. For the first call I changed the code a little, so I am now getting an even better time:

julia> @time Gd = [z^2 z/(z-2); zero(z) 1/z]
  0.787328 seconds (4.43 M allocations: 218.295 MiB, 4.43% gc time)

For the second call I also changed the generation process to enhance the generated type (thanks for the hint). However, this brought no improvement in the time:

julia> @time Gd = [z^2 z/(z-2); 0 1/z]
 13.008802 seconds (51.68 M allocations: 2.894 GiB, 8.70% gc time)

But now I have a drastic reduction of compilation time, from about 600 seconds to about 20 seconds, for my original problem:

julia> @time sysd = dss(Gd,minimal = true,Ts = 1);
 20.359212 seconds (58.55 M allocations: 2.584 GiB, 4.38% gc time)

I think this reduction is the result of enforcing the type Array{RationalTransferFunction{Int64,:z},2} for Gd, while previously this was Array{RationalTransferFunction{Int64,X} where X,2}, which probably led to the huge compilation effort.
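The effect of the element type being concrete can be checked directly in plain Base Julia (using Rational as a stand-in, since it shows the same concrete vs. partially-specified UnionAll contrast):

```julia
# A concretely-typed array vs. one whose element type is a UnionAll.
concrete = Matrix{Rational{Int}}(undef, 2, 2)   # like ...{Int64,:z}
loose    = Matrix{Rational}(undef, 2, 2)        # like ...{Int64,X} where X

println(isconcretetype(eltype(concrete)))   # true
println(isconcretetype(eltype(loose)))      # false: elements are boxed, and
                                            # callers must compile for "where X"
```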

I must add that, had I previously seen this figure, I would certainly not have expressed any “frustration”, since until now I considered such figures normal/acceptable. Now I am wondering whether my modesty was justified, with the hope that even better times can be achieved.

Many thanks for the help. I appreciate it really a lot.


That is great. I still think you should explore this further, because it can probably be much better yet. I don’t see why this should take 13 seconds if using 0*z instead of 0 takes 1 second. Sorting out the way the types are propagated there will probably further improve your compilation time, and possibly also the running time.

I think it’s this bug in Julia: Restore "Improve copyto! for short heterogeneous tuples" · Issue #39240 · JuliaLang/julia. Instantiating a literal heterogeneous array that would be promoted anyway to a homogeneous one is much slower than writing the homogeneous one directly.
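A minimal illustration of that promotion, in plain Base Julia (independent of the packages discussed here): a literal with one mismatched element goes through a heterogeneous tuple and promotion before the homogeneous array is built, while the all-Float64 literal takes the direct path:

```julia
# Both literals produce the same Matrix{Float64}, but the first one starts
# from a heterogeneous (Float64, Float64, Int, Float64) collection of
# arguments that must be promoted element by element.
a = [1.0 2.0; 0 3.0]      # the Int 0 is promoted to Float64
b = [1.0 2.0; 0.0 3.0]    # already homogeneous

println(eltype(a))   # Float64
println(a == b)      # true: same value, different construction path
```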

This reminds me of the following cases where collect and hvcat are surprisingly inefficient:

julia> @btime collect($(1.0, 1.0, 1.0, 1.0, 1.0, 0))
  226.770 ns (15 allocations: 592 bytes)

and for hvcat:

julia> f() = hvcat((3,3), 1, 1, 1, 1, 1, 1.0);

julia> @btime f()
  209.346 ns (8 allocations: 528 bytes)

and even a case with uniform element types:

julia> g(x) = [ x[1]  x[2]  x[3]
                x[1]  x[2]  x[3] ];

julia> @btime g($[1.0, 1.0, 1.0])
  187.309 ns (8 allocations: 288 bytes)

I don’t know if it’s related…

The problem here is compilation time, not really execution time.