Extremely high first-call latency in Julia 1.6 versus 1.5 with a multiphysics PDE solver

I tried 1.7 and it seems to have the same issue; it gets “stuck” at the same place as with Julia 1.6.

You should open an issue, even if you don’t have a minimal example. One thing to do beforehand, if you find the time, is to check whether building Julia from the branch in the PR linked above resolves this issue: https://github.com/JuliaLang/julia/pull/42156


When I’m tuning my model in Pluto.jl with Revise.jl, I find that sometimes, maybe once or twice a day, when I change some parameter of my model and rerun the cell, it gets stuck for several minutes.
I don’t really know what it is doing; even the first run only takes several seconds, but it can get stuck for minutes for no apparent reason.
How can I debug that?

A follow-up on this post. I ran the following code to get some insight into the issue:

using SnoopCompile
using AbstractTrees
tinf = @snoopi_deep include("test/Williamson2ThetaMethodFullNewtonTests.jl")
print_tree(tinf)
println(SnoopCompile.flatten(tinf))

It took around 10 hours on a compute node of a cluster, and it seems that it finished successfully, generating the SnoopCompile report (i.e., it does not actually get stuck; it finishes eventually, but after a very long time). Note that the code runs sequentially; it does not use Distributed. I ran it on the cluster because I do not have enough computational resources on my local machine.

I can share the full report if that helps, but it is a large file (27 MB compressed, 1.5 GB uncompressed).

The header of the @snoopi_deep report is as follows:

InferenceTimingNode: 19820.730026/19999.977028 on Core.Compiler.Timings.ROOT() with 2678 direct children

If I understand correctly, this means that almost all of the compilation time is spent in phases other than type inference. Might it be some sort of bottleneck related to LLVM code generation?
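
For reference, this is how I am reading that line (a sketch, assuming SnoopCompile’s exclusive/inclusive accessors; the printed pair is the exclusive/inclusive time of the ROOT node):

using SnoopCompile

# exclusive time of ROOT ≈ everything except type inference,
# inclusive time ≈ total time of the snooped call
exclusive(tinf) / inclusive(tinf)   # ≈ 19820.73 / 19999.98 ≈ 0.99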

For the record, I have reported the issue on Julia’s GitHub: https://github.com/JuliaLang/julia/issues/43206


Thanks for reporting! Aside: it’s considered good style to make a GitHub issue self-contained. Part of your top post would do just fine for this purpose. It’s perfectly fine to link to the Discourse thread for additional info.

Hello,

your test seems to run in the global environment. This is not optimal, as far as I have learned. I have therefore modified it locally to:

module Williamson2ThetaMethodFullNewtonTests

using Test
using Gridap
using GridapGeosciences
using SparseMatricesCSR

# Solves the steady state Williamson2 test case for the shallow water equations on a sphere
# of physical radius 6371220m. Involves a modified coriolis term that exactly balances
# the potential gradient term to achieve a steady state
# reference:
# D. L. Williamson, J. B. Drake, J. J. Hack, Rüdiger Jakob, P. N. Swarztrauber (1992)
# J Comp. Phys. 102 211-224

include("Williamson2InitialConditions.jl")

function test()

l2_err_u = [0.011370921987771046 , 0.002991229234176333 ]
l2_err_h = [0.005606685579166809, 0.001458107077200681 ]

order=1
degree=4
θ=0.5
for i in 1:2
  n      = 2*2^i
  nstep  = 5*n
  Uc     = sqrt(g*H₀)
  dx     = 2.0*π*rₑ/(4*n)
  dt     = 0.25*dx/Uc
  println("timestep: ", dt)   # gravity wave time step
  T      = dt*nstep
  τ      = dt/2
  model = CubedSphereDiscreteModel(n; radius=rₑ)
  hf, uf = shallow_water_theta_method_full_newton_time_stepper(model, order, degree,
                                                               h₀, u₀, f₀, topography, g, θ, T, nstep, τ;
                                                               write_solution=false,
                                                               write_solution_freq=5,
                                                               write_diagnostics=true,
                                                               write_diagnostics_freq=1,
                                                               dump_diagnostics_on_screen=true)

  Ω     = Triangulation(model)
  dΩ    = Measure(Ω, degree)
  hc    = CellField(h₀, Ω)
  e     = h₀-hf
  err_h = sqrt(sum(∫(e⋅e)*dΩ))/sqrt(sum(∫(hc⋅hc)*dΩ))
  uc    = CellField(u₀, Ω)
  e     = u₀-uf
  err_u = sqrt(sum(∫(e⋅e)*dΩ))/sqrt(sum(∫(uc⋅uc)*dΩ))
  println("n=", n, ",\terr_u: ", err_u, ",\terr_h: ", err_h)

  #@test abs(err_u - l2_err_u[i]) < 10.0^-12
  #@test abs(err_h - l2_err_h[i]) < 10.0^-12
end

end

end # module

I’ve run this test on both Julia 1.5.4 and 1.6.4 and can confirm a dramatic increase in execution time going from 1.5.4 to 1.6.4.

Would it be possible to simplify the code (e.g. parameters like grid size, number of equations, etc.) so that it runs in, say, seconds instead of minutes on Julia 1.5? I’d then expect an MWE on Julia 1.6 to run in minutes instead of hours, which might simplify the analysis.


I’ve run the code on Julia 1.5.4 / 1.6.4 with the command line switch --trace-compile. This generates a 100 MB file on Julia 1.5.4. I stopped the corresponding run on 1.6.4 after a couple of minutes at a size of around 15 MB. The log files are largely similar; the first function not shown in the 1.6.4 log is

precompile(Tuple{typeof(Gridap.FESpaces._numeric_loop_vector!)
[cut 300 kB type signature]

So are we talking about a problem in GridapGeosciences.jl or in Gridap.jl?

A long shot here. At this point in your code, you call:

residualΔuΔhqF=residual(ΔuΔhqF,dY)

while the function residual has this signature:

    for step=1:N
      function residual((Δu,Δh,qvort,F),(v,q,s,v2))

and is being defined from within a loop of size N (I don’t know how large N is).

Are you by any chance somehow creating nested tuples in these loops?

I have annotated the code in many places, and every step seems to take a long time; it seems that you may have enormously large tuples being carried over, causing overspecialization of the code everywhere.

Note this small example:

julia> x = rand(1000,1000); y = rand(1000,1000); z = rand(1000,1000);

julia> x + y + z; # just to compile the sum

julia> f(x,y,z) = x + y + z
f (generic function with 1 method)

julia> @time f(x,y,z);
  0.005496 seconds (606 allocations: 7.672 MiB, 37.24% compilation time)

julia> g((a,b,c)) = a + b + c
g (generic function with 1 method)

julia> @time g((x,y,z));
  0.007508 seconds (8.85 k allocations: 8.206 MiB, 54.00% compilation time)

I am not sure if this is really meaningful, but note in particular the difference in allocations.


It seems (I’m not completely sure) that changing how residual is called, like this:

  function residual(Δu,Δh,qvort,F,v,q,s,v2)
         one_m_θ = (1-θ)
...
    residualΔuΔhqF=residual(Δu, Δh, q, F,dY[1],dY[2],dY[3],dY[4])

got me past the point where it was apparently getting stuck. Now it gets stuck at the next call, which again passes dY as a single variable.

I would be suspicious of the heavy use of tuples and immutable structures as containers for large data of varying types. But this is only a guess, of course.

I was not able to run the example on 1.5. There is this change between the two versions, and compile times are mentioned in it: https://github.com/JuliaLang/julia/pull/35976

It should be possible to run the tests before and after that change to see whether you are hitting a pathological case of the problems it may have introduced.


Ok, done. Thanks for the suggestion.

Thanks for the info. GridapGeosciences.jl relies heavily on Gridap.jl. Thus, I wouldn’t be surprised if the bulk of the compilation time is concentrated in Gridap.jl code triggered from GridapGeosciences.jl.

Would it be possible to simplify the code (e.g. parameters like grid size, number of equations, etc.) so that it runs in, say, seconds instead of minutes on Julia 1.5? I’d then expect an MWE on Julia 1.6 to run in minutes instead of hours, which might simplify the analysis.

There is this post: How to cut down compile time when inference is not the problem? However, I do not actually know whether the causes underlying the observations in this thread and in that one are related.


Then they might want to modify all @noinline methods in SparseMatrixAssemblers.jl, adding a @nospecialize like so:

@noinline function _numeric_loop_vector!(vec,caches,cell_vals,cell_rows)
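  # a bare @nospecialize (with no arguments) applies to all arguments of this method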
  @nospecialize
  add_cache, vals_cache, rows_cache = caches
  @assert length(cell_vals) == length(cell_rows)
  add! = AddEntriesMap(+)
  for cell in 1:length(cell_rows)
    rows = getindex!(rows_cache,cell_rows,cell)
    vals = getindex!(vals_cache,cell_vals,cell)
    evaluate!(add_cache,add!,vec,vals,rows)
  end
end

This seems to make things workable again for me on Julia 1.7.0-rc3. But be aware of this open issue.


Hi @goerch! I implemented the function _numeric_loop_vector! in Gridap.jl.

To give more background: in this function call, caches and cell_vals are potentially VERY complex nested immutable objects (so no surprise that this is a hot spot). I am curious how @nospecialize would affect run-time performance, taking into account that length(cell_rows) can be of the order of 10^7 for a large finite element mesh.

This seems to make things workable again for me on Julia 1.7.0-rc3. But be aware of this open issue.

How about calling Base.inferencebarrier on caches and cell_vals, which are the bad guys? Do you think this can also improve things?
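
I mean something like this (just a sketch; the wrapper name is made up and I have not tried it):

function numeric_loop_vector_hidden!(vec, caches, cell_vals, cell_rows)
  # Base.inferencebarrier returns its argument but hides its type from the compiler,
  # so the call below is made via dynamic dispatch instead of being inferred
  # with the huge nested types of caches/cell_vals
  _numeric_loop_vector!(vec,
                        Base.inferencebarrier(caches),
                        Base.inferencebarrier(cell_vals),
                        cell_rows)
end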

I am aware that we have complex types in Gridap.jl, but this was not a fatal problem for sophisticated equations until Julia 1.6.

Hi @fverdugo,

my first suspicion would be that the increased compile time is due to new optimizations. So I’d expect a discussion about missed optimizations next, of course;)

Will check the example from here with @btime next.

Edit: done.


I have an issue with compile time latency with a neural network in Julia 1.6 and 1.7-rc2 (compared to Julia 1.5) which might be related:

If one uses the option -O1, the compile times are vastly reduced. When inspecting the run with callgrind, the CPU time is mostly spent in LLVM, even on the second call of the same function with the same arguments (using the default optimization level).

I tried compiling Julia from source, changing the parameters max_methods, tupletype_depth, tuple_splat and inline_tupleret_bonus, but without luck.

When I interrupted Julia during the excessive compilation, it seemed to be busy in the DAGCombiner phase of LLVM.
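
As an aside, if only specific packages trigger the pathological LLVM time, the optimization level can also be lowered per module instead of for the whole session (a sketch; MyHeavyModule is a placeholder name):

module MyHeavyModule

# Julia ≥ 1.5: compile the code belonging to this module at -O1,
# while the rest of the session keeps the default optimization level
Base.Experimental.@optlevel 1

# ... module code ...

end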


Interesting, that idea hadn’t even occurred to me: is this simply an (occasionally very) expensive optimization that is disabled at -O1 and enabled at -O2?

Edit: OK, I checked this hypothesis for @amartinhuertas’ problem, and indeed: the original test case works in reasonable time on Julia 1.6.4 without adding @nospecialize if one simply reduces the optimization level to -O1!

Edit: updated https://github.com/JuliaLang/julia/issues/43206#issuecomment-980599991


Trying to reduce the original MWE, I stumbled upon quite a few type instabilities and have the distinct impression that there is a relation between type instabilities and longer compile times at -O2, at least under newer versions of Julia (I worked with 1.8.0 to make the best use of JET). Therefore I filed gridapapps/GridapGeosciences.jl#28 and gridapapps/GridapGeosciences.jl#29.
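
The kind of check I ran looks roughly like this (a sketch, assuming JET’s optimization analysis; reports can differ between Julia/JET versions):

using JET
include("test/Williamson2ThetaMethodFullNewtonTests.jl")

# report runtime dispatch and other optimization problems (i.e. type instabilities)
# encountered while inferring through the call graph of test()
@report_opt Williamson2ThetaMethodFullNewtonTests.test()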

Can someone confirm the impression?

Edit: not a native speaker…

For the record, we could already “solve” (bypass) the issue. See https://github.com/JuliaLang/julia/issues/43206#issuecomment-983474073 for more details.
