Extremely high first-call latency in Julia 1.6 versus 1.5 with a multiphysics PDE solver

Dear all,

We are experiencing extremely high first-call latency with Julia 1.6 when running our geophysical-flows multiphysics finite element PDE solver (on the order of 15 hours on a supercomputer node with 192 GB of memory; on a local laptop the process is killed after a few hours by the OS due to memory consumption). Obviously, this significantly jeopardizes development.

On the other hand, it is surprising that with Julia 1.5 the first-call latency of the very same code is only around 15 minutes on the local laptop referred to above.

For the record, the code is publicly available in the following git repo

https://github.com/gridapapps/GridapGeosciences.jl

and can be executed as follows:

git clone https://github.com/gridapapps/GridapGeosciences.jl
cd GridapGeosciences.jl
git checkout remove_dependency_pardiso
julia --project=.
using Pkg; Pkg.instantiate()
include("test/Williamson2ThetaMethodFullNewtonTests.jl")

We would like to understand the cause of the problem. It seems that some feature(s) of our code are challenging the Julia compiler, but it is really hard to say without a deep understanding of the underlying machinery. It would be ideal to come up with an MWE, but the complexity of the code, and the fact that it stalls with Julia 1.6, do not make the task easy.

Is anyone in the community experiencing similar issues? Which performance-analysis tools could we use to detect the cause of the problem? We suspect that this issue might be related to How to cut down compile time when inference is not the problem?, but we are not certain.

Thanks in advance for your help!
Regards,
Alberto.

4 Likes

Maybe this thread gives you some insight: Understanding and optimizing compiler time (just a bit)

1 Like

FWIW, things get “stuck” for me when the time stepper function is first called, i.e. https://github.com/gridapapps/GridapGeosciences.jl/blob/master/test/Williamson2ThetaMethodFullNewtonTests.jl#L37

I tried to run your MWE and was stymied by the non-minimal requirements of GridapPardiso.jl. As a longer-term fix, you may want to instead target Pardiso.jl, which should automatically grab BinaryBuilder.jl-provided MKL binaries.
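In case it helps, here is a minimal sketch of what the Pardiso.jl side of that could look like, assuming Pardiso.jl’s documented MKLPardisoSolver/solve API (this is not GridapPardiso code, just an illustration):

using Pardiso, SparseArrays, LinearAlgebra

# Hedged sketch: the MKL solver comes from BinaryBuilder-provided binaries,
# so no manual MKL installation should be needed.
ps = MKLPardisoSolver()
A  = sparse(rand(100, 100) + 100I)   # illustrative well-conditioned sparse matrix
b  = rand(100)
x  = solve(ps, A, b)                 # analyze + factorize + solve in one call
@show norm(A * x - b)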

Depending on how your cluster’s filesystem is set up, you may be running into this issue.

edit: pending fix targeted for Julia 1.8.

1 Like

Ok. The same here. Which version of Julia are you using? 1.6.3?

Thanks for that. I was aware of the BinaryBuilder.jl-provided MKL binaries, but have not had time to set them up in GridapPardiso yet. We definitely have to leverage that in GridapPardiso as well; it is much easier.

I have modified the commit sha in my original post to point to a commit in which GridapPardiso is no longer necessary.

1 Like

Have you tried running this on 1.7 (or a recent master)?

1 Like

Not yet. But I will definitely give it a try and let you know the outcome. Thanks!

1 Like

I tried 1.7 and it seems there is the same issue; it gets “stuck” at the same place as with Julia 1.6.

You should open an issue, even if you don’t have a minimal example. One thing you could do beforehand, if you find the time, is to check whether building Julia from the branch in the PR linked above resolves this issue: https://github.com/JuliaLang/julia/pull/42156

2 Likes

When I tune my model in Pluto.jl with Revise.jl, I find that sometimes, maybe once or twice a day, when I change a parameter of my model and rerun the cell, it gets stuck for several minutes.
I don’t really know what it is doing; even the first run takes only a few seconds, but it can get stuck for minutes for no apparent reason.
How can I debug that?

A follow-up on this post. I ran the following code to get some insight into the issue:

using SnoopCompile
using AbstractTrees
tinf = @snoopi_deep include("test/Williamson2ThetaMethodFullNewtonTests.jl")
print_tree(tinf)
println(SnoopCompile.flatten(tinf))

It took around 10 hours on a compute node of a cluster, and it seems that it finished successfully, generating the SnoopCompile report (i.e., it does not actually get stuck; it finishes eventually, but after a very long time). Note that the code runs sequentially; it does not use Distributed. I ran it on the cluster because I do not have enough computational resources on my local machine.

I can share the full report if that helps, but it is a long file (27 MB compressed, 1.5 GB uncompressed).

The header of the @snoopi_deep output is as follows:

InferenceTimingNode: 19820.730026/19999.977028 on Core.Compiler.Timings.ROOT() with 2678 direct children

If I understand correctly, this means that almost all of the compilation time is spent in phases other than type inference. Might it be some sort of bottleneck related to LLVM code generation?
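For reference, a quick way to quantify that reading of the header (a sketch assuming tinf is the InferenceTimingNode returned by @snoopi_deep above, and using SnoopCompile’s inclusive/exclusive accessors):

using SnoopCompile

# exclusive(tinf) is the time attributed to ROOT itself, i.e. everything other
# than type inference of the measured calls; inclusive(tinf) is the total.
total          = inclusive(tinf)        # ~19999.98 s in the report above
non_inference  = exclusive(tinf)        # ~19820.73 s
inference_time = total - non_inference  # ~179 s spent in type inference
@show inference_time / total            # ≈ 0.9%, so inference is not the bottleneck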

For the record, I have reported the issue on Julia’s GitHub: https://github.com/JuliaLang/julia/issues/43206

2 Likes

Thanks for reporting! Aside: it is considered good style to make a GitHub issue self-contained. Part of your top post would do just fine for this purpose. It is fine to link to the Discourse thread for additional info.

Hello,

your test seems to run in global scope. This is not optimal, as far as I have learned. I have therefore modified it locally to:

module Williamson2ThetaMethodFullNewtonTests

using Test
using Gridap
using GridapGeosciences
using SparseMatricesCSR

# Solves the steady state Williamson2 test case for the shallow water equations on a sphere
# of physical radius 6371220m. Involves a modified coriolis term that exactly balances
# the potential gradient term to achieve a steady state
# reference:
# D. L. Williamson, J. B. Drake, J. J. Hack, R. Jakob, P. N. Swarztrauber (1992)
# J Comp. Phys. 102 211-224

include("Williamson2InitialConditions.jl")

function test()

l2_err_u = [0.011370921987771046 , 0.002991229234176333 ]
l2_err_h = [0.005606685579166809, 0.001458107077200681 ]

order=1
degree=4
θ=0.5
for i in 1:2
  n      = 2*2^i
  nstep  = 5*n
  Uc     = sqrt(g*H₀)
  dx     = 2.0*π*rₑ/(4*n)
  dt     = 0.25*dx/Uc
  println("timestep: ", dt)   # gravity wave time step
  T      = dt*nstep
  τ      = dt/2
  model = CubedSphereDiscreteModel(n; radius=rₑ)
  hf, uf = shallow_water_theta_method_full_newton_time_stepper(model, order, degree,
                                                               h₀, u₀, f₀, topography, g, θ, T, nstep, τ;
                                                               write_solution=false,
                                                               write_solution_freq=5,
                                                               write_diagnostics=true,
                                                               write_diagnostics_freq=1,
                                                               dump_diagnostics_on_screen=true)

  Ω     = Triangulation(model)
  dΩ    = Measure(Ω, degree)
  hc    = CellField(h₀, Ω)
  e     = h₀-hf
  err_h = sqrt(sum(∫(e⋅e)*dΩ))/sqrt(sum(∫(hc⋅hc)*dΩ))
  uc    = CellField(u₀, Ω)
  e     = u₀-uf
  err_u = sqrt(sum(∫(e⋅e)*dΩ))/sqrt(sum(∫(uc⋅uc)*dΩ))
  println("n=", n, ",\terr_u: ", err_u, ",\terr_h: ", err_h)

  #@test abs(err_u - l2_err_u[i]) < 10.0^-12
  #@test abs(err_h - l2_err_h[i]) < 10.0^-12
end

end

end # module

I’ve run this test on both Julia 1.5.4 and 1.6.4 and can confirm a dramatic increase in execution time for the code.

Would it be possible to simplify the code (i.e. parameters like grid size, number of equations, etc.) so that it runs in, say, seconds instead of minutes on Julia 1.5? I would then expect such an MWE to run in minutes instead of hours on Julia 1.6, which might simplify the analysis.
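For concreteness, here is a possible reduction along those lines, reusing the test() structure above inside the same module, with a single coarse resolution and one time step; the values of n and nstep are illustrative, and I have not checked whether they still exercise the slow path:

function test_small()
  order  = 1
  degree = 4
  θ      = 0.5
  n      = 2      # coarsest cubed-sphere resolution
  nstep  = 1      # a single time step should still trigger the expensive compilation
  Uc     = sqrt(g*H₀)
  dx     = 2.0*π*rₑ/(4*n)
  dt     = 0.25*dx/Uc   # gravity wave time step
  T      = dt*nstep
  τ      = dt/2
  model  = CubedSphereDiscreteModel(n; radius=rₑ)
  # diagnostics and solution output disabled to keep the MWE as small as possible
  shallow_water_theta_method_full_newton_time_stepper(model, order, degree,
                                                      h₀, u₀, f₀, topography, g, θ, T, nstep, τ;
                                                      write_solution=false,
                                                      write_solution_freq=5,
                                                      write_diagnostics=false,
                                                      write_diagnostics_freq=1,
                                                      dump_diagnostics_on_screen=false)
end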

1 Like

I’ve run the code on Julia 1.5.4 and 1.6.4 with the command-line switch --trace-compile. This generates a 100 MB file on Julia 1.5.4. I stopped the corresponding run on 1.6.4 after a couple of minutes, at a size of around 15 MB. A diff of the two log files shows that they largely coincide up to that point. The first function not shown in the 1.6.4 log is:

precompile(Tuple{typeof(Gridap.FESpaces._numeric_loop_vector!)
[cut 300 kB type signature]

So are we talking about a problem in GridapGeosciences.jl or in Gridap.jl?

A long shot. Here, you call:

residualΔuΔhqF=residual(ΔuΔhqF,dY)

while the function residual has this signature:

    for step=1:N
      function residual((Δu,Δh,qvort,F),(v,q,s,v2))

and it is being defined inside a loop of size N (I don’t know how large N is).

By any chance, are you somehow creating nested tuples in these loops?

I have annotated the code in many places, and every step of the code seems to take a long time. It looks like you may have enormously large tuples being carried around, causing overspecialization of the code everywhere.

Note this small example:

julia> x = rand(1000,1000); y = rand(1000,1000); z = rand(1000,1000);

julia> x + y + z; # just to compile the sum

julia> f(x,y,z) = x + y + z
f (generic function with 1 method)

julia> @time f(x,y,z);
  0.005496 seconds (606 allocations: 7.672 MiB, 37.24% compilation time)

julia> g((a,b,c)) = a + b + c
g (generic function with 1 method)

julia> @time g((x,y,z));
  0.007508 seconds (8.85 k allocations: 8.206 MiB, 54.00% compilation time)

I am not sure if this is really meaningful, but note in particular the difference in allocations.

4 Likes

It seems (though I am not completely sure) that changing how residual is called, like this:

  function residual(Δu,Δh,qvort,F,v,q,s,v2)
         one_m_θ = (1-θ)
...
    residualΔuΔhqF=residual(Δu, Δh, q, F,dY[1],dY[2],dY[3],dY[4])

got me past the point where it was apparently getting stuck. Now it gets stuck at the next call, which again passes dY as a single variable.

I would be suspicious of the heavy use of tuples and immutable structures as containers for large data of varying types. But this is only a guess, of course.
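To make that guess concrete, here is a tiny toy example (unrelated to Gridap’s actual types) showing how every distinct tuple type forces a fresh specialization, and hence fresh compilation:

# Each call with a tuple of a new concrete type triggers a new specialization;
# @time reports the extra compilation cost.
h(t::Tuple) = sum(first, t)

big1 = ntuple(i -> (i, Float64(i)), 30)   # NTuple{30, Tuple{Int64, Float64}}
big2 = ntuple(i -> (i, Float32(i)), 30)   # different element type => new method instance
@time h(big1)   # includes compilation for the first tuple type
@time h(big2)   # compiles again for the second tuple type
@time h(big1)   # already compiled: essentially no compilation time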

I was not able to run the example on 1.5. There is this change between the two versions, and compile times are mentioned there: https://github.com/JuliaLang/julia/pull/35976

It should be possible to run the tests before and after that change to see whether you are hitting a pathological case of the problems it may have introduced.

2 Likes

Ok, done. Thanks for the suggestion.

Thanks for the info. GridapGeosciences.jl relies heavily on Gridap.jl, so I wouldn’t be surprised if the bulk of the compilation time is concentrated in Gridap.jl code triggered from GridapGeosciences.jl.