How to solve implicitly a 2D Heat diffusion equation / ParallelStencil / DifferentialEquation /

Context

I am no simulation nor numerical expert and am absolutely not familiar with differential Equations.
So I am trying to simulate heat diffusion in a 2D non uniform material on a Cartesian rectilinear mesh (varying Δy and Δx) with the given properties for each cell at l, m coordinates

Ω : surface
ρ : volumic mass
cp : thermal capcity
λ : thermal conductivity

I intend to run it on really really large domains, with distributed memory multiple CPUs and possibly multiple GPUs.
Thus i choose ParallelStencil.jl coupled with ImplicitGlobalGrid.jl .

The Right hand side (laplacian?) of the equation is the following
image

Explicit

I managed to implement easily a fairly efficient explicit version using with the following kernel

@parallel_indices (ix,iy) function step_explicit_2D_indices!(
H :: AbstractArray, # Temperature at time n
H2 :: AbstractArray, # Temperature at time n+1
λ ,
Δx ,
Δy ,
_ρcpΩ ,
Δt ,
_avΔx,
_avΔy)

if (ix>1 && ix<size(H,1) && iy>1 && iy<size(H,2))
    H2[ix,iy] =  H[ix,iy] +

        # Diffusivity
        (Δt * _ρcpΩ[ix,iy]) *

        #Right Hand Side/laplacian
        # X flux
        ((@avHarm_R(λ) * @d_R(H, ix, iy)    * _avΔx[ix]) + 
            (@avHarm_L(λ) * @d_L(H, ix, iy) * _avΔx[ix-1])* Δy[iy-1] + 

        # Y Flux
        ((@avHarm_T(λ) * @d_T(H, ix, iy)    * _avΔy[iy]) + 
            (@avHarm_B(λ) * @d_B(H, ix, iy) * _avΔy[iy-1])* Δx[ix-1]));
end
return nothing

end

with

@avHarm<S> Harmonic average of λ on a given Side
@d_<S>     Heat difference on a given side
_avΔx      Arithmetic average of Deltas x/y

Implicit

However I’d want it to be unconditionally stable given the time step Δt, so i have to solve it Implicitly.

Direct Method

I think (i might be wrong) that direct linear solver (Tn+1 = A \ Tn ) will not work well at all for large domains, I tried this approach, constructing a sparse five banded laplacian matrix (CSC format) ( nx* ny by nx*ny), the backslash Operator solving the linear system, utilizes a single thread.

Is it possible to solve linear systems in parallel?

Itterative

I did find an example of Implicit iterative solver on the following Github repo of ParallelStencil workshop at Juliacon 2021
But I don’t have the knowledge to adapt it to my situation and fail to gasp the concept of iterative solving.

How could i iteratively solve the heat equation with ParallelStencil given my model?

There is also DifferentialEquations.jl but i fail to understand how to discretize my equation with it and it appears complicated to me. I am not sure if it can do distributed computing.
Could DifferentialEquation.jl be used in my case?

Thanks for any help

Hi @Eurel, welcome and thanks for posting your question. If you are interested in running simulations on large domains using multi-CPUs/GPUs, ParallelStencil and ImplicitGlobalGrid may be very appropriate.

How could i iteratively solve the heat equation with ParallelStencil given my model?

I prepared 3 codes solving the equation you are interested in using an explicit and implicit approach. You can find the codes in this GitHub PTdiffusion_mwe repository. The repo contains 3 codes:

Some additional info is in the repo’s README, including further references to a paper on the pseudo-transient method and a Julia GPU course (https://pde-on-gpu.vaw.ethz.ch) which may be a valuable source of information.

6 Likes

Yes, and there’s an issue which shows how to do it:

We generally find using a Runge-Kutta Chebyshev method as quite a bit faster than the other techniques in this kind of scenario, so we’d recommend looking at ROCK2 or ROCK4 before looking at something implicit, though things like FBDF (maybe requires FBDF(autodiff=false)) or CVODE_BDF (also look at CVODE_BDF(linear_solver=:GMRES) or CG, depending on positive-definiteness of the Jacobian). These should be much much faster than the handwritten time stepping loops, last time I checked it was >10x.

For more details on optimizing time-step choices for large PDE codes, you may want to check out this tutorial:

https://diffeq.sciml.ai/stable/tutorials/advanced_ode_example/

It should parallelize using UMFPACK, though using MKLPardiso might be a good choice here:

http://linearsolve.sciml.ai/dev/solvers/solvers/#Pardiso.jl

Though most likely that will be outperformed by a good tolerance choice of iLU (or multigrid) mixed with GMRES/CG.

1 Like

for iterative methods, ie for inverting I-dt Delta, you might need a preconditioner

Yes, and the preconditioners are defined in the tutorial Solving Large Stiff Equations · DifferentialEquations.jl. If you can get the sparsity pattern (which is generated automatically in the tutorial using Symbolics.jl), then iLU and algebraic multigrid preconditioners are pretty easy to specify.

So by using DifferentialEquations, if i got your code example right, I do not have to write the explicit time stepping anymore, I just have to write a kernel exhibiting the evolution of du (T2) over u (T) independently of the Δt ? And then the algorithm I choose will dynamically tune the the time stepping over the solving part ?
It is indeed way faster than iterating by hand !

If I want to solve this implicitly using DifferentialEquations, do i have simply to declare the problem as a SplitODEProblem and use other algorithms listed here ?

Yes indeed.

No, if you just do something like solve(prob,CVODE_BDF(linear_solver=:GMRES)) then it’s already going to do implicit with Newton-Krylov. You just change the solver, nothing else is required.

SplitODEProblem is only required if you want to be doing IMEX (mixing implicit and explicit time steps).

I contented myself of copypasting the sparse jacobian declaration and algebraic multigrid the preconditionner here as I thought it would do it automaticaly. I found algebraicmultigrid better than iLU

using ModelingToolkit,DifferentialEquations, Plots, OrdinaryDiffEq
using LinearAlgebra, SparseArrays , Symbolics
using BenchmarkTools
using AlgebraicMultigrid


function algebraicmultigrid2(W,du,u,p,t,newW,Plprev,Prprev,solverdata)
    if newW === nothing || newW
      A = convert(AbstractMatrix,W)
      Pl = AlgebraicMultigrid.aspreconditioner(AlgebraicMultigrid.ruge_stuben(A, presmoother = AlgebraicMultigrid.Jacobi(rand(size(A,1))), postsmoother = AlgebraicMultigrid.Jacobi(rand(size(A,1)))))
    else
      Pl = Plprev
    end
    Pl,nothing
end

using IncompleteLU
function incompletelu(W,du,u,p,t,newW,Plprev,Prprev,solverdata)
  if newW === nothing || newW
    Pl = ilu(convert(AbstractMatrix,W), τ = 0.01)
  else
    Pl = Plprev
  end
  Pl,nothing
end
Base.eltype(::IncompleteLU.ILUFactorization{Tv,Ti}) where {Tv,Ti} = Tv

With the following diffusion function

# Harmonic average
macro Harm(A,B) esc(:(2 .*($A .* $B)./($A .+ $B) ))  end
# Arithmetic average
macro Avg(A,B)  esc(:(($A .+ $B).*0.5)) end
# Difference
macro Dif(A,B)  esc(:($A .- $B)) end

function diffuse!(dT,T,p,t)
        nx, ny, λ, _ρcpΩ, Δx, Δy = p
        _ρcpΩInn = view(_ρcpΩ,2:nx-1,2:ny-1)

        ΔxLeft  = view(Δx,1:nx-2) 
        ΔxRight = view(Δx,3:nx) 
        ΔxInn   = view(Δx,2:nx-1)
        ΔyBot   = view(Δy,1:ny-2)
        ΔyTop   = view(Δy,3:ny)
        ΔyInn   = view(Δy,2:ny-1)

        λLeft   = view(λ,1:nx-2 ,2:ny-1) 
        λRight  = view(λ,3:nx   ,2:ny-1)
        λTop    = view(λ,2:nx-1 ,3:ny)
        λBot    = view(λ,2:nx-1 ,1:ny-2)
        λInn    = view(λ,2:nx-1,2:ny-1)

        TLeft   = view(T,1:nx-2 ,2:ny-1) 
        TRight  = view(T,3:nx   ,2:ny-1)
        TTop    = view(T,2:nx-1 ,3:ny)
        TBot    = view(T,2:nx-1 ,1:ny-2)
        TInn    = view(T,2:nx-1,2:ny-1)
        
        dTInn   = view(dT,2:nx-1,2:ny-1) 

        @. dTInn =  _ρcpΩInn * ( 
            (( (@Harm(λRight,λInn)  * @Dif(TRight,TInn) / @Avg(ΔxRight,ΔxInn)) +
                (@Harm(λInn,λLeft)  * @Dif(TLeft,TInn)  / @Avg(ΔxInn,ΔxLeft))) * ΔyInn )+
            (( (@Harm(λTop,λInn)    * @Dif(TTop,TInn)   / @Avg(ΔyTop,ΔyInn)) +
                (@Harm(λInn,λBot)   * @Dif(TBot,TInn)   / @Avg(ΔyInn,ΔyBot))) * ΔxInn )
            )

end

Runnable script:

# Domain Definition
tmax    = 0.001
tspan   = (0.0,tmax)
d       = 128
nx, ny  = d, d
lx, ly  = 1.0, 1.0
dx, dy  = (lx,ly) ./ (nx,ny)
Δx = fill(dx,nx)
Δy = fill(dy,ny)
λ      = 1 .*ones(nx,ny)
_ρcpΩ  = 1.0 ./( Δx' .* Δy)

# Init Cond
T = zeros(nx,ny)
T  .= Array([100*exp(-(((ix)*dx-lx/2)/2)^2-(((iy)*dy-ly/2)/2)^2) +
                          50*exp(-(((ix)*dx-lx/2)/2)^2-(((iy)*dy-ly/2)/2)^2) for ix=1:size(T,1), iy=1:size(T,2)])


# For Visualisation
Tmin = minimum(T)
Tmax = maximum(T)

# Params
p = [nx, ny, λ, _ρcpΩ, Δx, Δy]

# Sparsity patern of Jacobian
dT = copy(T)
@time jac_sparsity = Symbolics.jacobian_sparsity((du,u)->diffuse!(du,u,p,0.0),dT,T)
f     = ODEFunction(diffuse!;jac_prototype=float.(jac_sparsity))
prob  = ODEProblem(f,T,tspan,p)

# Solver preconditioner
prec = algebraicmultigrid2
# 0.016s  Single threaded
@btime sol = solve(prob,ROCK4(),save_everystep=false)

# 0.138s Single threaded
@btime sol = solve(prob,RK4(),save_everystep=false)

##  autodiff=true

# 0.327s Multithreaded
@btime sol =  solve(prob,FBDF(linsolve=UMFPACKFactorization(),precs=prec,concrete_jac=true),save_everystep=false)
# 2.572s Multithreaded
@btime sol =  solve(prob,FBDF(linsolve=KrylovJL_GMRES(),precs=prec,concrete_jac=true),save_everystep=false)

## autodiff=false

# 0.304s Multithreaded
@btime sol = solve(prob,FBDF(linsolve=UMFPACKFactorization(),autodiff=false,precs=prec,concrete_jac=true),save_everystep=false)
#0.409s Multithreaded
@btime sol = solve(prob,FBDF(linsolve=KrylovJL_GMRES(),autodiff=false,precs=prec,concrete_jac=true),save_everystep=false)

# 0.257s  Multithreaded
@btime sol = solve(prob,TRBDF2(linsolve=UMFPACKFactorization(),precs=prec,concrete_jac=true),save_everystep=false)

and then there is Sundials CVODE_BDF that is blazing fast

using Sundials
#0.010s Singlethreaded
@btime sol = solve(prob,CVODE_BDF(linear_solver=:GMRES),save_everystep=false)

Is there something wrong with my discretisation of the Heat Diffusion equation, with my Jacobian or preconditionners for FBDF, QNDF and TRBDF2 to be that much outperformed?

Why are ROCK4 and RK4 and CVODE_BDF runing on only one thread whereas TRBDF and FBDF are using (some of) the ones I launched Julia with?

What did QNDF and KenCarp47 give? ROCK2? It looks like this is in the “essentially non-stiff” regime, so look at Vern7, Tsit5, and BS3 as well.

It depends on the solver process. Since your f function is not parallelized in this definition, it’s only going to multithread if the solver does it. Things like UMFPACK factorizations will multithread in the factorization, and that’s why you see it.

QNDF exceeded my patience
here are the other ones

# 0.514s
@btime sol = solve(prob,KenCarp47(linsolve=UMFPACKFactorization(),precs=prec,concrete_jac=true),save_everystep=false)

# 0.034s
@btime sol = solve(prob,ROCK2(),save_everystep=false)

# 0.154s
@btime sol = solve(prob,Vern7(),save_everystep=false)

# 0.105s
@btime sol = solve(prob,Tsit5(),save_everystep=false)

# 0.058s
@btime sol = solve(prob,BS3(),save_everystep=false)

I see i have to redefine the problem as an EsembleProblem to get Multithreading or GPU computing working, gonna check it out

Okay yes, this is essentially non-stiff. Look at VCABM

For iterative methods to match sundials, you have to tune the precision of the linear solver. I recently found that the defaults are too “precise”, hence requires a lot of (linear) iterations. Basically, you have to match the settings of GMRES from cvode

I have to tune the precision of KrylovJL_GMRES / UMFPACKFactorization / KLUFactorization ?

of KrylovJL_GMRES

I don’t know how to proceed :sweat_smile:
Does it takes parameters or how could I retrieve settings of GMRES from cvode ?

Here is an example that I use

algMF = Rodas4P(linsolve = KrylovJL_GMRES(rtol = 1e-12, atol = 1e-10, itmax = 20, verbose = 0), concrete_jac = false, precs = precFwave)

By turning versbose = 1, you can monitor the number of iterations. You can also limit itmax and see if solve returns instability. If not, decrease itmax or rtol

2 Likes

Is the solver supposed to prompt it when there’s an instability?

It does. Question then becomes about balance. If you iterate too long, then you are doing useless compute. If you don’t iterate long enough, then a solve is unstable, it causes a step rejection and a smaller time step, and thus more compute. There’s a happy middle, and we haven’t optimized the setups of the Krylov methods to automatically hit that at this point.