Solving HJB PDE using deep learning

Dear All,

sorry for bothering with a possibly stupid question.

I would like to ask for some advice about solving the Hamilton-Jacobi-Bellman equation using deep learning. I am trying to solve an HJB equation of a simple neoclassical growth model from macroeconomics. Its HJB equation is

\rho V(k_t) = \frac{c_t^{1-\gamma}}{1-\gamma} + \frac{\partial V}{\partial k} \{A k_t^\alpha - \delta k_t - c_t\}
Control variable in this problem is c_t \geq 0, characterized by following first-order condition.
c_t^{-\gamma} = \frac{\partial V}{\partial k}

Following paper by Fernandéz-Villaverde et. al (2020), I am trying to solve this HJB problem using deep learning. I parameterize the value function V(k_t) and consumption function C(k_t) using feed-forward networks designed to guarantee non-negativity of consumption, and use ADAM to minimize a loss function that includes squared error in HJB equation and FOC over a reasonable interval of state-space.
\mathcal{E}_{HJB} = \rho V(k_t) - \frac{c_t^{1-\gamma}}{1-\gamma} - \frac{\partial V}{\partial k} \{A k_t^\alpha - \delta k_t - c_t\}
\mathcal{E}_{FOC} = c_t^{-\gamma} - \frac{\partial V}{\partial k}
\mathcal{L} = \sum_i {\mathcal{E}_{HJB}}_i + \sum_i {\mathcal{E}_{FOC}}_i

While the minimization procedure converges to \approx 0 loss, instead of getting concave and increasing value function, I got something totally weird.

I checked the result for many different choices of activation functions in my networks, but the problem persists, in all these cases I still got this shape of the value function.

In the paper (page 19), the authors show, that minimization of this loss function composed of HJB and FOC error (should?) converges to a solution of accurate upwind finite-difference scheme. Unfortunately, I am getting something totally different. Should I use some boundary conditions? Unfortunately, I think that there aren’t natural boundary conditions in this problem, other than transversality conditions.

I would be really glad for any insight, especially from
@Tamas_Papp and @jlperla

Best,
Honza

Here is a MWE of my code.

#*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
#*-*-* Solve non-stochastic neoclassical growth model in continuous time *-*
#*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

#(1) Import libraries
using Pkg
Pkg.add("Plots")
Pkg.add("Parameters")
Pkg.add("LinearAlgebra")
Pkg.add("Flux")
Pkg.add("Random")
Pkg.add("Distributions")
Pkg.add("ForwardDiff")
using Plots
using Parameters
using LinearAlgebra
using Flux
using Random
using Distributions
using ForwardDiff
include("FastADAM.jl")
Pkg.add("ZygoteRules")
using ZygoteRules
ZygoteRules.@adjoint function ForwardDiff.Dual{T}(x, ẋ::Tuple) where T
  @assert length(ẋ) == 1
  ForwardDiff.Dual{T}(x, ẋ), ḋ -> (ḋ.partials[1], (ḋ.value,))
end
ZygoteRules.@adjoint ZygoteRules.literal_getproperty(d::ForwardDiff.Dual{T}, ::Val{:partials}) where T =
  d.partials, ṗ -> (ForwardDiff.Dual{T}(ṗ[1], 0),)
ZygoteRules.@adjoint ZygoteRules.literal_getproperty(d::ForwardDiff.Dual{T}, ::Val{:value}) where T =
  d.value, ẋ -> (ForwardDiff.Dual{T}(0, ẋ),)

#(2) Build model structure
function uCRRA(c,γ)
    if(γ==1.0)
        return log(c)
    else
        return (c^(1-γ))/(1-γ)
    end
end
@with_kw struct NCG
    #(A) Structural parameters
    Α::Float32 = 0.5
    α::Float32 = 0.36
    β::Float32 = 1/(1+0.04)
    ρ::Float32 = 1/β-1
    γ::Float32 = 2.0
    δ::Float32 = 0.05
    #(B) Steady-state
    kss::Float32 = (α/(1/β-(1-δ)))^(1/(1-α))
    kl::Float32 = 10^(-4)
    ku::Float32 = 10.0
    #(C) Grid
    ϰ::Int64 = 2500
    𝓚::Array{Float32} = collect(range(kl,ku,length=ϰ))'
    #(D) Approximator
    T::Int64 = 8
    g::Function = swish
end
@with_kw struct Optimizer
  #(C) Networks and ADAM setting
    S1::Float32 = 0.0
    S2::Float32 = 1.0
    ϰ::Int64 = 2500
    ϑ::Int64 = 150
    T::Int64 = 32
    Γ::Float32 = 0.001
    β1::Float32 = 0.9
    β2::Float32 = 0.99
    ϵ::Float32 = 10^(-8)
    ν::Float32 = 10^(-5)
    𝚰::Int64 = 5000000
    𝛀::Float32 = 0.1
    𝒯::Int64 = 1500
    𝓇::Float32 = 0.9
end

#Try swish,gelu,elu,mish,leakyrelu => all leads to the more or less same result
NC1 = NCG(g=swish)
@unpack Α,T,α,β,ρ,γ,δ,kl,ku,𝓚,g = NC1

#(3) Define approximation networks
#(3.1) Define policy network
𝓒 = Chain(Dense(1,T,g),Dense(T,T,g),
Dense(T,T,g),Dense(T,1,softplus))

#(3.3) Value network
𝓥 = Chain(Dense(1,T,g),Dense(T,T,g),
Dense(T,T,g),Dense(T,T,g),Dense(T,1,identity))
d𝓥(t) = ForwardDiff.derivative(x->𝓥([x])[1],t)
𝚹 = Flux.params(𝓒,𝓥)

#(4) Build production function
f(x) = Α*x^α - δ*x

#(4) Build a loss function
function 𝓡(x)
    #HJB Error
    ϵ_hjb = sum((ρ*𝓥(x) - uCRRA.(𝓒(x),γ) - d𝓥.(x).*(f.(x)-𝓒(x))).^2)
    #FOC Error
    ϵ_foc = sum((𝓒(x).^(-γ) - d𝓥.(x)).^2)
    return ϵ_hjb+ϵ_foc
end

Ad1 = Optimizer(Γ=0.001)
@time 𝝝 = Adam(𝓡,𝚹,𝓚,Ad1,1,"YES","NO")

plot(𝓚',𝓥(𝓚)',title="Value Function",xlabel="K_t",ylabel="V(K),",legend=false)
plot(𝓚',d𝓥.(𝓚)',title="V_k(K)",xlabel="K_t",ylabel="V(K)",legend=false)
plot(𝓚',𝓒(𝓚)',title="Policy Function",xlabel="K_t",ylabel="C(K),",legend=false)

For replication, here is my implementation of the ADAM optimizer (FastADAM.jl file). There should’t be any problem, I used it sucessfully for many other problems.

function Adam(𝕱,𝚯,Data,Par,per,show,loc="NO")
    let
        #Initialize momentum estimates
        @unpack ϑ,Γ,β1,β2,ϵ,𝚰,𝛀,𝒯,𝓇 = Par
        n = length(𝚯)
        m = Array{Array{Float32}}(undef,n)
        v = Array{Array{Float32}}(undef,n)
        ∇ = Array{Array{Float32}}(undef,n)
        hm = Array{Array{Float32}}(undef,n)
        hv = Array{Array{Float32}}(undef,n)
        u = Array{Array{Float32}}(undef,n)
        if(loc=="YES")
            𝚴 = Array{Array{Float32}}(undef,n)
            𝕻 = [Array{Array{Float32}}(undef,n)]
            𝕸 = Array{Float32}(undef,1)
            𝖈 = Array{Int64}(undef,1)
            𝕸[1] = 𝕱(Data)
            𝖈[1] = 0
        end
        for j in 1:n
            m[j] = zeros(Float32,length(𝚯[j]))
            v[j] = zeros(Float32,length(𝚯[j]))
            ∇[j] = zeros(Float32,length(𝚯[j]))
        end
        #Main training loop
        for i in 1:𝚰
            #Sample data for stochastic gradient
            𝓓,𝓜 = size(Data)
            if(𝓓==1)
                xg = reshape(sample(Data,ϑ),1,ϑ)
            else
                𝓘 = sample(1:𝓜,ϑ)
                xg = Data[:,𝓘]
            end
            #Take gradient
            ∇𝕱 = Flux.gradient(()->𝕱(xg),𝚯)
            #Update gradient
            for j in 1:n
                #Check duality
                𝖙 = ForwardDiff.partials(∇𝕱[𝚯[j]][1])
                if(length(𝖙)==1)
                    λ(t) = ForwardDiff.value(ForwardDiff.partials(∇𝕱[𝚯[j]][t]))[1]
                    ∇[j] = λ.(1:length(𝚯[j]))
                else
                    ∇[j] = Array{Float32,1}(∇𝕱[𝚯[j]][:])
                end
                #Update momentum estimates
                m[j] = β1.*m[j] .+ (1-β1).*∇[j]
                v[j] = β2.*v[j] .+ (1-β2).*∇[j].^2
                hm[j] = m[j]./(1-β1^(i))
                hv[j] = v[j]./(1-β2^(i))
                #Update parameters
                u[j] = hm[j]./(sqrt.(hv[j]) .+ ϵ)
                𝚯[j] .= 𝚯[j] .- reshape(Γ.*u[j],size(𝚯[j]))
                if(loc=="YES")
                    𝚴[j] = copy(𝚯[j] .- reshape(Γ.*u[j],size(𝚯[j])))
                end
            end
            #Compute loss
            if(i%per==0)
                𝕷 = 𝕱(Data)
                if(show=="YES")
                    #println(i)
                    println(𝕷)
                end
                #Secundary convergence check
                if(loc=="YES")
                    if(𝕷<𝓇*𝕸[1])
                        𝕸[1] = 𝕷
                        𝕻[1] = 𝚴
                        𝖈[1] = 0
                    else
                        𝖈[1] = 𝖈[1] + 1
                    end
                    if(𝖈[1]>=𝒯)
                        for j in 1:length(𝚯)
                            𝚯[j] .= 𝚴[j]
                        end
                        println("Secundary convergence")
                        return 𝚴
                        break
                    end
                end
                if(𝕷<𝛀)
                    println("Convergence ",i)
                    break
                end
            end
        end
        #
        return 𝚯
    end
end

Sorry, I am not familiar with deep learning. As you mention, there aren’t natural boundary conditions for these kind of problems.

In practice, I find that a good initial guess matters a lot. I usually find

@article{den2015exact,
  title={Exact present solution with consistent future approximation: A gridless algorithm to solve stochastic dynamic models},
  author={Den Haan, Wouter J and Kobielarz, Michal L and Rendahl, Pontus},
  year={2015},
  publisher={Centre For Macroeconomics}
}

useful for that purpose. I would start from this and then monitor the solver.

1 Like

@Tamas_Papp Thank you, I will try that! So, from your experience Chebyshev/Spline collocation tend to converge to the right solution of HJB even without imposing boundary conditions?

Yes. Again, with a good initial guess. That’s something I would invest in.

1 Like

@Honza9723 this looks really cool.

You want to restrict the neural network to be
1: monotonically increasing
2: strictly concave
I don’t know how to do this, but if someone else does I’m all ears (@ChrisRackauckas)…

More importantly, you wanna use the state boundary condition:
Let k_{min} be the minimum point in your capital grid.
Your transition function: \dot{k}=Ak^{\alpha} -\delta k -c \Rightarrow c(k,\dot{k}) =Ak^{\alpha} -\delta k -\dot{k}
At k_{min} when \dot{k}=0: c(k_{min},0) =Ak_{min}^{\alpha} -\delta k_{min}
state constraint boundary condition: V_k(k_{min}) \geq u'(c(k_{min},0) ) = u'(Ak_{min}^{\alpha} -\delta k_{min})
I’d love to see

Btw, I solve the non-stochastic NGM using much less fancy FD methods here.

Remark in passing: if you’re solving for the value function directly (rather than the marginal value function) and if you’re going to explore the logarithmic case, you may want to change the utility function to the standard \frac{c^{1-\gamma}-1}{1-\gamma} (i.e. don’t forget the -1).

2 Likes

I’m not sure how to do it other than imposing these charcteristics via sampling in the loss. Developing a neural network which enforces these properties is probably possible though? However, strictly concave might mean the best way to do it is to just impose a multi-quadratic and learn a representation of that.

3 Likes

Maybe you can use the NN to represent the second derivative (then you just need to enforce it to be negative) and integrate it to get the result?

Something is not right…

if V is decreasing, and the error from foc is small, consumption must be negative then. What if you change \gamma to a non-integer, say 2.1? This can force consumption to be positive (otherwise the c^{-\gamma} will yield NaN) and v to be increasing.

I don’t know too much about that paper and deep learning, but here is my general conjecture:

Boundary conditions are crucial in pinning down the exact solution to the PDE. Without proper defined boundary conditions it’s not surprising that you didn’t get anything sensible.

When the PDE is solved using upwinding, it effectively imposes reflective boundaries on the grid, and it can be shown that two boundary inequalities can uniquely determine the viscosity solution to the PDE (see Ben Moll’s note).

Here are some random thoughts that most likely won’t work, but probably worth a try:

I wonder if you could impose something like reflective boundaries using neural network, say force households to consume k down if k is large enough. This may be implicitly imposed already by the production function with depreciation, but since you are also learning the foc so who knows what would happen.

Or probably you could try to approximate the transversality condition, here \lim_{k\rightarrow 0} V'(k)=eps(), with a very large \bar k and small but positive eps(). It might be very inaccurate but at least gives you the right sign of c.

Hi @RangeFu

I tried to bypass this problem by imposing a state-constraint boundary condition that would pin down value function, but this boundary condition made the training process divergent (both ADAM and standard stochastic gradient descent diverged). From my experience, after starting from random (Xavier) initialization whole process is convergend for some time, but before it would reach reasonable loss level, it starts to diverge and then it gets trapped at some high loss level.

@Albert_Zevelev I tried to overcome this problem by employing monotonic network (just use increasing activation functions, and restrict parameters to be positive eg. by applying ReLU to them), but I got exactly the same problem => non-convergence.

I guess that those problems have something to do with the (degenerate) elliptic nature of the time-independent HJB. All the successful algorithms (for example Ben Moll’s implicit upwind FD method) implements some kind of false-transient method rather than directly attacking the original non-linear elliptic equation.

1 Like

Hi everyone!

I recently started using Julia and joined this forum precisely because of this issue. I managed to solve the neoclassical growth model with Julia’s NeuralPDE. I am not sure if I have done this in an efficient manner, but it works.

Here is a link with the code and a brief note: https://github.com/GMellior/NGM_neuralnetwork

While I find that the code above worked for this specific model I have not been able to scale up to more complex versions, say by adding a diffusion in productivity or more state variables.

2 Likes

Yeah this probably needs the adaptive loss functions that we’re working on. Can you add the not working ones to the repo? @kirillzubov @zobot it would be good to as these as testing functions.

2 Likes
  1. @Goosee I was able to use (a simplified version of) your code to solve another non-linear HJB problem w/o any change in neural network tuning!
    I got it to work in minutes (my coding time, not run time)!
    The firm-investment problem is described here.
using NeuralPDE, Flux, ModelingToolkit, GalacticOptim, Optim, DiffEqFlux
using ForwardDiff
import ModelingToolkit: Interval, infimum, supremum

@parameters x # x=K
@variables u(..), s(..) # u(x)=V(K) & s(x) = I(K) policy
Dx   = Differential(x)  # Dx(u(x))  == V_K(K)
δ=0.10; ρ=0.05; z= ρ + δ +.10; Iss = (z-ρ-δ)/(ρ+δ); kss = Iss/δ;
eq = [
        Dx(u(x))-1 - s(x) ~ 0.0, 
        z*x - s(x) - 0.5*s(x)^2 + Dx(u(x))*(s(x) - δ*x) - ρ*u(x) ~ 0.0
        ]
elems = size(eq)[1]
bcs  = [Dx(u(kss)) ~ (1 + δ*kss)]
domains = [x ∈ Interval(kss*0.1,kss*1.5)]
neurons = 20
chain             = [FastChain(FastDense(1,neurons,tanh)
                    ,FastDense(neurons,neurons,softplus)
                    ,FastDense(neurons,1)) for _ in 1:elems]
dx0               = kss*(1.5-0.1)/400
initθ             = map(c -> Float64.(c), DiffEqFlux.initial_params.(chain))
discretization    = PhysicsInformedNN(chain, GridTraining(dx0),init_params =initθ)

@named pde_system = PDESystem(eq,bcs,domains,[x],[u(x),s(x)])
prob = discretize(pde_system,discretization)
const losses = []
cb = function (p,l)
    push!(losses, l)
    if length(losses)%500==0
        println("Current loss after $(length(losses)) iterations: $(losses[end])")
    end
    return false
end
maxit1 = 1200
maxit2 = 4000
res  = GalacticOptim.solve(prob, ADAM(0.06); atol=1e-7, cb = cb, maxiters=maxit1)
prob = remake(prob,u0=res.minimizer)
res  = GalacticOptim.solve(prob,BFGS(); cb = cb, maxiters=maxit2)
phi  = discretization.phi


using Plots
kgrid      = LinRange(kss*0.1,kss*1.5,1_000)
I          = Int.(collect(size(res.minimizer)./elems))
Vfunc(x)   = first(phi[1](x,res.minimizer[1:I[1]]))
Cfunc(x)   = first(phi[2](x,res.minimizer[I[1]+1:end]))

# Plot V
plot(kgrid ,Vfunc.(kgrid),
    line =:dashdot,linewidth=3,label="Neural net",
    legend=:topleft,xlabel = "k",ylabel="V(k)", 
    fg_legend = :false, xtick=0:2:10, ytick=-20:1:-8)
#display(plot!(kgrid,vfunc,label="Finite differences",legend=:topleft))

plot(kgrid ,Cfunc.(kgrid),
    line =:dashdot,linewidth=3,label="Neural net",
    legend=:topleft,xlabel = "k",ylabel="I(k)", 
    fg_legend = :false, xtick=0:2:10, ytick=0.5:0.5:2)
plot!(kgrid ,x -> Iss, lab="I_SS", ylims=(0,1))
#

# Compute HJB error
dVk = ForwardDiff.derivative.(Vfunc, Array(kgrid))
II  = dVk .- 1
HJBerror = z*kgrid .- II .- .5*II.^2 + dVk .*(II-δ*kgrid) - ρ*Vfunc.(kgrid)
plot(kgrid,HJBerror,linewidth=3,ylims=(-1.2e-4,1e-4),
    xlabel = "k",ylabel="HJB error",label="Neural net", fg_legend = :false)
########
display(plot(losses,legend = false, yaxis=:log))

The value function is affine (as it should be)


The policy function is constant (as it should be)

The HJB error

The loss

  1. @Goosee echoing @ChrisRackauckas I really hope you post issues for the problems not working…
    If you want to “scale up to more complex versions, say by adding a diffusion in productivity or more state variables.” I recommend you extend my simple firm investment problem (instead of the NCGM) so you don’t need to worry about things like diving by zero etc…
  2. If you’d like please feel free to add examples to my Github repo
1 Like

Hey! Thanks for sharing this. Of course I can upload the other one. In which repo would be best to upload the case with diffusion, in yours, mine or here or somewhere specific?

1 Like

I am now more inclined to say that the issue is getting the right boundary condition in case of the problem with diffusion. I think that we need an inequality constraint for this case and in general to solve more complex problems. However, I believe that Julia’s NeuralPDE solver cannot handle these yet.

Do you have any suggestion on how we could introduce an additional penalty in the loss, that enters when the controls generate a drift with the wrong sign at the boundaries?

You can impose an inequality constraint through the choice of the neural network architecture. Have you tried that yet?