Maliar, Maliar, and Winant using Flux.jl (I just want to write a custom objective)

Hi all,

I am trying to translate the code that you can find here from Python into Julia. Everything works swimmingly until I get to the end of the code where I have to do the training section.

I am using Flux.jl, and here things are a bit different from what I am used to doing, because what matters in this code is not the direct application of the model, but its application through the residual equations.

Basically, when I try to train the neural net, I do not know how to get Flux.jl to recognize that I want to pick the parameter values of my model so as to minimize my objective, and that the model needs to be applied to both today’s and tomorrow’s states. If I feed the objective in directly as a loss function, Zygote has a fit. Any guidance would be much appreciated.


I’m working on GitHub - DynareJulia/Dynare.jl: A Julia rewrite of Dynare: solving, simulating and estimating DSGE models, and would be interested in adding deep learning. Please let me know of your progress. If you provide access to your code, I could try to see what the problem is.


The code for MMW is on a different computer, but here is a simple example of the problem that I have in Flux, where I am simply training a linear regression.

using Flux, Random, Statistics

n = 10000

function actual_1(x)
     return 3*x + 7
end

model2 = Dense(1 => 1)
X2 = Array{Float32}(undef,2,n)
rand!(X2)
Y2 = [actual_1.(X2[1,:]) actual_1.(X2[2,:])]'

function loss2(model, x, y) 
    pred1 = model(x[1,:]')
    pred2 = model(x[2,:]')
    return mean(abs.(pred1 .- y[1,:]) .+ abs.(pred2 .- y[2,:]))
end


train_data2 = Flux.DataLoader((X2, Y2), batchsize=64, shuffle=true)

opt2 = Flux.setup(Adam(), model2)

for epoch in 1:200
    Flux.train!(loss2, model2, train_data2, opt2)
end

In this case, the neural net for whatever reason is converging to the mean of Y2, whereas it should be converging to the true underlying linear model, which is

Y2 = 3*X2 + 7
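One way to see this is to inspect the trained layer directly (Flux's Dense stores its parameters in the weight and bias fields):

#the fitted slope and intercept; with the loss above they do not end up anywhere near 3 and 7
model2.weight, model2.bias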

In the case of MMW, I have an additional wrinkle in that my loss is basically the entirety of the algorithm.

Edit: forgot to add the n from the other section. The code should run through now.


Following this thread. I don’t have anything to add right now but I’ve always been interested in new model solution methods! :+1:

Ok, here is the code for MMW. It is pretty much verbatim from the linked file. What still needs to be done at the end is converting the objective (the function Draws) into something that Flux.jl can use to train the model, which is applied inside the objective. If I can do that, then I don’t have to worry about the other issue, which is getting Flux to recognize that the model needs to be applied to different parts of the data.

#First we import packages
using Flux, CUDA, Random, Plots, Distributions
#set model parameters
β = 0.9 #common discount rate
γ = 2.0 #CRRA coefficient
rbar = 1.04 #common interest rate
#set AR(1) parameters
#sds
σ_r = 0.001 #idiosyncratic interest rate shocks
σ_p = 0.0001 #permanent component of idiosyncratic income
σ_q = 0.001 #other portion of idiosyncratic income
σ_δ = 0.001 #discount rate shocks
#AR(1) coefficients
ρ_r = 0.2
ρ_p = 0.999
ρ_q = 0.9
ρ_δ = 0.2

#calculate ergodic distributions
σ_e_r = σ_r/sqrt(1-ρ_r^2)
σ_e_p = σ_p/sqrt(1-ρ_p^2)
σ_e_q = σ_q/sqrt(1-ρ_q^2)
σ_e_δ = σ_δ/sqrt(1-ρ_δ^2)

#set the bounds for the wage
w_min = 0.1
w_max = 4.0

#since we have to use KKT, we need to convert the KKT conditions into a single objective using the Fischer-Burmeister function
function fb(a,b)
    return a + b - sqrt(a^2 + b^2)
end

#now we use flux to construct the neural network
model = Chain(
    Dense(5 => 32, relu),
    Dense(32 => 32,relu),
    Dense(32 => 32,relu),
    Dense(32 => 2)) #no activation on the last layer; Dense defaults to the identity

function dr(r,δ,q,p,w)

    #normalize the values of the variables
    r = r/(σ_e_r/2)
    δ = δ/(σ_e_δ/2)
    p = p/(σ_e_p/2)
    q = q/(σ_e_q/2)

    #normalize income to lie between -1 and 1
    w = (w .- w_min)./(w_max - w_min).*2.0 .- 1.0

    #stack the normalized state variables into a matrix (one column per observation)
    s = [r δ q p w]'

    #apply the neural net
    x = model(s)

    #apply activation functions to keep the outputs in their admissible ranges
    ξ = sigmoid.(x[1,:])
    #this keeps the consumption share of cash-in-hand in [0,1]

    h = exp.(x[2,:])

    return (ξ,h)
end

#select number of grid points
N_wage = 100

#create a grid of wages
wages =  range(w_min,w_max,N_wage)

ξ_vec, h_vec = dr(wages*0,wages*0,wages*0,wages*0,wages)

plot(wages,wages.*ξ_vec)
plot!(wages,wages)

function Residuals(e_r,e_δ,e_q,e_p,r,δ,q,p,w)

    #get the length
    n = length(r)

    #get the values of the state today
    ξ, h = dr(r,δ,q,p,w)
    c = ξ.*w

    #transitions
    r_prime = r*ρ_r + e_r
    δ_prime = δ*ρ_δ + e_δ
    p_prime = p*ρ_p + e_p
    q_prime = q*ρ_q + e_q

    #cash-on-hand tomorrow: labor income plus savings times the gross return
    w_prime = exp.(p_prime).*exp.(q_prime) .+ (w.-c).*rbar.*exp.(r_prime)

    ξ_prime, h_prime = dr(r_prime,δ_prime,q_prime,p_prime,w_prime)
    c_prime = ξ_prime.*w_prime

    R1 = β*exp.(δ_prime .- δ).*(c_prime./c).^(-γ).*(rbar.*exp.(r_prime)) .- h
    R2 = fb.(ones(n)-h,ones(n)-ξ)

    return (R1, R2)
end

###Initialize distributions for the draws function
#distributions for the current states
r_dist = Normal(0,σ_e_r)
δ_dist = Normal(0,σ_e_δ)
p_dist = Normal(0,σ_e_p)
q_dist = Normal(0,σ_e_q)
w_dist = Uniform(w_min,w_max)


#distributions for the state tomorrow (1st draw)
er_dist = Normal(0,σ_r)
eδ_dist = Normal(0,σ_δ)
ep_dist = Normal(0,σ_p)
eq_dist = Normal(0,σ_q)



function Draws(n)

    #draw current states
    r = rand(r_dist,n)
    δ = rand(δ_dist,n)
    p = rand(p_dist,n)
    q = rand(q_dist,n)
    w = rand(w_dist,n)

    #first draw of epsilons for tomorrow
    er_1 = rand(er_dist,n)
    eδ_1 = rand(eδ_dist,n)
    ep_1 = rand(ep_dist,n)
    eq_1 = rand(eq_dist,n)
    
    #second draw of epsilons for tomorrow
    er_2 = rand(er_dist,n)
    eδ_2 = rand(eδ_dist,n)
    ep_2 = rand(ep_dist,n)
    eq_2 = rand(eq_dist,n)

    R1_e1, R2_e1 = Residuals(er_1,eδ_1,eq_1,ep_1,r,δ,q,p,w)
    R1_e2, R2_e2 = Residuals(er_2,eδ_2,eq_2,ep_2,r,δ,q,p,w)

    #all-in-one expectation: the product of residuals from two independent draws is an
    #unbiased estimate of the squared conditional expectation of each residual
    return R1_e1.*R1_e2 .+ R2_e1.*R2_e2
end

pars = Flux.params(model)

#test the draws function
Draws(10000)
#try to take gradients using Zygote (this is the call that fails)
Flux.gradient(Draws(1000),pars)
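The last call is where Zygote complains: Draws(1000) is already a vector of residual products rather than a function of the parameters, so there is nothing for gradient to differentiate through. My understanding is that, in the implicit-parameters style used above, it would need to be wrapped in a zero-argument closure that returns a scalar, something like

#presumably a zero-argument closure over a scalar objective, in the implicit-parameters style
Flux.gradient(() -> sum(Draws(1000)), pars)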

@JHall I am struggling with the same issue at the moment. Based on the examples in the Flux documentation (Link), I rewrote your code.

What did the trick for me was making sure that the loss function called during each iteration of train! takes the neural network itself as an argument, so that changes in the parameters affect the loss that is produced.
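In other words, the pattern is roughly the following (a minimal sketch with hypothetical toy_model/toy_loss names and a dummy objective, not the actual MMW residuals):

#minimal sketch of the pattern: the loss takes the model itself,
#so Flux can trace how the parameters enter the objective
using Flux

toy_model = Chain(Dense(5 => 32, relu), Dense(32 => 2))
toy_loss(nn, _...) = sum(abs2, nn(randn(Float32, 5, 8)))  #stand-in for sum(Draws(nn))

opt_state = Flux.setup(Adam(0.01), toy_model)
Flux.train!(toy_loss, toy_model, [(nothing, nothing)], opt_state)  #one dummy batch per call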

#First we import packages
using Flux, CUDA, Random, Plots, Distributions
#set model parameters
β = 0.9 #common discount rate
γ = 2.0 #CRRA coefficient
rbar = 1.04 #common interest rate
#set AR(1) parameters
#sds
σ_r = 0.001 #idiosyncratic interest rate shocks
σ_p = 0.0001 #permanent component of idiosyncratic income
σ_q = 0.001 #other portion of idiosyncratic income
σ_δ = 0.001 #discount rate shocks
#AR(1) coefficients
ρ_r = 0.2
ρ_p = 0.999
ρ_q = 0.9
ρ_δ = 0.2

#calculate ergodic distributions
σ_e_r = σ_r/sqrt(1-ρ_r^2)
σ_e_p = σ_p/sqrt(1-ρ_p^2)
σ_e_q = σ_q/sqrt(1-ρ_q^2)
σ_e_δ = σ_δ/sqrt(1-ρ_δ^2)

#set the bounds for the wage
w_min = 0.1
w_max = 4.0

#since we have to use KKT, we need to convert the KKT conditions into a single objective using the Fischer-Burmeister function
function fb(a,b)
    return a + b - sqrt(a^2 + b^2)
end

#now we use flux to construct the neural network
model = Chain(
    Dense(5 => 32, relu),
    Dense(32 => 32,relu),
    Dense(32 => 32,relu),
    Dense(32 => 2)) #no activation on the last layer; Dense defaults to the identity

function dr(model, r,δ,q,p,w)

    #normalize the values of the variables
    r = r/(σ_e_r/2)
    δ = δ/(σ_e_δ/2)
    p = p/(σ_e_p/2)
    q = q/(σ_e_q/2)

    #normalize income to lie between -1 and 1
    w = (w .- w_min)./(w_max - w_min).*2.0 .- 1.0

    #stack the normalized state variables into a matrix (one column per observation)
    s = [r δ q p w]'

    #apply the neural net
    x = model(s)

    #apply activation functions to keep the outputs in their admissible ranges
    ξ = sigmoid.(x[1,:])
    #this keeps the consumption share of cash-in-hand in [0,1]

    h = exp.(x[2,:])

    return (ξ,h)
end

#select number of grid points
N_wage = 100

#create a grid of wealth levels
wealth =  range(w_min,w_max,N_wage)

ξ_vec, h_vec = dr(model, wealth*0,wealth*0,wealth*0,wealth*0,wealth)

plot(wealth,wealth.*ξ_vec);
plot!(wealth,wealth);
display(plot!(title = "Consumption policy", xlabel = "Wealth", ylabel = "Consumption"));

function Residuals(model, e_r,e_δ,e_q,e_p,r,δ,q,p,w)

    #get the length
    n = length(r)

    #get the values of the state today
    ξ, h = dr(model, r,δ,q,p,w)
    c = ξ.*w

    #transitions
    r_prime = r*ρ_r + e_r
    δ_prime = δ*ρ_δ + e_δ
    p_prime = p*ρ_p + e_p
    q_prime = q*ρ_q + e_q

    #cash-on-hand tomorrow: labor income plus savings times the gross return
    w_prime = exp.(p_prime).*exp.(q_prime) .+ (w.-c).*rbar.*exp.(r_prime)

    ξ_prime, h_prime = dr(model, r_prime,δ_prime,q_prime,p_prime,w_prime)
    c_prime = ξ_prime.*w_prime

    R1 = β*exp.(δ_prime .- δ).*(c_prime./c).^(-γ).*(rbar.*exp.(r_prime)) .- h
    R2 = fb.(ones(n)-h,ones(n)-ξ)

    return (R1, R2)
end

###Initialize distributions for the draws function
#distributions for the current states
r_dist = Normal(0,σ_e_r)
δ_dist = Normal(0,σ_e_δ)
p_dist = Normal(0,σ_e_p)
q_dist = Normal(0,σ_e_q)
w_dist = Uniform(w_min,w_max)


#distributions for the state tomorrow (1st draw)
er_dist = Normal(0,σ_r)
eδ_dist = Normal(0,σ_δ)
ep_dist = Normal(0,σ_p)
eq_dist = Normal(0,σ_q)



function Draws(model, n = 100)
    #draw current states
    r = rand(r_dist,n)
    δ = rand(δ_dist,n)
    p = rand(p_dist,n)
    q = rand(q_dist,n)
    w = rand(w_dist,n)

    #first draw of epsilons for tomorrow
    er_1 = rand(er_dist,n)
    eδ_1 = rand(eδ_dist,n)
    ep_1 = rand(ep_dist,n)
    eq_1 = rand(eq_dist,n)
    
    #second draw of epsilons for tomorrow
    er_2 = rand(er_dist,n)
    eδ_2 = rand(eδ_dist,n)
    ep_2 = rand(ep_dist,n)
    eq_2 = rand(eq_dist,n)

    R1_e1, R2_e1 = Residuals(model, er_1,eδ_1,eq_1,ep_1,r,δ,q,p,w)
    R1_e2, R2_e2 = Residuals(model, er_2,eδ_2,eq_2,ep_2,r,δ,q,p,w)

    #all-in-one expectation: the product of residuals from two independent draws is an
    #unbiased estimate of the squared conditional expectation of each residual
    return R1_e1.*R1_e2 .+ R2_e1.*R2_e2
end


pars = Flux.params(model)

#test the draws function
Draws(model, 10000)

opt = Flux.setup(Flux.Adam(0.01), model);
loss(nn, x, y) = sum(Draws(nn))

# testing the loss function
loss(model, 1, 1)

#create a grid of wealth levels
wealth =  range(w_min,w_max,N_wage)


@time for i in 1:10000
    # training the model
    Flux.train!(loss, model, [(1,1)], opt)   

    if i % 500 == 0
        # print the loss every 500 iterations
        println("Iter: ", i, " with loss ", loss(model, 1, 1))
        
        # plot the current consumption policy
        ξ_vec, h_vec = dr(model, wealth*0,wealth*0,wealth*0,wealth*0,wealth)
        plot(wealth,wealth.*ξ_vec)
        plot!(wealth,wealth)
        display(plot!(title = "Consumption policy", xlabel = "Wealth", ylabel = "Consumption"))
    end 
end


# evaluating the loss function
loss(model, 1, 1)

# plot the computed consumption policy
ξ_vec, h_vec = dr(model, wealth*0,wealth*0,wealth*0,wealth*0,wealth)
plot(wealth,wealth.*ξ_vec);
plot!(wealth,wealth);
display(plot!(title = "Consumption policy", xlabel = "Wealth", ylabel = "Consumption"));

Compared to the Python version, the code seems very slow, which I attribute to my rather ad-hoc implementation. If you want to benchmark the code and look for the bottlenecks, I would be happy to learn from your results!


The issue arises because you take the mean over matrices, not vectors. The first part of your residual,

abs.(pred1 .- y[1,:])

has shape n times n (n being the batch size), since pred1 is a 1×n row and y[1,:] is a length-n vector. The same holds for the sum of the first and second terms inside the mean,

mean(abs.(pred1 .- y[1,:]) .+ abs.(pred2 .- y[2,:]))

Hence, you are not computing the difference between each prediction and its matching “y”, but an n×n matrix of all cross differences between predictions and targets. Minimizing the mean of that matrix pulls the predictions towards the centre of “y”, which is why the network converges to the mean of Y2 rather than to the true linear relationship.
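A quick shape check with small made-up arrays shows the same thing:

#a 1×n row broadcast against a length-n vector gives an n×n matrix
pred = rand(Float32, 1, 5)   #like model(x[1,:]') for a batch of 5
y    = rand(Float32, 5)      #like y[1,:]
size(pred .- y)              #(5, 5): all cross differences
size(pred' .- y)             #(5, 1): elementwise differences, which is what we want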

In other words, train! is faithfully minimizing exactly the loss you gave it. The correct loss function should read something like

function loss2(model, x, y) 
    pred1 = model(x[1,:]')
    pred2 = model(x[2,:]')
    return mean(abs.(pred1' .- y[1,:]) .+ abs.(pred2' .- y[2,:]))
end

Well that is a bit embarrassing, thanks for pointing it out!

There are a number of factors which might contribute to the slowness: use of Float64 inputs, slicing inputs along rows instead of columns, using non-const globals (the various _dists). For an example this large, it’s hard to know up front what the biggest bottlenecks are. To that end, I’d recommend either reducing it (e.g. benchmarking gradients for certain functions in isolation) or profiling it and seeing where the hotspots are.
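As a concrete starting point, a sketch along these lines (BenchmarkTools is an extra dependency; the calls assume the second version of the code, where Draws takes the model as its first argument) would time the residual pass and one gradient evaluation in isolation:

#time the forward/residual pass and one gradient evaluation separately
using BenchmarkTools

#making the distributions const (e.g. const r_dist = Normal(0, σ_e_r)) and drawing
#Float32 values would address the non-const-global and Float64 points above

@btime Draws($model, 1000);                              #residual pass only
@btime Flux.gradient(m -> sum(Draws(m, 1000)), $model);  #one gradient evaluation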