Value Function Iteration on GPU



I am trying to understand how to do the very basic Value Function Iteration in an Overlapping Generations Model using the GPU. So far it works well on the CPU, but I am starting to run into trouble once I increase the number of grid points and make my model more complicated. Here is a simple example below that I would like to re-do to run on GPU. I am using JuliaPro on Windows.
I tried looking at the CUDAnative, GPUarrays, CLArrays (as well as in but I don’t understand where to start exactly and how to translate all the functions and arrays I have so they can run on GPU.
First, I create my module with all the functions and types (apologies it’s quite large):

module mymodule

using QuantEcon: tauchen

export CreateAllParams, create_nextVV, gs_age_J_r, gs_age_j_r, income, Tax

mutable struct SParams

    # demographics
    J :: Int64
    Jret  :: Int64

    # preferences
    beta :: Float64
    sigma  :: Float64
    phi  :: Float64
    gamma  :: Float64
    vartheta  :: Float64
    flatbar  :: Float64
    psi  :: Float64
    omega  :: Float64
    psiu  :: Float64

    # endowments
    rhoz  :: Float64
    sigmaz  :: Float64
    sigmaz0  :: Float64
    xij :: Vector{Float64}
    Theta  :: Float64

    # housing
    deltah  :: Float64
    kappah  :: Float64
    alpha  :: Float64
    psir  :: Float64
    Lbar  :: Float64
    deltadh  :: Float64

    # financial instruments
    qb  :: Float64
    rb  :: Float64
    iota  :: Float64
    rm  :: Float64
    lambdab  :: Float64

    # government
    tau0y  :: Float64
    tau1y  :: Float64
    mbar  :: Float64
    rhoss  :: Float64
    tauh  :: Float64

    # credit conditions
    kappam  :: Float64
    zetam  :: Float64
    lambdam  :: Float64
    lambdapi  :: Float64
    qmort  :: Float64


mutable struct SGrids

    Ht_grid :: Vector{Float64}

    n_ht :: Int64

    Pepsilon :: Matrix{Float64}
    epsilon :: Vector{Float64}

    pension_grid :: Vector{Float64}

    B_grid :: Vector{Float64}

    income_grid :: Matrix{Float64}


mutable struct SArrays

    vvv_opt_r :: Array{Float64}
    policy_opt_r :: Array{Float64}

    g_opt_r ::   Array{Int64}


mutable struct SSim

    population :: Float64
    Weight :: Array{Float64}
    A :: Array{Int64}
    Y :: Array{Int64}
    C :: Array{Float64}
    Htilde :: Array{Int64}

    Aprime :: Array{Int64}
    Htildeprime :: Array{Int64}
    Adead :: Array{Int64}

    G_r :: Array{Bool}
    Inc :: Array{Float64}


mutable struct SAll

    p :: SParams
    g :: SGrids
    a :: SArrays
    s :: SSim


function FillParams(rb = 0.03, zetam = 0.15, price=1.0, J = 60, Jret=44, beta=0.964, sigma=2.0, phi=0.16, gamma=0.8, vartheta=0.0, flatbar=7.7, psi=5.0,omega=1.015,psiu=5.0,rhoz=0.75,sigmaz=0.08,sigmaz0=0.04,xij=zeros(44),Theta=2.0,deltah=0.015,kappah=0.07,alpha=0.6,psir=0.008,Lbar=0.0,deltadh=0.22,qb=0.0,iota=0.33,rm=0.0,lambdab=0.2,tau0y=0.75,tau1y=0.151,mbar=20.0,rhoss=0.6,tauh=0.01,kappam=0.04,lambdam=0.95,lambdapi=0.4,qmort=0.0)

    p = SParams(J, Jret, beta, sigma, phi, gamma, vartheta, flatbar, psi,omega,psiu,rhoz,sigmaz,sigmaz0,xij,Theta,deltah,kappah,alpha,psir,Lbar,deltadh,qb,rb,iota,rm,lambdab,tau0y,tau1y,mbar,rhoss,tauh,kappam,zetam,lambdam,lambdapi,qmort)
    p.vartheta = 1/p.sigma
    p.qb = 1/(p.rb+1)
    p.rm = p.rb*(1+p.iota)
    for jjj = 1:p.Jret
        p.xij[jjj] = 0.4078515 + 0.0668822*jjj - 0.0012752*(jjj)^2;
        #p.xij[jjj] = 0.6078515 + 0.0668822*jjj - 0.0012752*(jjj)^2;
    p.qmort = 1-p.zetam
    p.Lbar = 0.05*p.Theta/((p.Theta/(p.alpha*price))^(1/(1-p.alpha)))

    return p


function income(j, idshock, p :: SParams)
    out = p.Theta*p.xij[j]*exp.(idshock);
    return out

function pension(idshock, p :: SParams)
    out = p.rhoss*income(p.Jret,idshock,p)
    return out

function utility(c,s, p :: SParams)
    out = (((1-p.phi)*c^(1-p.gamma)+p.phi*s^(1-p.gamma))^((1-p.vartheta)/(1-p.gamma))-1)/(1-p.vartheta);
    return out

function bequest(flat, p :: SParams)
    out = p.psi*((flat+p.flatbar)^(1-p.vartheta)-1)/(1-p.vartheta);
    return out

function Tax(y,m, p :: SParams)
    mded = min(m,p.mbar);
    T = p.tau0y*(y-p.rm*mded)^(1-p.tau1y);
    return T

function value_age_J_r(bprime, htilde, b, y, current_rent, p :: SParams)
    c = b + y - Tax(y,0,p)-p.qb*bprime-current_rent*htilde;
    flat = bprime;
    s = htilde;
    if c <= 0
        return -Inf
    elseif bprime < 0
        return -Inf
        vv = utility(c,s,p) + p.beta*bequest(flat,p);
        return vv

function value_r(j, bprime, htilde, b, y, current_rent, nextVV, p :: SParams)
    c = b + y - Tax(y,0,p) - p.qb*bprime - current_rent*htilde;
    s = htilde;
    if c <= 0
        return -Inf
    elseif bprime < 0
        return -Inf
        vv = utility(c,s,p) + p.beta*nextVV;
        return vv

function create_income_grid(n_z, epsilon, pension_grid, p :: SParams)

    income_grid = zeros(Float64,n_z,p.J)
    for j = 1:p.Jret
        for i_z = 1:n_z
            income_grid[i_z,j] = income(j,epsilon[i_z],p)
    for j = p.Jret+1:p.J
        for i_z = 1:n_z
            income_grid[i_z,j] = pension_grid[i_z]

    return income_grid


function create_nextVV(j, i_z, n_b, Pepsilon, vvv_opt_r, p :: SParams)

    VVVn = zeros(n_b)

    if j >= p.Jret
        VVVn = vvv_opt_r[i_z,:,j+1]
    elseif j == p.Jret
        VVVn = vvv_opt_r[i_z,:,j+1]
        VVVn = (Pepsilon[i_z,:]'*vvv_opt_r[:,:,j+1])'
    return VVVn

function construct_grids(n_z :: Int64, n_b :: Int64, p :: SParams)

    Ht_grid = [1.17; 1.50; 1.92]

    n_ht = length(Ht_grid)

    mc = tauchen(n_z, p.rhoz, p.sigmaz, 0, 2)
    Pepsilon = mc.p
    epsilon = mc.state_values

    pension_grid = zeros(n_z)

    B_grid = linspace(0,10,n_b)

    for i_z = 1:n_z
        pension_grid[i_z] = pension(epsilon[i_z],p);

    income_grid = create_income_grid(n_z, epsilon, pension_grid, p)

    grids = SGrids(Ht_grid, n_ht, Pepsilon, epsilon, pension_grid, B_grid, income_grid)

    return grids


function create_empty_arrays(n_z :: Int64, n_b :: Int64, p :: SParams)

    vvv_opt_r    =          Array{Float64}(n_z,n_b,p.J)
    policy_opt_r =          Array{Float64}(3,n_z,n_b,p.J)

    g_opt_r =               Array{Int64}(n_z,n_b,p.J)

    arrays = SArrays(vvv_opt_r, policy_opt_r, g_opt_r)
    return arrays


function create_sim(N,J)

    population              = N*J
    Weight                  = 1/population*ones(N,J)

    A                       = zeros(Int64, N,J)
    Y                       = zeros(Int64, N,J)
    C                       = zeros( N,J)

    Htilde                  = zeros(Int64, N,J)

    Aprime                  = zeros(Int64, N,J)
    Htildeprime             = zeros(Int64, N,J)
    Adead                   = zeros(Int64, N,J)

    G_r                     = zeros(Bool, N,J)
    Inc                     = zeros(N,J)

    sims = SSim(population,Weight,A,Y,C,Htilde,Aprime,Htildeprime,Adead,G_r,Inc)

    return sims


function gs_age_J_r(b :: Float64, y :: Float64, B_grid, n_b, Ht_grid, n_ht, current_rent, p :: SParams)

    vvv_r = zeros(n_ht, n_b)
    policy_r = zeros(3)

    for i_ht = 1:n_ht
        for i_b = 1:n_b
            vvv_r[i_ht, i_b] = value_age_J_r(B_grid[i_b],Ht_grid[i_ht],b,y,current_rent,p)
    (vvv_opt_r, idx) = findmax(vvv_r)
    (idxh, idxb) = ind2sub(vvv_r, idx)
    policy_r[1] = idxb
    policy_r[2] = idxh
    policy_r[3] = b + y - Tax(y,0,p) - p.qb*B_grid[idxb]-current_rent*Ht_grid[idxh]

    return vvv_opt_r, policy_r


function gs_age_j_r(j :: Int64, b :: Float64, y :: Float64, B_grid, n_b, Ht_grid, n_ht, current_rent, VVVn :: Array{Float64,1}, p :: SParams)

    vvv_r = zeros(n_ht, n_b)
    policy_r = zeros(3)
    for i_ht = 1:n_ht
        for i_b = 1:n_b
            vvv_r[i_ht, i_b] = value_r(j, B_grid[i_b],Ht_grid[i_ht],b,y,current_rent,VVVn[i_b],p)
    (vvv_opt_r, idx) = findmax(vvv_r)
    (idxh, idxb) = ind2sub(vvv_r, idx)
    policy_r[1] = idxb
    policy_r[2] = idxh
    policy_r[3] = b + y - Tax(y,0,p) - p.qb*B_grid[idxb]-current_rent*Ht_grid[idxh]

    return vvv_opt_r, policy_r


function CreateAllParams(n_z, n_b, N)

    params = FillParams()
    grids = construct_grids(n_z, n_b, params)
    arrays = create_empty_arrays(n_z, n_b, params)
    sims = create_sim(N, params.J)

    alloutput = SAll(params,grids,arrays,sims)

    return alloutput



Then I run the main file

using mymodule

n_b = 80
n_z = 3
n_ht = 3
N = 100

B = CreateAllParams(n_z, n_b, N)

j = 60
for i_z in 1:n_z, i_b in 1:n_b
    B.a.vvv_opt_r[i_z,i_b,j], B.a.policy_opt_r[:,i_z,i_b,j] = gs_age_J_r(B.g.B_grid[i_b],B.g.pension_grid[i_z], B.g.B_grid, n_b, B.g.Ht_grid, B.g.n_ht, 0.1, B.p)

for j =59:-1:1
    for i_z in 1:n_z, i_b in 1:n_b
        VVVn = create_nextVV(j, i_z, n_b, B.g.Pepsilon, B.a.vvv_opt_r, B.p)
        B.a.vvv_opt_r[i_z,i_b,j], B.a.policy_opt_r[:,i_z,i_b,j] = gs_age_j_r(j, B.g.B_grid[i_b], B.g.income_grid[i_z,j], B.g.B_grid, n_b, B.g.Ht_grid, n_ht, 0.1, VVVn, B.p)

Essentially, for each value of j I would like to somehow send the loops for i_z in 1:n_z, i_b in 1:n_b to gpu for parallel computation (as the operation on each grid point is quite simple), collect the results and move to next j.

Thanks for help!


If you want to use any of the array-based abstractions (CLArrays, CuArrays) you’ll need to rethink your program in terms of vectorized function calls, eg. broadcasting the gs_age_j_r over your input arrays. If the structure of your loop does not allow that, you can look at CUDAnative to write kernels where you use/convert the thread indexing intrinsics (threadidx, blockdim, blockidx, griddim) to appropriate indices for your algorithm. Either way, you’ll need to get rod of the loops and express them in terms of the model of parallelism you choose.


Thanks for the reply @maleadt. Let me see if I understand correctly. So function gs_age_j_r, for example, accepts as argumets individual points from the grid (for example b and y are taken from the grid). You’re saying that I need to find a way to calculate that function where I supply all the grid at once (in this case B_grid and income_grid instead of b and y)? Is that correct?

And do you happen to have some simple example on how to write kernels in CUDAnative, except for the one in the documentation?



Yes. The broadcasting machinery will apply the underlying function to scalar elements just as it happens now, but with the parallelism encoded in the vectorized application. CuArrays can work with this, and launch kernels appropriately.

Look for books on GPU/CUDA programming in general, CUDAnative isn’t that much different in terms of abstraction. Working at this level is much more bothersome though, but gives you flexibility to express parallelism outside of the boundaries set by map/broadcast.


hey @myroslav not sure if you ever continued on this, but have a look at that:

the kernel is written for you :slight_smile:

let me know if you do more on this, I wanted to spend some time on this as well. but then… :wink: