Directly modifying partials in ForwardDiff

Hi all,

I am currently trying to optimize a package I’ve written to model stellar evolution. Just in case, let me give some context, as there might be a much better way to do what I’m trying to achieve. Stellar evolution is studied (almost always) as a one-dimensional, spherically symmetric problem. At each cell we have a set of properties, such as the temperature and pressure at a given mass coordinate, with one differential equation for each. Describing it very loosely, the usual way this problem is treated is by collecting all these variables into a single vector; for a simple case with just temperature and pressure, for instance, we have

\vec{x}=(T_1, P_1, T_2, P_2,\dots, T_n, P_n)

where n is the number of radial cells. The set of equations is then written as a vector function,

\vec{f}(\vec{x})=(f_{T,1}(\vec{x}),f_{P,1}(\vec{x}), f_{T,2}(\vec{x}),f_{P,2}(\vec{x}),\dots,f_{T,n}(\vec{x}),f_{P,n}(\vec{x}))

where a solution requires \vec{f}(\vec{x})=\vec{0}. Given an initial guess for \vec{x}, the problem is then solved using a Newton-Raphson solver, which requires the calculation of the Jacobian of \vec{f}. Each of the differential equations only depends on its neighboring cells, so the Jacobian has a nice block tridiagonal structure (such as the unrealistically small example below with 5 cells).

J=\begin{pmatrix} D_1 & U_1 & & & \\ L_2 & D_2 & U_2 & & \\ & L_3 & D_3 & U_3 & \\ & & L_4 & D_4 & U_4 \\ & & & L_5 & D_5 \end{pmatrix}
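Schematically, the iteration I need looks like this (an illustrative sketch only, not the actual package code; f! and jac! are placeholders for the residual and Jacobian evaluations):

using LinearAlgebra

# Schematic Newton-Raphson loop (placeholder names, not the real solver):
# f! fills the residual vector fvec in place, jac! fills the Jacobian J in place.
function newton_solve!(x, f!, jac!, fvec, J; tol=1e-10, maxiter=50)
    for _ in 1:maxiter
        f!(fvec, x)
        jac!(J, x)
        Δx = J \ -fvec      # the block tridiagonal structure keeps this solve cheap
        x .+= Δx
        norm(Δx) < tol && break
    end
    return x
end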

The way I am approaching this right now is to compute each row of blocks in the Jacobian on its own using ForwardDiff. So I compute all differential equations corresponding to a cell, together with their partials with respect to the variables in the current and neighboring cells (the edges are treated slightly differently).

This is working pretty nicely, and I have already managed to perform some simple simulations. But there are two things bothering me:

  • Various intermediate results, such as nuclear reaction rates, depend only on local properties. Because of that it is wasteful to compute their partial derivatives with respect to the properties of neighboring cells.
  • While computing each row of blocks in the Jacobian, I have to compute some of these intrinsic properties for the current and neighboring cells. When I compute the following row of blocks, I could reuse all that information, but right now it just goes to waste and I recompute everything. So a heavy part of the code is computed three times for the sake of code simplicity.

One solution I see for this would be that, before I use ForwardDiff.jacobian, I pre-compute all of these intrinsic properties with ForwardDiff, including only the variables I know they depend on. While computing the Jacobian, I could then load these pre-computed values into the partials field of a dual vector and perform the rest of the calculation normally. But trying to do that, I found out I cannot simply edit the values of an instance of ForwardDiff.Partials. Is there any simple way to directly modify a dual object to specify its partials inside a function where a Jacobian is being computed with ForwardDiff?

Note that the differential equation and nonlinear solvers will handle the details automatically if you pass in the sparsity pattern. See:

I haven’t written a similar tutorial on NonlinearSolve.jl yet, but it’ll also automatically specialize the Jacobian calculation on the sparsity pattern in the same way.
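Something along these lines (a sketch; the residual, the sparsity pattern, and the initial guess are placeholders, using the jac_prototype keyword of NonlinearFunction):

using NonlinearSolve, SparseArrays

# Sketch: hand the solver the sparsity pattern through jac_prototype so it can
# specialize the Jacobian computation (residual!, pattern, and guess are placeholders).
residual!(du, u, p) = (du .= u .^ 2 .- p; nothing)
pattern = spdiagm(0 => ones(4))                    # stand-in for the real block tridiagonal pattern
nf = NonlinearFunction(residual!; jac_prototype = pattern)
prob = NonlinearProblem(nf, ones(4), 2.0)
sol = solve(prob, NewtonRaphson())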

If you want to do this more manually, the color differentiation in SparseDiffTools.jl (GitHub - JuliaDiff/SparseDiffTools.jl: Fast jacobian computation through sparsity exploitation and matrix coloring) is precisely that specialization. But of course NonlinearSolve and DifferentialEquations.jl are doing a few more sparsity tricks than just that part, so I’d just recommend passing the solver the sparsity pattern.

The optimal fusion of the directional derivatives is precisely the result of graph coloring on the row graph, and maximum(colorvec) is the minimum number of directional derivatives required to compute the full set.
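Concretely, something like this (a sketch; the pattern below is just a stand-in for your block structure):

using LinearAlgebra, SparseArrays, SparseDiffTools

# Sketch: build a block tridiagonal sparsity pattern (nz = 4, nvars = 2), color it,
# and read off the number of directional derivatives needed from maximum(colorvec).
T = Tridiagonal(ones(3), ones(4), ones(3))   # scalar tridiagonal with 4 diagonal entries
S = sparse(kron(T, ones(2, 2)))              # expand each entry into a 2×2 block
colorvec = matrix_colors(S)
maximum(colorvec)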

Indeed I’m trying to do things more manually, partly because I want these bits to be readable by students without hiding the core of the solver (for educational purposes), but also partly because I want to be sure I have fine-grained control over some things (both the calculation of the Jacobian and the Newton-Raphson iterations). I do realize I might be trying to reinvent the wheel here.

In particular, I thought the optimization I wanted to do was not really possible with coloring. To try to be more concrete, here is an MWE (not as minimal as I’d like though, sorry if it’s a bit too much). It just computes a set of equations that give a block tridiagonal Jacobian filled with ones.

using BlockBandedMatrices, ForwardDiff, Base.Threads

nz = 4 # number of zones
nvars = 2 # number of variables

## Create a block-tridiagonal matrix to store the Jacobian
l, u = 1, 1  # block bandwidths
N = M = nz  # number of row/column blocks
cols = rows = [nvars for i = 1:N]  # block sizes
jacobian = BlockBandedMatrix(Zeros(sum(rows), sum(cols)), rows, cols, (l, u))
# This looks like this
# 4×4-blocked 8×8 BlockBandedMatrix{Float64}:
#  0.0  0.0  │  0.0  0.0  │   ⋅    ⋅   │   ⋅    ⋅ 
#  0.0  0.0  │  0.0  0.0  │   ⋅    ⋅   │   ⋅    ⋅ 
#  ──────────┼────────────┼────────────┼──────────
#  0.0  0.0  │  0.0  0.0  │  0.0  0.0  │   ⋅    ⋅ 
#  0.0  0.0  │  0.0  0.0  │  0.0  0.0  │   ⋅    ⋅ 
#  ──────────┼────────────┼────────────┼──────────
#   ⋅    ⋅   │  0.0  0.0  │  0.0  0.0  │  0.0  0.0
#   ⋅    ⋅   │  0.0  0.0  │  0.0  0.0  │  0.0  0.0
#  ──────────┼────────────┼────────────┼──────────
#   ⋅    ⋅   │   ⋅    ⋅   │  0.0  0.0  │  0.0  0.0
#   ⋅    ⋅   │   ⋅    ⋅   │  0.0  0.0  │  0.0  0.0

## Make a mock set of equations to be solved for each row
function eval_cell_eqs!(y, x, k, nz)
    # k is the cell for which I am computing the equations
    # y = [y_1,y_2] are the discretized equations I want to solve.
    # x are the independent variables in the current and neighboring cells,
    # for example for k=2 I would have [T1,P1,T2,P2,T3,P3]. For edge cases
    # I only have the relevant subset of variables, eg for k=1, x=[T1,P1,T2,P2]
    if (k==1 || k==nz)
        y[1] = x[1]+x[2]+x[3]+x[4]
        y[2] = x[1]+x[2]+x[3]+x[4]
    else
        y[1] = x[1]+x[2]+x[3]+x[4]+x[5]+x[6]
        y[2] = x[1]+x[2]+x[3]+x[4]+x[5]+x[6]
    end
end

## Create a function that computes a single row of the jacobian in-place
function eval_jacobian_row!(xvalues, yvalues, jacobian, k, nz, nvars)
    # we construct a view for the segment of xvalues and yvalues we need for cell k
    # we do the same for the jacobian
    ki = 0
    kf = 0
    if k==1
        ki = nvars*(k-1)+1
        kf = nvars*(k+1)
    elseif k==nz
        ki = nvars*(k-2)+1
        kf = nvars*(k)
    else
        ki = nvars*(k-2)+1
        kf = nvars*(k+1)
    end

    jac_view = view(jacobian,nvars*(k-1)+1:nvars*k,ki:kf)
    yvalues_view = view(yvalues, nvars*(k-1)+1:nvars*k)
    xvalues_view = view(xvalues, ki:kf)

    ForwardDiff.jacobian!(jac_view, (y,x)->eval_cell_eqs!(y, x, k, nz), 
                        yvalues_view, xvalues_view)
end

## This function parallelizes the calculation of each row
function eval_jacobian!(xvalues, yvalues, jacobian, nz, nvars)
    Threads.@threads for k in 1:nz
        eval_jacobian_row!(xvalues, yvalues, jacobian, k, nz, nvars)
    end
end

## initialize xvalues, yvalues, and compute the Jacobian
xvalues = rand(nz*nvars)
yvalues = zeros(nz*nvars)
eval_jacobian!(xvalues, yvalues, jacobian, nz, nvars)
jacobian
# This one looks like this:
# 4×4-blocked 8×8 BlockBandedMatrix{Float64}:
#  1.0  1.0  │  1.0  1.0  │   ⋅    ⋅   │   ⋅    ⋅ 
#  1.0  1.0  │  1.0  1.0  │   ⋅    ⋅   │   ⋅    ⋅ 
#  ──────────┼────────────┼────────────┼──────────
#  1.0  1.0  │  1.0  1.0  │  1.0  1.0  │   ⋅    ⋅ 
#  1.0  1.0  │  1.0  1.0  │  1.0  1.0  │   ⋅    ⋅ 
#  ──────────┼────────────┼────────────┼──────────
#   ⋅    ⋅   │  1.0  1.0  │  1.0  1.0  │  1.0  1.0
#   ⋅    ⋅   │  1.0  1.0  │  1.0  1.0  │  1.0  1.0
#  ──────────┼────────────┼────────────┼──────────
#   ⋅    ⋅   │   ⋅    ⋅   │  1.0  1.0  │  1.0  1.0
#   ⋅    ⋅   │   ⋅    ⋅   │  1.0  1.0  │  1.0  1.0

Now the problem I have is that in the evaluation of the equations I need to call some expensive functions, i.e.:

## Make a mock set of equations to be solved for each row
function eval_cell_eqs!(y, x, k, nz)
    # k is the cell for which I am computing the equations
    # y = [y_(1,k),y_(2,k)] are the discretized equations I want to solve.
    # x are the independent variables in the current and neighboring cells,
    # for example for k=2 I would have [T1,P1,T2,P2,T3,P3]. For edge cases
    # I only have the relevant subset of variables, eg for k=1, x=[T1,P1,T2,P2]
    if (k==1 || k==nz)
        expensive1 = expensive(x[1],x[2])
        expensive2 = expensive(x[3],x[4])
        y[1] = expensive1+expensive2
        y[2] = expensive1+expensive2
    else
        expensive1 = expensive(x[1],x[2])
        expensive2 = expensive(x[3],x[4])
        expensive3 = expensive(x[5],x[6])
        y[1] = expensive1+expensive2+expensive3
        y[2] = expensive1+expensive2+expensive3
    end
end

function expensive(x1,x2)
    return sin(x1)+cos(x2) #not really expensive
end

In this case I want to prevent the expensive function from being called just to compute the partial derivatives with respect to unrelated variables. My understanding was that coloring would not help me here, as the sparsity pattern is the same. So my plan was to precompute these quantities using ForwardDiff and then manually put their values into Dual numbers (for which I’d use PreallocationTools to save on allocations).
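In other words, something like this (a hypothetical sketch; val, dval_dT and dval_dP stand for quantities I would have precomputed elsewhere):

using ForwardDiff

# Hypothetical sketch: an intrinsic quantity of cell k depends only on (T_k, P_k),
# so its value and two partials are precomputed separately...
val, dval_dT, dval_dP = 1.0, 0.1, 0.2

# ...and then seeded into a Dual carrying 6 partials (T and P for cells k-1, k, k+1),
# with zeros for the variables it does not depend on. Partials wraps an immutable
# tuple, so the Dual has to be rebuilt rather than edited in place.
seeded = ForwardDiff.Dual(val,
    ForwardDiff.Partials((0.0, 0.0, dval_dT, dval_dP, 0.0, 0.0)))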

With ForwardDiff and setting up the color vectors, you have 4 colors, and so you can use a single call to ForwardDiff to calculate the whole Jacobian while only evaluating the expensive primal once (and its derivative 4 times, but on the vector functions). It’s not exactly optimal, there are exactly 4 zeroed spots calculated unnecessarily, but if I’m understanding correctly that does about as well as the optimization that you’re describing, and the only improvement I can see would be going to the D* algorithm.

In any case, you can just check how the partials chunking is done here https://github.com/JuliaDiff/SparseDiffTools.jl/blob/master/src/differentiation/compute_jacobian_ad.jl#L44-L46 to see how you’d do that by hand.

I’ll dig into those lines, thanks for the pointer!

I also tried out a similar example with SparseDiffTools, where the function here produces the same block tridiagonal structure filled with ones.

using BlockBandedMatrices, ForwardDiff
function f(y,x,nz,nvars)
  for i in 1:nz
    for l in 1:nvars
        index = (i-1)*nvars
        y[index+l] = 0
        if i==1
            for j in 1:2*nvars
                y[index+l] = y[index+l] + x[j]
            end
        elseif i==nz
            for j in 0:2*nvars-1
                y[index+l] = y[index+l] + x[end-j]
            end
        else
            for j in (i-2)*nvars+1:(i+1)*nvars
                y[index+l] = y[index+l] + x[j]
            end
        end
    end
  end
  nothing
end

## construct sparse jacobian matrix
#using Symbolics
nz = 4 # number of zones
nvars = 2 # number of variables
input = rand(nz*nvars)
output = similar(input)
#sparsity_pattern = Symbolics.jacobian_sparsity((y,x)->f(y,x,nz,nvars),output,input)
#jac = Float64.(sparsity_pattern)
# Create a block-tridiagonal matrix to store the Jacobian
l, u = 1, 1  # block bandwidths
N = M = nz  # number of row/column blocks
cols = rows = [nvars for i = 1:N]  # block sizes
jacobian = BlockBandedMatrix(Zeros(sum(rows), sum(cols)), rows, cols, (l, u))

## get colorvec of jac
using SparseDiffTools
colors = matrix_colors(jacobian)

## compute jacobian
equ = (y,x)->f(y,x,nz,nvars)
using BenchmarkTools
@benchmark forwarddiff_color_jacobian!($jacobian, $equ, $input, colorvec = $colors)

This does perform a bit better (~30% faster) compared to my example above when run serially, but with the way I parallelize over each row of blocks I can outperform it using a few cores. Though I did not try to optimize the SparseDiffTools example much, so maybe it could do much better.

Digging a bit more I did find a (bit dirty) way to perform the optimization I was thinking of, by making use of PreallocationTools.jl. I can precompute the intermediate quantities and their derivatives (only with respect to the relevant variables), copy the results into the right place of the du and dual_du fields of a DiffCache, and then load them up with get_tmp inside the Jacobian calculation. The only problem is that I cannot use chunking in the Jacobian calculation, which is not very good for memory usage when I have a large number of variables. And in that regard, I can’t manage to find a way to get ForwardDiff.jacobian! to work without allocations. In particular, if I force it to use a chunk size equal to the number of derivatives, it ends up allocating as much memory as the Jacobian itself. Like the example below, which just gives a square Jacobian filled with ones:

using ForwardDiff, BenchmarkTools
nvars = 50
xvalues = rand(nvars)
yvalues = similar(xvalues)
jacobian = zeros(nvars,nvars)
function f(y,x)
    fill!(y, 0) # make sure the output buffer starts from zero before accumulating
    for i in eachindex(y)
        for j in eachindex(x)
            y[i] += x[j]
        end
    end
    nothing
end

jac_cfg = ForwardDiff.JacobianConfig(f, yvalues, xvalues, ForwardDiff.Chunk{nvars}())

@benchmark ForwardDiff.jacobian!(jacobian, f,  yvalues, xvalues, jac_cfg) 
@show Base.summarysize(jacobian)

This one always gives me one big allocation of about the size of the Jacobian, while I thought using jacobian! would make the operation in-place and prevent that.
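For context, the PreallocationTools pattern I mentioned looks roughly like this (a minimal sketch, not my actual code):

using PreallocationTools, ForwardDiff

# Minimal sketch of the DiffCache/get_tmp pattern: the cache hands back a Float64
# buffer for a plain evaluation and a Dual buffer inside the Jacobian call.
cache = DiffCache(zeros(4))

function g!(y, x, cache)
    tmp = get_tmp(cache, x)   # buffer with an element type matching x
    tmp .= 2 .* x             # stand-in for precomputed intermediate quantities
    y .= tmp .+ x
    nothing
end

x0 = rand(4)
y0 = similar(x0)
J = ForwardDiff.jacobian((y, x) -> g!(y, x, cache), y0, x0)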

How big is your function? FastDifferentiation.jl is good at sparse Jacobians, so long as the number of operations in your system isn’t too big (and so long as you don’t have to differentiate through conditionals on input variables).

Conditionals on input variables are quite common in my use case (though I see there is experimental support being developed). I’m not sure what counts as a large number of operations in this context, though; at least the example shown is far less complex than what I’m actually using. In any case this might be applicable as a fast method in some cases with simplified input physics, but I would still need to rely on something else for the more general case.

“Large” depends on the derivative. It’s more a limitation of the LLVM compiler than of FD. If your Jacobian has more than, say, 300k non-constant terms then LLVM compilation time can be long.

If your Jacobian is constant then you should be able to go to dense Jacobians at least 10,000x10,000 and still have LLVM compilation times on the order of a few seconds. You should be able to go to similar sparse matrix sizes, i.e., 1e8 non-zero elements, with similar timings.

The optimizations that make this possible are on a dev branch which I’ll be releasing in a few days.

I have a few example problems that go this big which I’ll be benchmarking to determine the FD symbolic preprocessing time. These timings also in a few days.

Conditionals are next on the queue. Should have a better idea how long this will take in a couple of weeks.

FD seems to be really fast, at least for the toy problem I have here. The example below produces the block tridiagonal structure I have mentioned before; here there are nz blocks on the diagonal and each block is of size nvars×nvars.

##
using FastDifferentiation, BenchmarkTools
function f(x::Vector{<:TT},i,l,nz,nvars) where{TT}
    y::TT = 0
    if i==1
        for j in 1:2*nvars
            y = y + x[j]
        end
    elseif i==nz
        for j in 0:2*nvars-1
            y = y + x[end-j]
        end
    else
        for j in (i-2)*nvars+1:(i+1)*nvars
            y = y + x[j]
        end
    end
    return y
end
nz = 20
nvars = 20
xvec = make_variables(:x, nz*nvars)
functions = Vector{FastDifferentiation.Node}(undef,nz*nvars)
for i in 1:nz
    for l in 1:nvars
        functions[(i-1)*nvars+l] = f(xvec,i,l,nz,nvars)
    end
end

##
@time jac = jacobian(functions,xvec)

##
res = similar(jac, Float64)
fast_jac = make_function(jac,xvec)

##
xvalues = rand(nz*nvars)
fast_jac(xvalues, res)
@benchmark fast_jac($xvalues, $res)

In this case the function just returns a matrix filled with ones in the right places, but performance is pretty similar if I actually do a small operation with a non-trivial derivative. After the initial few-second compilation cost, this comes out about 20 times faster than the example above with SparseDiffTools. Making the Jacobian sparse (just using sparse_jacobian) gets me stuck at fast_jac, I presume due to compilation. With the dense Jacobian I do end up getting about 1.22 MiB in allocations; I don’t know if there’s a way to drop that to near zero.

A more realistic situation for my problem is at least nz=1000 and nvars=20, which means a jacobian of size (nz*nvars)^2 with about 3*nz*nvars^2 non-zero entries. So compilation time might be a problem. I could instead do something similar to what I tried initially, computing each row of blocks independently. The function applied for each is the same, restricted to a subset of the data, so I would only need 3 compiled jacobian functions (including the edge cases).
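Roughly what I have in mind for an interior row of blocks (a sketch, with a trivial stand-in for the real equations):

using FastDifferentiation

# Sketch: one reusable Jacobian function for an interior row of blocks, built only
# over the 3*nvars variables of cells k-1, k and k+1 (placeholder equations).
nvars = 2
xloc = make_variables(:xloc, 3 * nvars)
row_eqs = [sum(xloc) for _ in 1:nvars]       # placeholder for the real f_{T,k}, f_{P,k}
row_jac = jacobian(row_eqs, xloc)
row_jac! = make_function(row_jac, xloc, in_place=true)

buf = similar(row_jac, Float64)              # an nvars × 3*nvars block row
row_jac!(buf, rand(3 * nvars))               # reused for every interior cell k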

So this looks like a really interesting option once conditionals are supported. Would need to see how it behaves in the more complex scenario I will apply it to. Some of the operations that will be performed on the input vector involve things like interpolating over pre-loaded tables and running some root solvers or other types of iterative calculations.

I modified your code to use the most efficient options for FD:

using FastDifferentiation
using BenchmarkTools

function f(x::Vector{<:TT}, i, l, nz, nvars) where {TT}
    y::TT = 0
    if i == 1
        for j in 1:2*nvars
            y = y + x[j]
        end
    elseif i == nz
        for j in 0:2*nvars-1
            y = y + x[end-j]
        end
    else
        for j in (i-2)*nvars+1:(i+1)*nvars
            y = y + x[j]
        end
    end
    return y
end

function time_FD()
    nz = 20
    nvars = 20
    xvec = make_variables(:x, nz * nvars)
    functions = Vector{FastDifferentiation.Node}(undef, nz * nvars)
    for i in 1:nz
        for l in 1:nvars
            functions[(i-1)*nvars+l] = f(xvec, i, l, nz, nvars)
        end
    end

    ##
    @time jac = jacobian(functions, xvec)

    ##
    res = similar(jac, Float64)
    fast_jac = make_function(jac, xvec, in_place=true, init_with_zeros=false)

    # println(fast_jac)

    xvalues = rand(nz * nvars)
    fast_jac(res, xvalues)
    println("Dense Jacobian, in_place=true init_with_zeros=false")
    display(@benchmark $fast_jac($res, $xvalues))

    sparse_jac = sparse_jacobian(functions, xvec)
    println("done with sparse_jacobian")
    res = similar(sparse_jac, Float64)

    fast_sparse = make_function(sparse_jac, xvec, in_place=true)

    println("Sparse Jacobian, in_place=true\n\n")
    @show @benchmark $fast_sparse($res, $xvalues)
end

Here are benchmark results:

julia> time_FD()
Dense Jacobian, in_place=true init_with_zeros=false

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.700 μs … 47.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.362 μs ±  1.390 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █▄        ▁
  ▂▄▁██▁▄▆▁▇▅▁▄█▁▇▃▁▃▄▁▂▂▁▁▂▁▂▁▁▁▂▁▁▁▁▂▂▁▁▂▁▂▂▁▂▂▁▁▂▁▁▂▁▂▂▁▁▂ ▃
  9.7 μs          Histogram: frequency by time        13.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.



Sparse Jacobian, in_place=true

BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.111 μs …  5.622 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.156 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.167 μs ± 74.822 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                █    ▄
  ▂▁▁▁▃▁▁▁▁▆▁▁▁▁█▁▁▁▁█▁▁▁▁▂▁▁▁▂▁▁▁▁▂▁▁▁▁▄▁▁▁▁█▁▁▁▁▂▁▁▁▁▂▁▁▁▂ ▂
  2.11 μs        Histogram: frequency by time        2.24 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.

Allocations are zero for the dense case. Not quite sure where those 32 bytes are being allocated in the sparse case; I would expect that to be zero.

These timings are on a branch I will be merging and releasing today or tomorrow.

When you say “3*nz*nvars^2 non-zero entries”, do you mean these entries are all non-constant expressions, or could some of them, maybe even most of them, be constants?

If they are all non-constant expressions then for nz=1000 and nvars=20 that would translate to about 1 million lines of code.

I think FD can do this relatively quickly but LLVM would definitely choke.

However, if many of those elements would be constants then the new soon to be released code should generate many fewer lines of code. And I can add a few more heuristics to minimize this even more.

I am working on something called rerolling, which attempts to convert straight line code back into loops. That could also greatly reduce code size in cases where there are many non-constant expressions generated by a consistent indexing scheme.

But that will come after conditionals. So at least several months down the road.

Almost none of them will be constants in my use case. And although problems with size nz~1000 and nvars~20 will be the most common use, there will be people who will want to use nvars~100-300. But I do not need to compute the entire Jacobian at once; I can separate it into blocks of size 3*nvars^2, each containing a lower, a diagonal, and an upper block (edge cases are a bit different). All these blocks can recycle the same function (I think), so that would be faster. And then I can also easily parallelize over all these blocks. That would still have some redundant calculations, as quantities computed for one block of the Jacobian are also needed in the following and previous blocks. But since FastDifferentiation is showing such a large speedup here, one could potentially just ignore that.

Real problems help guide the development of FastDifferentiation in the most productive way. Do you have an example in mind that works on this larger problem size that might be of use to you?

I could help you get the fastest possible results and extend FastDifferentiation where necessary.

I can try to cook up a more complex example, but it will take me a few days to get to it. I also did manage to get an implementation of my toy problem that lowers memory allocation by a factor of ~50 by manually creating ForwardDiff.Dual numbers and operating on them, rather than relying on the ForwardDiff.jacobian implementation.

Just a minimal example of using Duals directly in case anyone is looking for that:

using ForwardDiff, BenchmarkTools
a = ForwardDiff.Dual(rand(), ForwardDiff.Partials((1.0,0.0,0.0,0.0)))
b = ForwardDiff.Dual(rand(), ForwardDiff.Partials((0.0,1.0,0.0,0.0)))
c = ForwardDiff.Dual(rand(), ForwardDiff.Partials((0.0,0.0,1.0,0.0)))
d = ForwardDiff.Dual(rand(), ForwardDiff.Partials((0.0,0.0,0.0,1.0)))

a*b # we can operate on duals directly

x = [a,b,c,d]
y = similar(x)

##
function eval_sums_inplace(y, x)
    y[1] = x[1] + x[2] + x[3] + x[4]
    y[2] = x[1] - x[2]
    y[3] = 0
    y[4] = -x[3] - x[4]
    return y
end

##
@benchmark eval_sums_inplace($y,$x)

By the end, all entries of the Jacobian are contained in y[i].partials.
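To then fill the Jacobian from there, the partials can be read out directly, e.g.:

# Read the partials back out into a plain matrix (4 partials per Dual here)
J = [ForwardDiff.partials(y[i], j) for i in eachindex(y), j in 1:4]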