Help avoiding perturbation confusion with symbolic forward mode auto-differentiation (how to tag dual numbers)

question

#1

Hey there, so I’m playing around with making my own symbolic math package and I’m attempting to do forward mode automatic differentiation on my symbolic functions. The problem is that I am struggling to find a clean way around a problem called perturbation confusion.

First, some background:

Forward Mode Automatic Differentiation
To solve certain math problems we’ve invented new types of numbers. One example is the imaginary numbers which include a unit imaginary i which is defined such that i^2 = -1. Similarly, to solve differentiation problems it’s useful to use what people in computing call Dual Numbers defined by a new types of number \epsilon with the algebraic property that \epsilon^2 = 0. It turns out that for a sufficiently well behaved function f, if one evaluates f(x + \epsilon)$ they will find that equals f(x) + \epsilon f'(x) where f'(x) is the first derivative of f. If one constructs in the computer a type which has the above properties and defines multiple dispatch for all the relevant mathematical functions so that they know how to deal with dual numbers then one can defined a derivative operator as

function D(f::Function)
    x -> (f(x + ϵ)).infinitesimal
end

where z.infinitesimal just extracts the terms in z which are multiplied by ϵ.

Hence, the derivative of a function f(x) is found simply by calling D(f)(x)

Where am I at?

I have managed so far to define multiple dispatch methods on most of the basic mathematical functions I care about so that they can accept symbolic arguments and dual number arguments and have (as far as I can tell so far) correct behaviour and I have a rudimentary simplification system that is able to apply mathematical identities and simplify certain symbolic mathematical expressions. Things are working pretty well for the early stages.

My dual number type currently is defined basically like

struct Dual_Number
    real::Union{Symbolic, Number}
    infinitesimal::Union{Symbolic, Number}
end

and for convenience I just define ϵ = Dual_Number(0,1)

Great so whats the problem?

Perturbation Confusion
Consider trying to take a second derivative with this system.

(D^2)(f) == D(D(f))
         == D(x -> (f(x + ϵ)).infinitesimal)
         == y -> ((x -> (f(x + ϵ)).infinitesimal)(y + ϵ)).infinitesimal
         == y -> 0

Uh-oh!

The problem here is that when I try to take a second derivative it has to go through the first derivative and so it essentially is trying to feed in to f and then take the infinitesimal part twice which is always going to be zero since the first time you extract the infinitesimal part you get just the real part which by definition has no infinitesimal part. This is known as perturbation confusion because the derivative system doesn’t know which infinitesimal part its supposed to extract.

After encountering this problem, the main strategy I’ve found in the literature for avoiding perturbation confusion ( though it’s also not perfect) has been to ‘tag’ each dual number and then only extract the dual number components you meant to extract each time you take the derivative.

ie. instead of ϵ we want to write \epsilon_1, \epsilon_2, \epsilon_3, … and then change our arithmetic rules such that \epsilon_i \epsilon_j = 0 only if i = j and otherwise we just leave \epsilon_i \epsilon_j alone.

Then if we imagine implementing the proper computer implementation of ϵᵢ that obeys the above properties then one could simply implement a derivative operator as

function D(f::Function)
   global tag += 1
    x -> extract_infinitesimal(f(x + ϵ(tag)), tag)
end

where extract_infinitesimal(z, tag) only extracts the infinitesimal part of z associated with the index tag and epsilon(tag) generates a new dual number with tag index tag. This should then be immune to most forms of perturbation confusion.


The Question

Okay, sorry for the long winded preamble. My problem is that I am really struggling to see how best to implement these tagged dual numbers. I’ve tried to look at the source code for various packages that use this tagging method and have failed to understand whats going on. My intuition would be that the Dual_Number types should have a slot that carries tag information but then I’m seeing that I’ll need a slot in the Dual_Number type that corresponds to each tag because I need to know the difference between x + 2ϵ₁ + ϵ₂ and x + ϵ₁ + 2ϵ₂. Further, there’s the problem of cross terms so I need some what of dealing with terms like x + 2ϵ₁ϵ₂ without setting that infinitesimal part to zero and have a way to extract out x + 2ϵ₁ or x + 2ϵ₂ as needed.

Maybe there is some magic one can do with Julia’s type system that generates a new type for each combination of tags and would be friendly to my notions of multiple dispatch on dual numbers but I’m really not very well versed in the subtleties of the type system.

Maybe theres a better way I haven’t thought of? If anyone has taken the time to read this and has a suggestion I’d love to hear it.


#2

Have you looked at ForwardDiff.jl’s implementation? That’s a tagged Dual implementation using the Julia type system.


#3

I’ve tried to read it and didn’t really understand what I was looking at. I’ll try again though.


#4

The logic in ForwardDiff.jl is a little complicated, but the basic idea is that the Dual type takes an extra parameter T: the function

gradient(f::F, AbstractArray{V})

internally uses a Dual{Tag{F,V}} object.

So the problem becomes then how should you combine Dual objects created with different tags? ForwardDiff.jl creates an ordering of Tags using a generated function called tagcount, which in turn defines a total ordering using the ≺/ “\prec” relation. This relation is used to define promotion to create nested Dual objects, so that the most recently created tag appears as the “outside” most tag.

Why do we want the most recently created as the outermost tag? Well, that’s because it is the first one we will want to remove, e.g. say you wanted to compute a Hessian by:

g(y) = gradient(f,y)
jacobian(g, x)

then:

  • the jacobian call would use a Tag{typeof(g), Float64} tag, and
  • the gradient call would use a Tag{typeof(f), Dual{Tag{typeof(g), Float64}}} tag (there are a couple of other parameters to Dual which I’m ignoring here).

The gradient tag depends on the jacobian tag, so will have a higher tagcount. This is useful, because it is the one whose derivative information we need to extract first.


#5

Thank you for the explanation, Simon. One of the reasons I have been avoiding ForwardDiff.jl is that it has all sorts of optimizations all over the place that make it hard for me to parse whats going on. You comment will at least give me a road map to figure out how I can handle this.

By the way, does ForwardDiff.jl suffer from the problem shown in the paper I linked above?


#6

I’m not an expert here (@jrevels might be able to help me out here), but my understanding is that ForwardDiff.jl should avoid this problem as the tags aren’t generated by the closures themselves, but by the argument types.

That said, I’m sure there are some weird ways to break it by destroying this type information (e.g. by passing a callable mutable container type or some such), but in general it is pretty robust.


#7

No, it does not. ForwardDiff doesn’t eta reduce perturbation coefficient extraction, which is the central cause of the issue described in that paper (it’s a great paper btw, happy to see it mentioned here!).

Here’s an informal verification that the problem example given in the paper can be differentiated correctly using ForwardDiff (I just used sin for f):

julia> using ForwardDiff

julia> s(u, f, x) = f(x + u)
s (generic function with 1 method)

julia> const D = ForwardDiff.derivative
derivative (generic function with 4 methods)

julia> x = rand()
0.21292855172461755

julia> D(y -> D(z -> s(0.0, sin, z), y), x) === D(y -> D(sin, y), x)
true

Cool to see more AD in the Julia world - Julia is an incredibly fun language for implementing AD!