AutoDiff, handling of perturbation collision

Hi everyone,

I am developing my own flavour of automatic differentiation using Dual numbers. First - why would I ever do that when so much work has been put into ForwardDiff?

ForwardDiff works by transforming functions. You use it with calls like

∇f = x -> ForwardDiff.gradient(f, x)
slope = ∇f(x)

Totally nifty, but this seems to limit the API to “single input, single output” functions. One could probably work around this with closures, but isn’t that awkward? I would have loved to contribute an extension of the API, but I am so out of my depth with the ForwardDiff code that I am afraid it won’t be me…

The underlying dual numbers do not have this limitation though. In my own little implementation, a typical usage looks like this (computing the imbalance vector of a finite element):

        Δ   = δ(y) # perturbation here
        R,χ = balance(e,t,y+α*Δ,y∂t+β*Δ,zeros(vsize(y)),χo,dbg)
        return value(R),∂(R),χ

Here ∂(R) is αdR/dy + βdR/dy’ = αK+βC with α=1 and β=1/Δt delivering the goods for an implicit Euler time-stepping. And anyway, there’s always the value of learning something!
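
(Spelled out: everything depends on the single perturbation Δ, so the chain rule gives d(balance)/dΔ = α·dR/dy + β·dR/dy′ = αK + βC, and with α = 1 and β = 1/Δt that is precisely the K + C/Δt matrix an implicit Euler step needs.)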

I got it all working, including nested derivatives, but precisely because multiple-argument functions may be differentiated, I open the door to perturbation collisions. I know how to handle that from a mathematical point of view: to each point in my code that introduces a perturbation, I must associate a “tag”. When two duals meet, they need to compare tags to decide who rides on top.

This leaves me with two questions: how to generate tags and how to compare tags at compile time.

When I try comparing tags like this

combine(a::dR{Ta},b::dR{Tb}) where{Ta<Tb} = println("b rides on top")

Julia gives me a frown.

So I go spying in the ForwardDiff code, but remain largely baffled after multiple readings. In Dual.jl, in struct Dual{T,V,N} <: Real, clearly T is the tag. The ≺ operator is used to compare tags, but apart from the method ≺(a,b) = throw(DualMismatchError(a,b)), where is ≺ defined?

The ≺ operator is used in the promotion rules - not, as far as I can find, in the binary operations themselves:

Base.@pure function Base.promote_rule(::Type{Dual{T1,V1,N1}},
                                      ::Type{Dual{T2,V2,N2}}) where {T1,V1,N1,T2,V2,N2}
    # V1 and V2 might themselves be Dual types
    if T2 ≺ T1
        Dual{T1,promote_type(V1,Dual{T2,V2,N2}),N1}
    else
        Dual{T2,promote_type(V2,Dual{T1,V1,N1}),N2}
    end
end

Does this resolve at compile time? Thanks to greedy evaluation?

Am I right in assuming that the strategy here is to promote all arguments before the binary operations, which are only defined for arguments with the same tag (as opposed to defining binary operations that tackle mixed types)?

Finally, tag generation in Derivatives.jl:

T = typeof(Tag(f, R))

I do not find where Tag is defined. Am I right to assume that the inputs f and R are a way to identify a perturbation?

If you have read that far: thank you! :grin:

Philippe


I think that developing a simple layer that concatenates/splits vectors as needed for input/output is orders of magnitude easier than replicating ForwardDiff.jl. A lot of work and testing went into that package.

Hi Tamas,
You are right, I am sure. But that is not something I am able to do myself - in spite of the time I spent trying to understand the ForwardDiff code.

But we could say such an API is a request from me to the ForwardDiff team - and I would be glad to give my view of what the API could look like. And I would love to retrofit ForwardDiff into my little FEM code.

I don’t think I can offer insight into your actual question, but for the sake of better understanding the problem with ForwardDiff, is it that it’s built on the idea of single-argument functions? And the inconvenience of closures you mentioned would be the following:

# some 2-vector-arg function I want the gradient of
q(a, b) = a + b
# a 1-arg wrapper (let's assume a and b each have length 3)
f(x) = q(view(x, 1:3), view(x, 4:6))

If so, it seems like you wouldn’t necessarily need to engage much with the existing ForwardDiff code in order to write an extension that performs the join-then-split operations necessary. One way would be that any >1-argument call to the gradient would first create the necessary closure, send that to the normal API, and then also unwrap the result at the end. Surely this would be easier than redoing the work from scratch?
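
For concreteness, a rough, untested sketch of that idea - the name gradient_nargs is made up, and it assumes every argument is a plain vector:

using ForwardDiff

function gradient_nargs(f, xs::AbstractVector...)
    lens   = collect(length.(xs))
    stops  = cumsum(lens)                 # where each argument ends in the joined vector
    starts = stops .- lens .+ 1           # where each argument starts
    joined = vcat(xs...)                  # join everything into one vector for ForwardDiff
    g = ForwardDiff.gradient(joined) do z
        f((view(z, a:b) for (a, b) in zip(starts, stops))...)   # split again before calling f
    end
    return Tuple(g[a:b] for (a, b) in zip(starts, stops))       # split the gradient the same way
end

q2(a, b) = sum(a .* b)
gradient_nargs(q2, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
# ([4.0, 5.0, 6.0], [1.0, 2.0, 3.0])

It never touches ForwardDiff internals; non-vector arguments would of course need a bit more bookkeeping.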


I agree, I have been pondering this, and wrappers would do the trick - as in, I can make it work. But I will put another requirement on the code: it should be expressive, and there the wrapper scores a little lower. An enhanced ForwardDiff.jl API would still be great.

And again - my implementation of forward diff was for the heck of it, and I can recommend the exercise to anyone interested in understanding why Julia rocks! :grinning:


I’ve got a couple approaches that might help.

  1. Using ArrayPartition from RecursiveArrayTools.jl. The downside of this approach is that all of your inputs have to be arrays and your function has to be (possibly) rewritten to handle that. There might be some clever ways to work around this restriction.
using RecursiveArrayTools
using ForwardDiff

function gradient(f)
    eff(args) = f(args.x...)
    return (args...) -> ForwardDiff.gradient(eff, ArrayPartition(args...)).x
end

f(x,y) = sin(x[1])*exp(y[1])
∇f = gradient(f)

gradf = ∇f([3],[4])
#([-54.051758861078156], [7.704891372731154])
  2. Using ComponentArrays.jl lifts the requirement that all inputs be arrays, but your function will have to use keyword arguments. The downside of this approach is that the ComponentArray keyword-style constructor is still pretty slow, and that will slow down the whole calculation. I think there are ways to work around the issue, I’d just have to think about it.
using ComponentArrays
using ForwardDiff

function gradient(f)
    function eff(args)
        k = keys(args)
        v = (args[key] for key in k)
        return f(; (key=>val for (key,val) in zip(k, v))...)
    end
    return (;args...) -> ForwardDiff.gradient(eff, ComponentArray(;args...))
end

f(;x,y) = sin(x)*exp(y)
∇f = gradient(f)

gradf = ∇f(x=3, y=4)
# ComponentVector{Float64}(x = -54.051758861078156, y = 7.704891372731154)

The elements can then be accessed via gradf.x or gradf.y. This also works with array elements.
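
For example, a quick (untested) variation with an array-valued argument, reusing the same gradient wrapper:

g(; x, y) = sum(sin.(x)) * exp(y)
∇g = gradient(g)
grad = ∇g(x=[1.0, 2.0], y=3.0)
grad.x   # gradient with respect to the vector argument x
grad.y   # gradient with respect to the scalar y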


Thanks!

My view on this is the same as in my reply to tomerarnon - I am sure this will work, but the API won’t exactly be neat.

Now I know about RecursiveArrayTools.jl and ComponentArrays.jl, though!


There was once a movement to move ForwardDiff.Dual out as a DualNumbers.jl replacement:

https://github.com/JuliaDiff/DualNumbers.jl/issues/45

So perhaps that’s a better way to go.

A ForwardDiff.Tag{Function,ValueType} generates a unique UInt value for each combination of types, and comparing two tags amounts to comparing the generated values.
This is the code used in ForwardDiff for tag creation, in config.jl:

struct Tag{F,V} # carries no data; just the type is stored
end

const TAGCOUNT = Threads.Atomic{UInt}(0) #count of created tags

# each tag is assigned a unique number
# tags which depend on other tags will be larger
# tagcount generates an unique number for each combination of types
@generated function tagcount(::Type{Tag{F,V}}) where {F,V} 
    :($(Threads.atomic_add!(TAGCOUNT, UInt(1))))
end

function Tag(f::F, ::Type{V}) where {F,V}
    tagcount(Tag{F,V}) # trigger generated function
    Tag{F,V}()
end

@inline function ≺(::Type{Tag{F1,V1}}, ::Type{Tag{F2,V2}}) where {F1,V1,F2,V2}
    tagcount(Tag{F1,V1}) < tagcount(Tag{F2,V2})
end

Sorry for my late reply, I am just back from a holiday. Thank you longemen, neat and tidy answer!

So for each tag type we generate a tagcount method that returns a constant - the value of the counter at the time the method is generated - and the Tag constructor calls this generated function. Got it.
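
If I got that right, something like this should hold (untested, and assuming neither tag type has been seen before in the session):

using ForwardDiff

t1 = ForwardDiff.Tag(sin, Float64)
t2 = ForwardDiff.Tag(cos, Float64)
ForwardDiff.tagcount(typeof(t1))   # some UInt, fixed once and for all for this tag type
ForwardDiff.tagcount(typeof(t2))   # a larger UInt, because this tag type was generated later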

And one big trick in this code is the use of “atomic”. For one, it makes the code thread-safe. But more important in this setting: TAGCOUNT is a const, yes, but it is a container, and its contents can be mutated (here: by atomic_add!).
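
A tiny illustration of that point (my own toy example, nothing to do with ForwardDiff):

const COUNTER = Threads.Atomic{UInt}(0)   # the binding COUNTER is constant, its contents are not
Threads.atomic_add!(COUNTER, UInt(1))     # returns the previous value, here 0
COUNTER[]                                 # now 1 - the container was mutated in place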

The other remark is the use of F,V to uniquely identify a tag. Now this brings us back to the start of the discussion: how easily can ForwardDiff be adapted to operate on something other than a function? The function’s identity is used as part of the tag’s identity, and that will not work with my idea of doing forward differentiation on an arbitrary segment of code.

I believe (read: no proof is offered) that I would get away with it if a tag is uniquely created at a given point within my code. Turns out it’s easy (plundering with thanks):

const TAGCOUNT = Threads.Atomic{UInt}(1) #count of created tags

macro tagline()
    return :($(Threads.atomic_add!(TAGCOUNT, UInt(1))))
end

# each @tagline expands to a fixed value at macro-expansion time,
# so every iteration prints the same two tags
for i = 1:3
    @show @tagline()
    @show @tagline()
end

It looks like I might have a plan!!! :grin:


Still pondering this, and I am beginning to understand the design of ForwardDiff.jl, centered on functions.

I made a system where the tag of a dual number (I call them Adiff here) is associated with a position in the code. Type-stable, all at compile time - I am beginning to get the hang of Julia.

## Design
const TAGCOUNT = Threads.Atomic{Int}(1)
macro tag()
    return :(Val($(Threads.atomic_add!(TAGCOUNT, Int(1)))))
end

struct Adiff{T}
    x ::Float64
    dx::Float64
end
@inline variate(tag::Val{T},x) where{T}= Adiff{T}(x,1.)
@inline @generated binaryop(A::Adiff{T1},B::Adiff{T2}) where{T1,T2} = T1<T2 ? :(A) : :(B) # placeholder for binary operations


## Test

# works as advertised
for i = 1:3
    X = variate(@tag,2. *i)
    @show X
end
Y = variate(@tag,8.)
Z = variate(@tag,9.)
A = binaryop(Y,Z)
@show Y
@show Z
@show A

# typestable, fast, all hard work done at compile time
T = @tag
@code_warntype @tag
@code_warntype variate(T,8.)
@code_warntype binaryop(Y,Z)
@time variate(@tag,8.)
@time binaryop(Y,Z)

# but... using the sequence in the code as a proxy for the call stack is wishful thinking
foo(x,y) = binaryop(x,y)
hoo()    = goo(variate(@tag,2.))    # swapping these two lines of code
goo(x)   = foo(x,variate(@tag,1.))  # changes the result. That's what I asked for, but not what I wanted!

z = hoo()
@show z    

But it is useless, for the simple reason that I actually want my tags to reflect the nesting in the call stack - or let’s say, the nesting in the scoping stack (the inside of a for loop being “lower” than the outside of the loop in the same function). I will have to sleep on this. Of course the user could hard-code tags by hand as a workaround, but this is hardly a good solution for complicated code.

But I do wonder how ForwardDiff.jl does this: the tags Tag{F,V}, if I understand correctly, identify a combination of function and input type - loosely speaking, a method - but why would that be the correct way to nest Duals?
