Novel method for feature importance using ReverseMode + ForwardMode autodiff?

Using a combination of reverse- and forward-mode autodiff, I have come across a highly efficient method for attributing the change in a differentiable scalar-valued function f: R^N -> R (e.g. a neural network) to its input factors.

As far as I can tell, the standard method for attributing the change in a neural network’s output to the input factors is called Integrated Gradients. However, the usual TensorFlow implementation does this with slow, cumbersome Riemann-sum integration.
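
(By “Riemann-sum integration” I mean something like the sketch below; this is just my illustration of the usual approach, one full reverse-mode gradient per integration step, not TensorFlow’s actual code.)

```julia
using Zygote

# Standard Integrated Gradients, sketched: approximate the path integral of the gradient,
# G ≈ (1/steps) * Σᵢ grad(f)(x + (i/steps)*v), with one reverse-mode gradient per step.
# The per-feature attributions are then v .* G, which sum (approximately) to f(x+v) - f(x).
integrated_gradients_riemann(f, x, v; steps = 50) =
    sum(Zygote.gradient(f, x .+ (i / steps) .* v)[1] for i in 1:steps) ./ steps
```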

Instead, for a sufficiently differentiable (!) network it can be done by applying nested forward-mode autodiff to k(t) = grad(f)(x + t v) at t = 0, where x is the baseline and x + v is the observation to be explained.
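
In code the core step is basically this (illustrative Julia, not the actual code from the write-up; I’m assuming a Zygote reverse-mode gradient composes with ForwardDiff dual numbers here, as in the usual Hessian-vector-product trick):

```julia
using ForwardDiff, Zygote

# k(t) = grad(f)(x + t*v): a single reverse-mode gradient, evaluated along the straight
# path from the baseline x to the observation x + v.
k(f, x, v) = t -> Zygote.gradient(f, x .+ t .* v)[1]

# One forward-mode pass over k gives k'(0) (a Hessian-vector product); nesting further
# forward passes on top gives the higher-order terms needed for the integrated gradient.
f(x) = sum(x .^ 3)                        # toy stand-in for the network / pricing function
x = [1.0, 2.0];  v = [0.5, -0.25]         # baseline and shift
ForwardDiff.derivative(k(f, x, v), 0.0)   # for this f: 6 .* x .* v
```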

For details, see my write-up higherorder_pnl_explain.pdf.
(Note: my field is finance, where instead of a neural network we try to explain the change in the pricing function of a financial instrument, but the math is the same :)).

I’m pretty sure this is novel within finance, but is it also novel within ML?
Or did the TensorFlow guys just not bother doing forward-mode autodiff?

Thoughts/comments very welcome!

ping @ChrisRackauckas


Yes, this is the forward-over-reverse algorithm for Hessian-vector products (and Hessians). I mention it in the SciML book in “Forward-Over-Reverse and Hessian-Free Products”:

https://book.sciml.ai/notes/10-Basic_Parameter_Estimation-Reverse-Mode_AD-and_Inverse_Problems/
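
Roughly, the pattern is (my sketch, with Zygote as the inner reverse-mode gradient; the helper names are illustrative):

```julia
using ForwardDiff, Zygote

# Forward-over-reverse: differentiate a reverse-mode gradient with forward mode.
# A full forward-mode Jacobian of the gradient gives the Hessian; a single directional
# derivative gives the Hessian-vector product without ever materializing the Hessian.
hessian_fr(f, x) = ForwardDiff.jacobian(y -> Zygote.gradient(f, y)[1], x)
hvp_fr(f, x, v)  = ForwardDiff.derivative(t -> Zygote.gradient(f, x .+ t .* v)[1], 0.0)
```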


Yes, but not quite. A single forward pass over reverse only provides a first-order approximation of the change in the gradient. Instead, the idea is to nest M forward passes over one reverse pass to efficiently approximate the integrated gradient to order M.
Also, I’m wondering why the Google guys didn’t do it… I guess it’s hard in TF; it would basically be a one-liner in Julia…
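
Concretely, the nesting looks something like this (again purely illustrative Julia, assuming ForwardDiff duals nest cleanly through Zygote’s gradient for the function in question):

```julia
using ForwardDiff, Zygote

f(x) = sum(exp.(x))                          # toy example
x = [0.2, -0.4];  v = [0.3, 0.1]             # baseline and shift

k(t)  = Zygote.gradient(f, x .+ t .* v)[1]   # the single reverse-mode pass
k1(t) = ForwardDiff.derivative(k, t)         # 1st forward pass over it: k'(t)
k2(t) = ForwardDiff.derivative(k1, t)        # 2nd forward pass nested on top: k''(t)

# k(0), k'(0), k''(0), ... give the Taylor expansion of grad(f) along the path from
# x to x + v; M nested forward passes yield the integrated gradient to order M.
k(0.0), k1(0.0), k2(0.0)
```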


Griewank’s book has a proof, IIRC, that M forward over 1 reverse is the optimal way to compute higher-order derivatives.


I don’t have access to the Griewank book, although it does sound very interesting!

Anyway, for future reference (realizing the initial post was less than clear): the key thing is that for a differentiable function f and vectors x, v, we can efficiently calculate, to order M, the vector G = ∫_0^1 grad(f)(x + t v) dt, which fulfills the exact decomposition f(x+v) - f(x) = v’G (by the fundamental theorem of calculus).
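
To make that concrete, here is an illustrative check (the dk / integrated_gradient helpers below are just my sketch, not the code from the write-up, and I’m assuming the ForwardDiff-over-Zygote composition holds up under nesting for this f):

```julia
using ForwardDiff, Zygote, LinearAlgebra

# m-th derivative in t of a (vector-valued) function of a scalar, by nesting forward mode
dk(k, t, m) = m == 0 ? k(t) : ForwardDiff.derivative(s -> dk(k, s, m - 1), t)

# Order-M approximation of G = ∫_0^1 grad(f)(x + t*v) dt, obtained by integrating the
# Taylor series of the gradient term by term (using ∫_0^1 t^m / m! dt = 1 / (m + 1)!).
integrated_gradient(f, x, v, M) =
    sum(dk(t -> Zygote.gradient(f, x .+ t .* v)[1], 0.0, m) ./ factorial(m + 1) for m in 0:M)

f(x) = sum(exp.(x)) + 0.5 * sum(x .^ 2)      # smooth toy function
x = [0.2, -0.4, 1.0];  v = [0.3, 0.1, -0.2]

G = integrated_gradient(f, x, v, 6)
dot(v, G) - (f(x .+ v) - f(x))               # ≈ 0, up to the truncation error
```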

Intensive googling suggests this trick is not widely known, yet I imagine it has many interesting applications, not least in ML and finance.