Novel method for feature importance using ReverseMode + ForwardMode autodiff?

Using a combination of reverse- and forward-mode autodiff, I have come across a highly efficient method for attributing the change in a differentiable scalar-valued function f: R^N -> R (e.g. a neural network) to its input factors.

As far as I can tell, the standard method for attributing the change in a neural network’s output to the input factors is called Integrated Gradients. However, the usual TensorFlow implementation does this with slow, cumbersome Riemann-sum integration.
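
(By “Riemann-sum integration” I mean something like the sketch below; this is just my illustration of the usual approach, one full reverse-mode gradient per integration step, not TensorFlow’s actual code.)

```julia
using Zygote

# Standard Integrated Gradients, sketched: approximate the path integral of the gradient,
# G ≈ (1/steps) * Σᵢ grad(f)(x + (i/steps)*v), with one reverse-mode gradient per step.
# The per-feature attributions are then v .* G, which sum (approximately) to f(x+v) - f(x).
integrated_gradients_riemann(f, x, v; steps = 50) =
    sum(Zygote.gradient(f, x .+ (i / steps) .* v)[1] for i in 1:steps) ./ steps
```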

Instead, for a sufficiently differentiable (!) network it can be done by applying nested forward-mode autodiff to k(t) = grad(f)(x + t v) at t = 0, where x is the baseline and x + v is the observation to be explained.
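
In code the core step is basically this (illustrative Julia, not the actual code from the write-up; I’m assuming a Zygote reverse-mode gradient composes with ForwardDiff dual numbers here, as in the usual Hessian-vector-product trick):

```julia
using ForwardDiff, Zygote

# k(t) = grad(f)(x + t*v): a single reverse-mode gradient, evaluated along the straight
# path from the baseline x to the observation x + v.
k(f, x, v) = t -> Zygote.gradient(f, x .+ t .* v)[1]

# One forward-mode pass over k gives k'(0) (a Hessian-vector product); nesting further
# forward passes on top gives the higher-order terms needed for the integrated gradient.
f(x) = sum(x .^ 3)                        # toy stand-in for the network / pricing function
x = [1.0, 2.0];  v = [0.5, -0.25]         # baseline and shift
ForwardDiff.derivative(k(f, x, v), 0.0)   # for this f: 6 .* x .* v
```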

For details, see my write-up higherorder_pnl_explain.pdf.
(Note: my field is finance, where instead of a neural network we try to explain the change in the pricing function of a financial instrument, but the math is the same :)).

I’m pretty sure this is novel within finance, but is it also novel within ML?
Or did the TensorFlow guys just not bother doing forward-mode autodiff?

Thoughts/comments very welcome!

ping @ChrisRackauckas


Yes, this is the forward-over-reverse algorithm for Hessian-vector products (and Hessians). I mention it in the SciML book in “Forward-Over-Reverse and Hessian-Free Products”:

https://book.sciml.ai/notes/10-Basic_Parameter_Estimation-Reverse-Mode_AD-and_Inverse_Problems/
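
Roughly, the pattern is (my sketch, with Zygote as the inner reverse-mode gradient; the helper names are illustrative):

```julia
using ForwardDiff, Zygote

# Forward-over-reverse: differentiate a reverse-mode gradient with forward mode.
# A full forward-mode Jacobian of the gradient gives the Hessian; a single directional
# derivative gives the Hessian-vector product without ever materializing the Hessian.
hessian_fr(f, x) = ForwardDiff.jacobian(y -> Zygote.gradient(f, y)[1], x)
hvp_fr(f, x, v)  = ForwardDiff.derivative(t -> Zygote.gradient(f, x .+ t .* v)[1], 0.0)
```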


Yes, but not quite. A single forward pass over reverse only provides a first-order approximation of the change in the gradient. Instead, the idea is to nest M forward passes over one reverse pass to efficiently approximate the integrated gradient to order M.
Also, I’m wondering why the Google guys didn’t do it… I guess it’s hard in TF; it would basically be a one-liner in Julia…
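
Concretely, the nesting looks something like this (again purely illustrative Julia, assuming ForwardDiff duals nest cleanly through Zygote’s gradient for the function in question):

```julia
using ForwardDiff, Zygote

f(x) = sum(exp.(x))                          # toy example
x = [0.2, -0.4];  v = [0.3, 0.1]             # baseline and shift

k(t)  = Zygote.gradient(f, x .+ t .* v)[1]   # the single reverse-mode pass
k1(t) = ForwardDiff.derivative(k, t)         # 1st forward pass over it: k'(t)
k2(t) = ForwardDiff.derivative(k1, t)        # 2nd forward pass nested on top: k''(t)

# k(0), k'(0), k''(0), ... give the Taylor expansion of grad(f) along the path from
# x to x + v; M nested forward passes yield the integrated gradient to order M.
k(0.0), k1(0.0), k2(0.0)
```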


Griewank’s book has a proof, IIRC, that M forward over 1 reverse is the optimal way to compute higher-order derivatives.


I don’t have access to the Griewank book, although it does sound very interesting!

Anyway, for future reference (realizing the initial post was less than clear): the key thing is that for a differentiable function f and vectors x, v, we can efficiently calculate, to order M, the vector G = ∫_0^1 grad(f)(x + t v) dt, which fulfills the exact decomposition f(x+v) - f(x) = v’G (by the fundamental theorem of calculus).
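
To make that concrete, here is an illustrative check (the dk / integrated_gradient helpers below are just my sketch, not the code from the write-up, and I’m assuming the ForwardDiff-over-Zygote composition holds up under nesting for this f):

```julia
using ForwardDiff, Zygote, LinearAlgebra

# m-th derivative in t of a (vector-valued) function of a scalar, by nesting forward mode
dk(k, t, m) = m == 0 ? k(t) : ForwardDiff.derivative(s -> dk(k, s, m - 1), t)

# Order-M approximation of G = ∫_0^1 grad(f)(x + t*v) dt, obtained by integrating the
# Taylor series of the gradient term by term (using ∫_0^1 t^m / m! dt = 1 / (m + 1)!).
integrated_gradient(f, x, v, M) =
    sum(dk(t -> Zygote.gradient(f, x .+ t .* v)[1], 0.0, m) ./ factorial(m + 1) for m in 0:M)

f(x) = sum(exp.(x)) + 0.5 * sum(x .^ 2)      # smooth toy function
x = [0.2, -0.4, 1.0];  v = [0.3, 0.1, -0.2]

G = integrated_gradient(f, x, v, 6)
dot(v, G) - (f(x .+ v) - f(x))               # ≈ 0, up to the truncation error
```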

Intensive googling suggests this trick is not widely known, yet I imagine it has many interesting applications, not least in ML and finance.