Copy pasting from Slack
Here is the argument in a nutshell. Start with an n-to-1 function. Its gradient is an n-to-n function. Reverse mode has higher overhead than forward (because of the memory operations required to delay computation), but it needs output-many calls while forward needs input-many calls. So for the first derivative, with one output and n inputs, reverse mode.
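To make the call-count asymmetry concrete, here is a minimal sketch in JAX (the library and the toy f are illustrative choices, not from the original discussion): reverse mode recovers the whole gradient of an n-to-1 function in a single backward pass, while forward mode needs one JVP per input direction.
```python
import jax
import jax.numpy as jnp

def f(x):                                  # n -> 1 scalar function
    return jnp.sum(jnp.sin(x) * x ** 2)

n = 5
x = jnp.arange(1.0, n + 1.0)

# Reverse mode: one backward pass (output-many = 1 call) gives the whole gradient.
grad_reverse = jax.grad(f)(x)

# Forward mode: one JVP per input direction (input-many = n calls).
basis = jnp.eye(n)
grad_forward = jnp.stack([jax.jvp(f, (x,), (basis[i],))[1] for i in range(n)])

print(jnp.allclose(grad_reverse, grad_forward))   # True
```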
But the gradient is n-to-n, so the next derivative should use forward mode, since the call counts are now equal and forward has the lower overhead. But forward mode carries a tangent alongside every primal, so now you are differentiating a 2n-to-2n function. So forward again. But that gives a 4n-to-4n function.
So for m derivatives, it’s a 2^(m-1) n to 2^(m-1) n function evaluation with m-1 layers of forward over a single reverse.
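Here is what those m-1 layers of forward over a single reverse look like as nested transforms, again sketched in JAX purely for illustration; each extra forward layer carries tangents alongside the primals, which is exactly the 2n, 4n, ... doubling described above.
```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

x = jnp.arange(1.0, 4.0)        # n = 3

d1 = jax.grad(f)                # 1st derivative: the single reverse pass, n-vector
d2 = jax.jacfwd(d1)             # 2nd: one forward layer over reverse, n x n Hessian
d3 = jax.jacfwd(d2)             # 3rd: another forward layer, n x n x n tensor

print(d1(x).shape, d2(x).shape, d3(x).shape)    # (3,) (3, 3) (3, 3, 3)
```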
But that is clearly suboptimal compared to numerical differentiation, because instead of the compute cost doubling with each order, each higher derivative is just one more f evaluation, so the cost is linear: roughly m times n. The only way to get down to that with AD is to do a single reverse mode over m-1 layers of Taylor mode, and only if the process optimizes everything to remove the redundant bits do you achieve the same scaling. Hence building a new AD, while doing numerical over reverse for now.
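As a rough sketch of what "numerical over reverse" means here (JAX, the toy f, and the fixed step size are all illustrative assumptions, not from the original discussion): take central finite differences of the reverse-mode gradient along a direction, so each extra derivative order costs a constant number of additional gradient evaluations instead of another doubling of the differentiated program.
```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

grad_f = jax.grad(f)            # one reverse-mode pass per call

def ddir(g, x, v, h=1e-4):
    # Central difference of g along direction v: d/dt g(x + t v) at t = 0.
    return (g(x + h * v) - g(x - h * v)) / (2 * h)

x = jnp.arange(1.0, 4.0)
v = jnp.array([1.0, 0.0, 0.0])

hvp = ddir(grad_f, x, v)                             # ~ Hessian-vector product
d3v = ddir(lambda y: ddir(grad_f, y, v), x, v)       # ~ third-order directional term

print(hvp.shape, d3v.shape)     # (3,) (3,)
```
The fixed step size is naive, and nested central differences lose accuracy quickly, which is part of the tradeoff being accepted "for now". For the Taylor-mode side, `jax.experimental.jet` is one existing implementation, mentioned only as a reference point.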