Slope at a point of RNN using Flux

Let’s focus on the recurrent unit first. Call it f. It is a function of two variables, the current element of the input sequence x and its internal/hidden state h: y = f(x, h).
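To make this concrete, here is a minimal sketch of such a cell (the plain tanh form, the sizes, and the random weights are just stand-ins; a trained Flux `RNNCell` has the same structure):

```julia
# Hypothetical tanh recurrent cell y = f(x, h); a trained Flux RNNCell
# has the same shape, just with learned parameters.
nin, nh = 1, 4                                        # made-up input/hidden sizes
Wx, Wh, b = randn(nh, nin), randn(nh, nh), randn(nh)
f(x, h) = tanh.(Wx * x .+ Wh * h .+ b)
```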

Taking the total derivative yields

dy = \partial_x f\,dx + \partial_h f\,dh

Your reasoning focused on the first term only and neglected the change in the hidden state.

Because the output becomes the hidden state in the next iteration, going through the sequence gives the recursive relation y_k = f(x_k, y_{k-1}) where y_0 is a learned parameter.
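On a toy input sequence the unrolling could look like this (continuing the sketch above; x_k = k and the zero initial state are assumptions, in the real model y_0 is the learned parameter):

```julia
# Toy inputs x_k = k as length-1 vectors, and the unrolled hidden states
# y_k = f(x_k, y_{k-1}); zeros(nh) stands in for the learned y_0.
xs = [[Float64(k)] for k in 1:10]
ys = accumulate((h, x) -> f(x, h), xs; init = zeros(nh))
```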

The point is that the slope of the graph plotted above is not the change of y_k with x_k but with k. Imagine a continuous version of the map in which x(t) and y(t) are parameterized by a real number instead of an integer. Then one could write, for some \lambda,

y(t) = f(x(t), y(t-\lambda))

You are after dy/dt, which involves varying both arguments.

By the recursion relation,
\frac{dy_k}{dt} = \partial_x f(x_k, y_{k-1})\,\underbrace{\dot{x}(t)}_{=1} + \partial_h f(x_k, y_{k-1})\,\frac{dy_{k-1}}{dt}
This is again a recursion, and thus one needs to accumulate the derivatives along the sequence of inputs and hidden states to calculate the sought-after slope.
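Continuing the sketch, that accumulation could look like the following (ForwardDiff is used here for the partial Jacobians; Zygote should work just as well):

```julia
using ForwardDiff

# Accumulate dy_k/dt along the sequence via the recursion above.
# Assumes dx/dt = 1 (the toy inputs step by one per index); y_0 is a
# parameter, so dy_0/dt = 0.
function sequence_slopes(f, xs, y0)
    y, dy = y0, zero(y0)
    slopes = typeof(dy)[]
    for x in xs
        Jx = ForwardDiff.jacobian(x_ -> f(x_, y), x)   # ∂_x f at (x_k, y_{k-1})
        Jh = ForwardDiff.jacobian(h_ -> f(x, h_), y)   # ∂_h f at (x_k, y_{k-1})
        dy = Jx * ones(length(x)) + Jh * dy            # chain rule with ẋ = 1
        y  = f(x, y)                                   # advance the hidden state
        push!(slopes, dy)
    end
    return slopes
end

slopes = sequence_slopes(f, xs, zeros(nh))             # dy_k/dt for every k
```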

Lastly, the result is to be multiplied by the Jacobian of the dense layer.
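With a stand-in for the dense output layer (the shape is assumed; with the trained model you would use its actual weights), that last step is:

```julia
# Hypothetical output layer; with a trained model use the actual Dense layer.
Wd, bd = randn(1, nh), randn(1)
dense(h) = Wd * h .+ bd

# Slope of the network output at step k: Jacobian of the output layer
# evaluated at y_k, times the accumulated hidden-state derivative dy_k/dt.
output_slopes = [ForwardDiff.jacobian(dense, ys[k]) * slopes[k] for k in eachindex(xs)]
```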

I’ve tried it on a network similar to yours, but with one input dimension instead of three (Derivative of RNN approximating piecewise linear function · GitHub), learning a piecewise linear function with slopes 7 and 2.

Going through the motions, I end up with the following comparison between the derivative and the finite-difference approach.

(plot: slope, derivative vs. finite difference)

One could certainly try to train the model better, but I guess the ballpark is alright.
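In case it is useful, on the toy sketch above the finite-difference side of that comparison is just the difference of consecutive outputs, since the step between sequence elements is 1:

```julia
# Finite-difference slopes along the sequence; these should roughly track
# output_slopes from the accumulated-derivative calculation.
outs      = [dense(h) for h in ys]
fd_slopes = [outs[k+1] - outs[k] for k in 1:length(outs)-1]
```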