I want to write a neural network using ForwardDiff.jl package but I can’t find any example.
If the number of parameters is much greater than the number of outputs, then you should use reverse-mode AD.
A neural network normally has hundreds or thousands of parameters (the weights and biases) and one output (the loss).
That said, you can still use forward mode; it will even be faster than reverse mode if you only have 5–10 parameters.
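For the case where forward mode is fine (a handful of parameters, one scalar output), a minimal sketch with ForwardDiff.jl looks like this; the quadratic `loss` here is a made-up toy, not anything from the thread:

```julia
using ForwardDiff  # forward-mode AD via dual numbers

# Toy scalar "loss" over a small parameter vector θ.
loss(θ) = sum(abs2, θ .- 1.0)

θ = [0.5, 2.0, -1.0]

# Gradient of Σᵢ (θᵢ - 1)² is 2(θᵢ - 1).
g = ForwardDiff.gradient(loss, θ)  # → [-1.0, 2.0, -4.0]
```

`ForwardDiff.gradient` pushes dual numbers through `loss`, so the cost grows with the length of `θ` (roughly one pass per chunk of parameters), which is exactly why it stops being attractive once the network has thousands of weights.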
I first tried to use Flux.jl to reproduce the results of a paper about normalizing flows. My PyTorch code works, but the Flux code doesn't, and it seems that Flux suffers from numerical problems. I don't care about training time because I don't think there would be a huge difference.
It is strange that Flux.jl, the main DL package in Julia, suffers from numerical problems!
If you mean floating-point truncation errors, then that is unlikely: AD doesn't incur truncation errors.
Flux may have bugs, and if so you should open issues, but it really shouldn't have round-off errors.
Feel encouraged to start another thread about that; someone might help you debug it.
Or, if you can put together a vaguely minimal example of an error, open an issue on GitHub.
I don’t care about training time because I don’t think there would be a huge difference.
It will differ by orders of magnitude.
Computing the gradient via forward mode basically involves running the code once per parameter, whereas reverse mode runs once per output (which is to say, once).
Reverse mode does have a higher overhead, but unless the net is a tiny toy from the '90s, that overhead won't dominate.
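The cost difference can be seen directly by computing the same gradient both ways; this sketch uses ReverseDiff.jl (mentioned below in the thread) and a made-up scalar loss, so the specific function is an illustration, not anyone's actual model:

```julia
using ForwardDiff, ReverseDiff

# One scalar output, n parameters: the regime where reverse mode wins.
n = 1_000
w = randn(n)
loss(θ) = sum(abs2, w .* θ)  # toy loss, stands in for a network's loss

θ = randn(n)

g_fwd = ForwardDiff.gradient(loss, θ)  # cost scales with n (≈ n/chunk passes)
g_rev = ReverseDiff.gradient(loss, θ)  # one forward pass + one backward pass

g_fwd ≈ g_rev  # both modes agree on the gradient; they differ only in cost
```

Timing the two calls (e.g. with `@time` or BenchmarkTools.jl) on a model with thousands of parameters is the quickest way to see the "orders of magnitude" claim for yourself.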
So is it better to use ReverseDiff.jl? My whole model has fewer than two thousand parameters and I don't need to train it for long; the loss almost converges after 100 epochs.
I do not think that errors will be a problem. There is a somewhat outdated package implementing Masked Autoregressive Flows here
Also, we have written a package
which uses "Dense" flows, where the dense matrix is optimized in its SVD form, allowing efficient calculation of the Jacobian and inverse. We never had a problem with numerical stability.
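A quick sketch of why the SVD parameterisation helps (this is our reading of the idea, not the package's actual code): if the dense layer's matrix is kept as `W = U * Diagonal(s) * V'` with `U`, `V` orthogonal, then the log-Jacobian-determinant a flow needs is just a sum over the singular values, and the inverse is equally cheap:

```julia
using LinearAlgebra

# Hypothetical SVD-parameterised dense matrix: orthogonal U, V and
# positive singular values s (in a real flow these would be the
# trainable parameters, with U and V constrained to stay orthogonal).
U, _ = qr(randn(4, 4)); U = Matrix(U)
V, _ = qr(randn(4, 4)); V = Matrix(V)
s = abs.(randn(4)) .+ 0.1
W = U * Diagonal(s) * V'

# Change-of-variables term: log|det W| = Σᵢ log sᵢ  (no LU needed).
logabsdet(W)[1] ≈ sum(log, s)

# The inverse is just as cheap: W⁻¹ = V * S⁻¹ * Uᵀ.
inv(W) ≈ V * Diagonal(1 ./ s) * U'
```

Because `s` is kept strictly positive, the determinant can never silently collapse to zero, which is one plausible reason this parameterisation avoids the numerical trouble described above.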
Try implementing spline flows. I tried for months, and it seems there is an unsolved bug in Flux.
If bugs are not reported then they are not fixed.