For ML tasks with thousands or millions of inputs and a single output (e.g. a loss), forward-mode AD is terribly slow, but there are many other tasks for which it shines.
There are two sets of benchmarks for XGrad: one for CPU (XGrad vs. ReverseDiff) and one for GPU (Arrays vs. CuArrays).
Note that ReverseDiff.jl uses several tricks, described here, that I wasn't aware of when writing the benchmarks (note that the linked thread is about XDiff.jl, a previous incarnation of XGrad.jl, so don't be confused by the name difference). All in all, XGrad and ReverseDiff both apply a number of optimizations and should have very similar performance. If you see an inefficient part in XGrad or a high memory footprint, please report it.
In practice I always try to use CuArrays when possible, since they give a ~10x speedup on my machine. E.g.:
```
Compiling derivatives for CPU
  0.269616 seconds (290.32 k allocations: 36.576 MiB, 1.29% gc time)
Testing on CPU...
BenchmarkTools.Trial:
  memory estimate:  1.15 MiB
  allocs estimate:  67
  --------------
  minimum time:     16.013 ms (0.00% GC)
  median time:      20.134 ms (0.00% GC)
  mean time:        22.887 ms (0.28% GC)
  maximum time:     80.884 ms (0.00% GC)
  --------------
  samples:          219
  evals/sample:     1
Compiling derivatives for GPU
  0.264454 seconds (407.77 k allocations: 27.281 MiB, 23.07% gc time)
Testing on GPU...
BenchmarkTools.Trial:
  memory estimate:  408.38 KiB
  allocs estimate:  611
  --------------
  minimum time:     745.951 μs (0.00% GC)
  median time:      1.922 ms (38.08% GC)
  mean time:        1.973 ms (38.86% GC)
  maximum time:     4.364 ms (25.42% GC)
  --------------
  samples:          2529
  evals/sample:     1
```
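A comparison like the one above can be reproduced with a harness along these lines. This is only a sketch: `loss` is a hypothetical stand-in for the benchmarked function, and the `xdiff` call is assumed to follow XGrad's keyword-argument API for example inputs.

```julia
# Sketch of the CPU-vs-GPU benchmark above (not the exact script used).
using BenchmarkTools
using XGrad
using CuArrays   # GPU arrays; requires a CUDA-capable device

# hypothetical loss used for illustration
loss(W, b, x) = sum(tanh.(W * x .+ b))

W, b, x = rand(100, 100), rand(100), rand(100, 1000)

# compile the derivative once (example inputs guide type inference)...
dloss = xdiff(loss; W=W, b=b, x=x)

# ...then benchmark on CPU
@benchmark $dloss($W, $b, $x)

# ...and on GPU, reusing the same compiled derivative on CuArrays
gW, gb, gx = cu(W), cu(b), cu(x)
@benchmark $dloss($gW, $gb, $gx)
```

Interpolating the arguments with `$` keeps BenchmarkTools from measuring global-variable overhead, so the `Trial` numbers reflect the derivative call itself.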