Fate of ReverseDiffSource

For ML tasks with thousands or millions of inputs and a single output (e.g. a loss), forward-mode AD is terribly slow, but there are many other tasks for which it shines.
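As a rough illustration (not part of the original post): forward mode needs one sweep per input direction, while reverse mode needs one sweep per output, so for a scalar loss over many parameters reverse mode wins by a wide margin. A minimal sketch with ForwardDiff.jl and ReverseDiff.jl, with a toy loss and an arbitrary input size:

```julia
# Toy comparison: gradient of a scalar loss over n inputs.
# Forward mode propagates a dual number per input direction (in chunks),
# so its cost grows with length(w); reverse mode does a single backward sweep.
using ForwardDiff, ReverseDiff, BenchmarkTools

loss(w) = sum(abs2, w)      # scalar output, many inputs
w = rand(10_000)

@btime ForwardDiff.gradient($loss, $w)   # cost scales with the input size
@btime ReverseDiff.gradient($loss, $w)   # single reverse sweep
```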

There are two sets of benchmarks for XGrad: one for CPU (XGrad vs. ReverseDiff) and one for GPU (Arrays vs. CuArrays).

Note that ReverseDiff.jl has several tricks described here that I wasn't aware of when writing the benchmarks (that thread is about XDiff.jl, a previous incarnation of XGrad.jl, so don't be confused by the differences). All in all, XGrad and ReverseDiff both apply a number of optimizations and should have very similar performance. If you see an inefficient part in XGrad or a high memory footprint, please report it.

In practice I always try to use CuArrays when possible, since they give a ~10x improvement on my machine. E.g.:

Compiling derivatives for CPU
  0.269616 seconds (290.32 k allocations: 36.576 MiB, 1.29% gc time)
Testing on CPU...
BenchmarkTools.Trial: 
  memory estimate:  1.15 MiB
  allocs estimate:  67
  --------------
  minimum time:     16.013 ms (0.00% GC)
  median time:      20.134 ms (0.00% GC)
  mean time:        22.887 ms (0.28% GC)
  maximum time:     80.884 ms (0.00% GC)
  --------------
  samples:          219
  evals/sample:     1

Compiling derivatives for GPU
  0.264454 seconds (407.77 k allocations: 27.281 MiB, 23.07% gc time)
Testing on GPU...
BenchmarkTools.Trial: 
  memory estimate:  408.38 KiB
  allocs estimate:  611
  --------------
  minimum time:     745.951 μs (0.00% GC)
  median time:      1.922 ms (38.08% GC)
  mean time:        1.973 ms (38.86% GC)
  maximum time:     4.364 ms (25.42% GC)
  --------------
  samples:          2529
  evals/sample:     1
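For reference, a comparison like the one above can be set up roughly along these lines. This is only a sketch: `dloss` stands in for whatever derivative function XGrad compiled for you, and the array sizes are placeholders.

```julia
# Hypothetical reproduction of the CPU vs GPU comparison above.
# `dloss` is a placeholder for the derivative function compiled by XGrad;
# sizes are arbitrary. GPU timings may additionally need explicit
# synchronization to be fully accurate.
using BenchmarkTools
using CuArrays   # GPU array package at the time of this post

W, x = rand(1000, 1000), rand(1000)

# CPU run on plain Arrays
@benchmark dloss($W, $x)

# GPU run: move the same data to CuArrays and benchmark again
Wg, xg = CuArray(W), CuArray(x)
@benchmark dloss($Wg, $xg)
```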