Mostly. Just broadcasting won't work as well as `vmap`, however, because some of the operations being broadcasted are already vectorized (e.g. BLAS). `vmap` will actually modify those calls (using dispatch in PyTorch and source code transforms in JAX) to use batched implementations whenever it encounters them.
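To make the difference concrete, here is a minimal sketch (the sizes and names are just illustrative): broadcasting a dense layer over a vector of samples issues one small matrix-vector product per sample, while a single batched call lets BLAS do one big matrix-matrix product. Plain broadcasting can't rewrite the per-sample calls the way `vmap` does.

```julia
using Flux

layer = Dense(128 => 64)
samples = [rand(Float32, 128) for _ in 1:256]  # 256 individual inputs
X = reduce(hcat, samples)                      # the same data as a 128×256 batch

ys_broadcast = layer.(samples)  # 256 separate matrix-vector (gemv-sized) calls
Y_batched = layer(X)            # one matrix-matrix (gemm) call over the whole batch

# The results agree column-for-column; only the batched call exploits the
# vectorized BLAS implementation.
@assert all(ys_broadcast[i] ≈ Y_batched[:, i] for i in eachindex(samples))
```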
Now that said, some functions and Flux layers are flexible enough to already work for this without a `vmap`-like treatment. See this topic posted about a month ago: Flux loss with contribution gradient is slow - #5 by Jonas208. Basically, changing your loss function to compute a loss for each sample individually and then summing should be enough for an MLP. In fact, `mean_batch_grad` and `map_grad` currently return the exact same gradients because `sum(map(x -> sum(model(x)), xs)) == sum(model(xs))`!
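A hedged sketch of that last point (`mean_batch_grad`/`map_grad` refer to the linked thread; the helpers below are just stand-ins I made up for illustration): for an MLP, summing per-sample losses gives the same value, and therefore the same gradients, as taking the loss over the whole batch at once.

```julia
using Flux

model = Chain(Dense(4 => 8, relu), Dense(8 => 1))
xs = [rand(Float32, 4) for _ in 1:32]  # individual samples
X = reduce(hcat, xs)                   # the same samples as one 4×32 batch

per_sample_loss(m) = sum(map(x -> sum(m(x)), xs))  # loss per sample, then summed
batched_loss(m) = sum(m(X))                        # loss over the whole batch

@assert per_sample_loss(model) ≈ batched_loss(model)

# The gradients agree as well, so the per-sample formulation is a drop-in change.
g_per_sample = Flux.gradient(per_sample_loss, model)
g_batched = Flux.gradient(batched_loss, model)
```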