I haven’t checked this carefully, but I think that re-writing this so it makes no slices at all can get us to about 2 ms:
julia> function loss(
           Wmid, Wctx,
           tok_mid, tok_ctx, x
       )
           # gather all the needed rows in one indexing operation, no per-pair slices
           tmp = sum(Wmid[tok_mid, :] .* Wctx[tok_ctx, :]; dims=2) |> vec
           -mean(@. x * log(logistic_sigmoid(tmp)) + (1 - x) * log(1 - logistic_sigmoid(tmp)))
       end
loss (generic function with 1 method)
julia> grad = @btime train(Random.Xoshiro(5), 100277, 100, 2);
1.892 ms (274 allocations: 4.62 MiB)
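As an aside (my own untested sketch, not part of the benchmark above): log(logistic_sigmoid(tmp)) underflows to -Inf when tmp is a large negative number, and the same cross-entropy can be written without any log of a sigmoid via the identity -x*log(σ(t)) - (1-x)*log(1-σ(t)) = softplus(t) - x*t, where softplus(t) = log(1 + exp(t)). The names softplus and loss_stable here are mine:

using Statistics: mean  # already in scope in the session above

softplus(t) = max(t, zero(t)) + log1p(exp(-abs(t)))  # overflow-safe log(1 + exp(t))

function loss_stable(Wmid, Wctx, tok_mid, tok_ctx, x)
    tmp = sum(Wmid[tok_mid, :] .* Wctx[tok_ctx, :]; dims=2) |> vec
    # algebraically equal to the loss above, but never takes log(0)
    mean(@. softplus(tmp) - x * tmp)
end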
julia> using Tullio # first way I thought of
julia> function loss(
           Wmid, Wctx,
           tok_mid, tok_ctx, x
       )
           @tullio tmp[k] := Wmid[tok_mid[k], c] * Wctx[tok_ctx[k], c] # sum over c
           -mean(@. x * log(logistic_sigmoid(tmp)) + (1 - x) * log(1 - logistic_sigmoid(tmp))) # sum over k
       end
loss (generic function with 1 method)
julia> grad = @btime train(Random.Xoshiro(5), 100277, 100, 2);
1.916 ms (348 allocations: 4.62 MiB)
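For anyone new to Tullio: an index that appears only on the right-hand side, like c above, is summed over implicitly. A toy illustration of the convention (mine, not from the benchmark):

using Tullio

A = [1 2; 3 4]
@tullio rowsum[i] := A[i, j]  # j appears only on the right, so it is summed: rowsum == [3, 7]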
All of these definitions of loss give me a zero gradient, so there could be mistakes.
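If someone wants to chase the zero gradients, here is a rough sanity check I would try (untested sketch, with loss as defined above and logistic_sigmoid assumed to be the usual 1/(1+exp(-t)) from earlier in the thread): perturb one weight that the batch actually touches and compare a central finite difference against the matching entry of the AD gradient.

using Random, Statistics

logistic_sigmoid(t) = 1 / (1 + exp(-t))  # assumed to match the definition used above

rng = Random.Xoshiro(5)
Wmid, Wctx = randn(rng, 10, 4), randn(rng, 10, 4)
tok_mid, tok_ctx = rand(rng, 1:10, 8), rand(rng, 1:10, 8)
x = Float64.(rand(rng, Bool, 8))

h = 1e-6
i = tok_mid[1]                 # perturb a row the batch actually uses
Wp = copy(Wmid); Wp[i, 1] += h
Wm = copy(Wmid); Wm[i, 1] -= h
fd = (loss(Wp, Wctx, tok_mid, tok_ctx, x) - loss(Wm, Wctx, tok_mid, tok_ctx, x)) / (2h)
# if fd is clearly nonzero while the AD gradient entry at [i, 1] is 0, the AD call is suspect;
# if fd is also ~0, the loss really is flat in that coordinate and the gradient is honest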