Zygote is a source-to-source AD, so instead of getting back an explicit graph like you would in TensorFlow, it compiles a function that returns the gradients. After the first compilation, subsequent calls should be able to use the cached version of this function (and the generated functions it calls). You can see this effect by @timeing the first call to gradient or using @benchmark (which reports max times) instead of just @btime.
1 Like