TF Eager, PyTorch, and older Julia tools (Tracker, ReverseDiff, etc.) use a tape. Tapes are a bit slow because they add overhead to the forward pass, and they cannot optimize the backwards pass very well because the tape can change from call to call (every new input value can give a new tape). Tensorflow 1.x and Jax work on quasi-static codes, i.e. codes that can be expanded into a static compute graph (not surprising that they share this limitation, given they are both built heavily on the same XLA IR). This is very fast, but it limits the kinds of codes that can be handled, or at least optimized: fully dynamic codes fall outside that domain.
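For concreteness, here is a small hypothetical sketch (the function and constants are made up purely for illustration) of what "fully dynamic" code looks like: which branch runs and how many loop iterations execute both depend on runtime values, so there is no single static graph to expand into, and a tape only records the one path taken on a given call.

```julia
# Hypothetical example of fully dynamic code: both the branch taken and the
# number of loop iterations depend on the runtime value of `x`, so the
# computation cannot be expanded into one fixed compute graph ahead of time,
# and a tape records only the single path executed on a given forward pass.
function dynamic(x)
    y = x > 0 ? sin(x) : x^2     # data-dependent branch
    while y < 1.0                # value-dependent trip count
        y += 0.3
    end
    return y
end
```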
Newer Julia tools (Zygote, Diffractor, Enzyme, etc.) work directly on the lowered code, doing a source-to-source transformation. Unlike a tape, this generated source contains all of the branches, meaning you can fully JIT compile and optimize it without much issue (it won’t change for different forward values). And unlike the XLA-based tools, it can fully optimize code that includes value-dependent while loops (yes, Jax can represent some of these things, but notice that its JIT compilation and/or reverse-mode AD passes are not compatible with these features, because they are beyond the quasi-static domain it generally allows). These tools do have a harder time optimizing some operations as much as XLA because of the much wider program space they tackle, but they get to rely on the Julia compiler itself and its optimization passes, which are steadily improving over time.
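As a rough illustration of what the source-to-source approach buys you, a sketch along these lines (the particular function is an assumption, chosen only because its loop trip count depends on the input) can be differentiated directly with Zygote:

```julia
using Zygote

# The number of `while` iterations depends on the runtime value of `x`,
# yet the source-to-source transform produces a pullback that covers all
# branches and handles whichever path the forward pass actually takes.
function f(x)
    y = x
    while y < 100.0          # value-dependent trip count
        y = y^2 + 1.0
    end
    return y
end

Zygote.gradient(f, 1.1)      # returns a 1-tuple containing df/dx
```

Because the pullback is ordinary Julia code containing the same control flow, it can be compiled and optimized once rather than re-traced for every new input value.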
Are there cases which thread this needle, where PyTorch, Jax, and Tensorflow wouldn’t optimize the method but the Julia tools would? I wrote a blog post detailing one such architecture: Useful Algorithms That Are Not Optimized By Jax, PyTorch, or Tensorflow - Stochastic Lifestyle. But more importantly, once you see one example it’s pretty clear what kinds of algorithms will have that property.
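To give a flavor of that property (this is only a minimal sketch in the same spirit, not the architecture from the post; the map `g` and the tolerance are arbitrary choices for illustration), consider an iterative solver that runs until a runtime error estimate drops below a tolerance, so the amount of work is itself decided by the values being computed:

```julia
using Zygote

# Sketch of an iterate-to-tolerance algorithm: the stopping criterion is
# value-dependent, so the computation performed differs from input to input.
function fixed_point(p; tol = 1e-10)
    g(u) = cos(p * u)            # contraction map parameterized by p
    u, resid = 0.0, Inf
    while resid > tol            # stop only when the runtime residual is small
        unew = g(u)
        resid = abs(unew - u)
        u = unew
    end
    return u
end

# Differentiate the converged result with respect to the parameter by
# differentiating straight through the adaptive loop.
Zygote.gradient(p -> fixed_point(p), 0.5)
```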