OK, I re-ran some models with different AD backends. I upped the sample size to 60 (still with three dimensions) and ran 1000 iterations with NUTS.
ReverseDiff (no rdcache): 3422 seconds (one run)
ReverseDiff (with rdcache): 730-760 seconds (two runs)
ForwardDiff: 339-348 seconds (two runs)
Zygote: 138-147 seconds (two runs)
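So Zygote is by far the fastest (roughly 2.3-2.5x faster than ForwardDiff and about 5-5.5x faster than ReverseDiff with rdcache), and standard ReverseDiff (with no memoization) is… not: its single run took about 23-25x as long as the Zygote runs.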