Yes, we’ve had this since 2014, from JuliaLang/julia pull request #6690, "WIP: cache oblivious linear algebra algorithms" by Jutho.
That being said, in principle you can do better at avoiding cache-associativity conflicts by doing tiling with a buffer; FFTW switched to this following Gatlin and Carter (1999). Portability to different cache sizes is less of an issue here because out-of-place matrix transposition has no temporal locality (each memory location is accessed exactly once), so it’s all about optimizing for cache lines (spatial locality).
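To make the buffered-tiling idea concrete, here is a rough sketch in Julia. It is not FFTW's code or the exact Gatlin and Carter scheme; the function name `transpose_buffered!`, the `tile` keyword, and its default of 32 are all assumptions picked for illustration. Each tile of `A` is read along contiguous columns into a small scratch buffer, and the buffer is then written out along contiguous columns of `B`, so the only strided accesses land in the buffer rather than in the big arrays.

```julia
# Sketch only: out-of-place transpose B = Aᵀ, tile by tile, via a small buffer.
# `transpose_buffered!` and `tile` are hypothetical names, not a Base or FFTW API.
function transpose_buffered!(B::Matrix{T}, A::Matrix{T}; tile::Int = 32) where {T}
    m, n = size(A)
    size(B) == (n, m) || throw(DimensionMismatch("B must be $n×$m"))
    buf = Matrix{T}(undef, tile, tile)   # scratch tile, assumed small enough to stay in cache
    for j0 in 1:tile:n, i0 in 1:tile:m
        ib = min(tile, m - i0 + 1)       # rows of A covered by this tile
        jb = min(tile, n - j0 + 1)       # columns of A covered by this tile
        # Pass 1: walk A down its columns (contiguous), storing transposed into buf.
        @inbounds for j in 1:jb, i in 1:ib
            buf[j, i] = A[i0 + i - 1, j0 + j - 1]
        end
        # Pass 2: walk buf and B down their columns (both contiguous).
        @inbounds for i in 1:ib, j in 1:jb
            B[j0 + j - 1, i0 + i - 1] = buf[j, i]
        end
    end
    return B
end

A = rand(100, 67)
B = similar(A, 67, 100)
transpose_buffered!(B, A)
```

Whether this beats the recursive cache-oblivious version will depend on the element size, the tile size, and the machine, so treat the default as a starting point to benchmark rather than a tuned value.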