Obviously it depends. A larger problem can expose more parallelism, but you need both that larger problem and a sophisticated implementation to exploit it.
If BLAS has an efficient, standardized batched multiply, then LinearAlgebra (or at least a package) could expose it in Julia as well, and presumably already would. Otherwise I wouldn't necessarily expect a batched multiply to do much more than internalize the obvious loop.
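To be concrete, the "obvious loop" I have in mind is roughly the following (the function names are made up for illustration); the threaded variant is about where a "sophisticated implementation" would start, assuming you pin BLAS to one thread so the two levels of threading don't fight over cores:

```julia
using LinearAlgebra

# Batches stored as 3-D arrays: A is (m, k, nbatch), B is (k, p, nbatch), C is (m, p, nbatch).
# Plain serial version: one mul! per slice into a preallocated output.
function batched_mul_loop!(C, A, B)
    @assert size(A, 3) == size(B, 3) == size(C, 3)
    for i in axes(A, 3)
        mul!(view(C, :, :, i), view(A, :, :, i), view(B, :, :, i))
    end
    return C
end

# Batch-level parallelism for many small matrices: one slice per Julia thread.
# Consider BLAS.set_num_threads(1) first, so BLAS threads don't oversubscribe the cores.
function batched_mul_threaded!(C, A, B)
    @assert size(A, 3) == size(B, 3) == size(C, 3)
    Threads.@threads for i in axes(A, 3)
        mul!(view(C, :, :, i), view(A, :, :, i), view(B, :, :, i))
    end
    return C
end
```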
A GPU is a different story, but then you're already into packages as far as Julia is concerned, and I'm not sure NumPy would use a GPU without additional configuration either. In any case, a GPU implementation needs either data that is already resident on the device or rather large matrices to be efficient.
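For the data-movement point, a sketch of what the GPU route might look like, assuming CUDA.jl for the device arrays and NNlib.jl's `batched_mul` for the 3-D case (check those packages for the actual current API):

```julia
using CUDA, NNlib  # assumed packages; not part of the standard library

A = rand(Float32, 64, 64, 4096)      # a batch of 4096 small matrices
B = rand(Float32, 64, 64, 4096)

A_d, B_d = CuArray(A), CuArray(B)    # the host-to-device copies are the expensive part:
                                     # they only pay off if the data stays on the GPU or
                                     # the problem is large enough to amortize the transfer
C_d = batched_mul(A_d, B_d)          # should hit a batched GPU kernel under the hood
C = Array(C_d)                       # copy back only when actually needed
```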
No idea where any ecosystem stands on this. The nice thing about a batched-multiply frontend is that you can swap in a faster implementation later even if it isn't clever now.
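To illustrate the frontend point: if the entry point is a generic function whose fallback method is just the loop, a faster method can be slotted in later via dispatch without touching any call sites (`my_batched_mul` is a made-up name, not an existing API):

```julia
using LinearAlgebra

# Fallback method: allocate the output and run the obvious loop.
function my_batched_mul(A::AbstractArray{T,3}, B::AbstractArray{S,3}) where {T,S}
    C = similar(A, promote_type(T, S), size(A, 1), size(B, 2), size(A, 3))
    for i in axes(A, 3)
        mul!(view(C, :, :, i), view(A, :, :, i), view(B, :, :, i))
    end
    return C
end

# Later, a cleverer backend is just another method, e.g.
#     my_batched_mul(A::SomeGPUArray{T,3}, B::SomeGPUArray{T,3}) where {T} = ...
# and every existing caller picks it up automatically.
```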