Julia vs NumPy broadcasting

Is that really the case? I thought batched matmul functions generally had some optimizations, at least by automatically leveraging parallelism on multiple cores or a GPU. A plain for-loop or broadcasting won’t do that for us.
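
To make the comparison concrete, here is a minimal NumPy sketch of the two approaches being contrasted: an explicit loop over the batch versus a single batched call. The function names `batched_matmul_loop` and `batched_matmul` and the array shapes are made up for illustration. It only shows that both give the same result; it makes no claim about which is faster.

```python
import numpy as np

def batched_matmul_loop(a, b):
    """Multiply matching matrix pairs one at a time with a plain Python loop."""
    batch, rows, _ = a.shape
    cols = b.shape[2]
    out = np.empty((batch, rows, cols), dtype=np.result_type(a, b))
    for i in range(batch):
        out[i] = a[i] @ b[i]  # one ordinary 2-D matmul per batch element
    return out

def batched_matmul(a, b):
    """Pass the whole batch to np.matmul, which treats leading axes as batch axes."""
    return np.matmul(a, b)

# Hypothetical shapes: 8 pairs of (3x4) and (4x5) matrices.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 3, 4))
b = rng.standard_normal((8, 4, 5))

same = np.allclose(batched_matmul_loop(a, b), batched_matmul(a, b))
```

Whether the batched call actually uses multiple cores depends on the backend (for NumPy, the BLAS library it was built against), so timing both versions on the target machine is the only reliable way to settle it.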