There’s nothing wrong with writing a loop. Done properly, it is usually at least as fast as broadcasting. However, you’ll get better performance from the in-place multiplication mul! here, since you already have the output space allocated. Also, as you were perhaps alluding to, there’s no need to loop over s: that part can be handled by a matrix-matrix multiplication instead of matrix-vector ones.
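For concreteness, here is a small placeholder setup the snippets below can run against. The sizes, the names N, M, and S, and the random contents are assumptions for illustration only, not from the original question, and basis is taken to be square so that a can simply mirror the shape of b:
N, M, S = 4, 8, 16     # hypothetical sizes: N outer slices, M basis dimension, S columns
basis = rand(M, M)     # transform applied to every slice (assumed square here)
b = rand(N, M, S)      # input data
a = similar(b)         # preallocated output, filled in-place below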
using LinearAlgebra
# Multiply each slice along the first dimension of b by basis,
# writing the result into the preallocated a without extra allocations.
for n in axes(a, 1)
    @views mul!(a[n, :, :], basis, b[n, :, :])
end
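As a quick sanity check of the claim that a single matrix-matrix product covers the loop over s (using the placeholder arrays above), the slice-wise product matches doing one matrix-vector product per column:
n = 1
full = basis * b[n, :, :]                                       # one M×S matrix-matrix product
cols = reduce(hcat, [basis * b[n, :, s] for s in axes(b, 3)])   # S matrix-vector products
full ≈ cols                                                     # true, up to floating-point roundoff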
I removed @inbounds because nothing here checks that axes(a,1) == axes(b,1), and because it is unlikely to change performance significantly anyway.
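If you do want @inbounds back, one option (a sketch, not from the original post) is to make that assumption explicit with a check before the loop:
using LinearAlgebra
# Verify the shared first axis up front so @inbounds on the views is justified.
axes(a, 1) == axes(b, 1) || throw(DimensionMismatch("a and b must share their first axis"))
@inbounds for n in axes(a, 1)
    @views mul!(a[n, :, :], basis, b[n, :, :])
end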
You can write the above via a somewhat arcane broadcast:
using LinearAlgebra
# Same computation as the loop: broadcast mul! over matching slices,
# with Ref(basis) keeping basis out of the broadcast.
mul!.(eachslice(a; dims=1), Ref(basis), eachslice(b; dims=1))
That said, I wouldn’t say this is easier to read than the loop, and I don’t expect it to perform noticeably differently. Note that I have broadcasted mul! by writing mul!., sliced the arrays by iterating over the first dimension with eachslice, and wrapped basis in Ref so the broadcast does not iterate over it.
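If you want to convince yourself the two spellings agree, here is a small check (again using the placeholder arrays from above) that the broadcast fills its output with the same values as the loop:
using LinearAlgebra
a_loop = similar(b)
for n in axes(a_loop, 1)
    @views mul!(a_loop[n, :, :], basis, b[n, :, :])
end
a_bcast = similar(b)
mul!.(eachslice(a_bcast; dims=1), Ref(basis), eachslice(b; dims=1))
a_loop ≈ a_bcast   # true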
Yep, of course matrix-matrix multiplication already does this… my bad.
And thanks for the insightful answer and the performance tips on using mul!