Loops are avoided (when possible) in Python because they are very slow. Loops in Julia, however, are fast (just as fast as broadcasting), and you should expect performance similar to a batched multiply (maybe marginally worse if the batched routine is being clever).
In fact, your code would be more idiomatic if you did less batching and more looping!
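To illustrate the point, here is a toy comparison (purely illustrative; when benchmarking for real, wrap both versions in functions so the loop isn't running in global scope):

```julia
x = randn(1000)

# broadcast version
y_broadcast = x .^ 2 .+ 1

# loop version: in Julia this compiles to essentially the same machine code
y_loop = similar(x)
for i in eachindex(x)
    y_loop[i] = x[i]^2 + 1
end

@assert y_loop == y_broadcast
```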
```julia
using LinearAlgebra  # for eigen, mul!

function step_reference2(A, B, F, n)
    # Eigendecomposition: A = V * Diagonal(λ) * inv(V)
    λ, V = eigen(A)
    # Precompute the constant term
    BF_transformed = V \ (B * F)
    x = zeros(eltype(λ), size(V, 1), n)
    for j in 1:n
        # This matches what you originally wrote, but appears to be wrong:
        # it scales along the wrong dimension of V.
        # @views mul!(x[:, j], V, BF_transformed)  # x[:,j] = V * BF_transformed
        # @views x[:, j] .*= expm1.(λ .* j) ./ λ   # apply the scaling
        # What I suspect you intended:
        y = expm1.(λ .* j) ./ λ .* BF_transformed  # apply the scaling first
        @views mul!(x[:, j], V, y)                 # then x[:,j] = V * y
    end
    return x
end
```
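A minimal usage sketch (the system matrices here are made up for illustration; I pass `F` as a vector so that `B * F` is a vector, which keeps `mul!` on the fast `gemv` path):

```julia
A = [-1.0  0.2;
      0.1 -2.0]        # hypothetical stable system matrix (real eigenvalues)
B = [1.0 0.0;
     0.0 1.0]
F = [0.3, 0.5]

x = step_reference2(A, B, F, 5)
@assert size(x) == (2, 5)
```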
This should have similar (or slightly better) performance and is, I would say, simpler than the Python version. Note that I suspect a mistake in your original implementation (at least the Julia version) and have provided an alternative. In the non-batched version it is much harder to make the "scaling the wrong dimension" mistake that probably crept into the batched version.
Also, expm1(x) (available in both Julia and NumPy) can be much more accurate than exp(x) - 1 when x is close to zero.
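The issue is catastrophic cancellation: for tiny x, exp(x) rounds to a value very close to 1, and subtracting 1 wipes out most significant digits, while expm1 computes the difference directly. A quick sketch:

```julia
x = 1e-12
naive    = exp(x) - 1  # loses many significant digits to cancellation
accurate = expm1(x)    # ≈ x + x^2/2, accurate to full Float64 precision

# the naive version is visibly off; expm1 agrees with x to machine precision
@assert abs(accurate - x) < abs(naive - x)
```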