Julia vs NumPy broadcasting

I’m trying to compute the step response of a linear system with four degrees of freedom in Julia.
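
Concretely, the quantity computed is the step response x(t) = A^{-1} (e^{A t} - I) B F of \dot{x} = A x + B F (starting from x(0) = 0), evaluated at t = 1, \dots, n via the eigendecomposition A = V D V^{-1}.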

This can be done reasonably efficiently in Python using NumPy broadcasting:

import numpy as np

def step_analytical(A, B, F, n):
    # Eigendecomposition: A = V D V_inv
    eigvals, V = np.linalg.eig(A)
    V_inv = np.linalg.inv(V)
    # Precompute constant term
    BF_transformed = V_inv @ (B * F) # (4, 1)
    # Time steps (n,)
    t = np.arange(1, n + 1)
    # Compute exp(D * t) for all t: shape (n, 4)
    exp_diag = np.exp(np.outer(t, eigvals)) # (n, 4)
    # Compute (exp - 1) / lambda
    diag_terms = (exp_diag - 1) / eigvals # (n, 4)
    # Scale V columns by diag_terms
    scaled_V = V[None, :, :] * diag_terms[:, None, :] # (n, 4, 4)
    # Multiply by BF in transformed space
    x = (scaled_V @ BF_transformed).squeeze(-1) # (n, 4)
    return x

# Dynamics matrix A
A = np.array([[-0.19475053,  0.09852727,  0.        ,  0.        ],
              [ 0.01789924, -0.04116155,  0.02326231,  0.        ],
              [ 0.        ,  0.01585978, -0.01863287,  0.00277309],
              [ 0.        ,  0.        ,  0.00061425, -0.00061425]])

# Input matrix B
B = np.array([[0.12675659],
              [0.        ],
              [0.        ],
              [0.        ]])

# Constant forcing term F
F = np.float64(5.156003293238043)

# Number of time steps n
n = 5000

Here is my attempt at a Julia implementation:

using LinearAlgebra

function step_reference(A, B, F, n)
    # Eigendecomposition: A = V D V_inv
    λ, V = eigen(A)
    V_inv = inv(V)
    # Precompute constant term
    BF_transformed = V_inv * (B * F) # (4, 1)
    # Time steps (n,)
    t = 1:n
    # Compute exp(D * t) for all t
    exp_diag = exp.(λ * t') # (4, n)
    # Compute (exp - 1) / λ
    diag_terms = (exp_diag .- 1) ./ λ # (4, n)
    # Scale V columns by diag_terms: shape (4, 4, n)
    V_scaled = reshape(V', 4, 4, 1) .* reshape(diag_terms, 4, 1, n)
    # Multiply by BF in transformed space
    x = zeros(eltype(λ), 4, n)
    for j in 1:n
        x[:, j] = V_scaled[:, :, j]' * BF_transformed
    end
    return x
end

# Dynamics matrix A
A = [
    -0.19475053  0.09852727  0.0        0.0;
     0.01789924 -0.04116155  0.02326231 0.0;
     0.0        0.01585978  -0.01863287 0.00277309;
     0.0        0.0          0.00061425 -0.00061425
]

# Input matrix B
B = [
    0.12675659;
    0.0;
    0.0;
    0.0
]

# Constant forcing term F
F = 5.156003293238043

# Number of time steps n
n = 5000

I’ve reimplemented most of the function using broadcasting but I’m stuck at the final step, where I resort to a loop.

Unlike in NumPy, Julia doesn’t seem to support batched matrix multiplication without additional packages, which leads me to believe that what I’m trying to do is not idiomatic Julia. I notice that my Julia code here is also running slower than NumPy, possibly due to the large number of allocations.
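
For example, I believe NNlib.jl provides a slice-wise batched_mul, roughly along these lines:

using NNlib  # third-party; batched_mul multiplies 3-d arrays slice by slice

A3 = rand(4, 4, 1000)
B3 = rand(4, 1, 1000)
C3 = batched_mul(A3, B3)  # (4, 1, 1000): C3[:, :, k] == A3[:, :, k] * B3[:, :, k]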

Any hints on how to improve this would be much appreciated.

You might get better answers if you provide dummy data for the function inputs.

Two notes:

  • unlike NumPy, slicing in Julia copies; you may want to use views instead
  • since you already have the output array allocated, you probably want to use LinearAlgebra.mul! to do the multiplication in place.

These two changes will bring allocations in the loop to zero (see the sketch below).
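
Applied to your loop, that might look something like this (a sketch, untested):

for j in 1:n
    # view into column j of x; adjoint of a view of V_scaled; multiply in place
    @views mul!(x[:, j], V_scaled[:, :, j]', BF_transformed)
end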

7 Likes

Thank you for the suggestion. I have added some input data.

1 Like

Since your arrays have dimension 4, everything will be much faster if you use StaticArrays.jl
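
For example, with the inputs defined above (As and Bs are just illustrative names):

using StaticArrays

As = SMatrix{4,4}(A)  # 4×4 static matrix; the size is part of the type
Bs = SVector{4}(B)    # length-4 static vector, stack-allocated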

5 Likes

Thank you everyone for your responses. Having now run both the Julia and Python code with the same inputs, I find that there is a bug somewhere in my Julia translation! Once I find it I will try out your suggestions…

Loops are avoided (where possible) in Python because they are very slow. But loops in Julia are fast (just as fast as broadcasting), and you should expect a loop to perform similarly to a batched multiply (maybe marginally worse if the batched implementation is being clever).

In fact, your code would be more idiomatic if you did less batching and more looping!

function step_reference2(A, B, F, n)
    # Eigendecomposition: A = V Diagonal(λ) V_inv
    λ, V = eigen(A)
    # Precompute constant term
    BF_transformed = V \ (B * F)
    x = zeros(eltype(λ), size(V, 1), n)
    for j in 1:n
        # this matches what you originally wrote but appears to be wrong, scaling the wrong dimension of V
        # @views mul!(x[:, j], V, BF_transformed) # x[:,j] = V * BF_transformed
        # @views x[:, j] .*= expm1.(λ .* j) ./ λ # apply the scaling
        # what I suspect you intended
        y = expm1.(λ .* j) ./ λ .* BF_transformed # apply the scaling
        @views mul!(x[:, j], V, y) # x[:,j] = V * y
    end
    return x
end

This should have similar (or slightly better) performance and (I would say) is simpler than the Python. Note that I suspect a mistake in your original implementation (at least the Julia version) and provided an alternative. In the non-batched version, it is much harder to make the “scaling the wrong dimension” mistake that probably originally appeared in the batched version.

Also, expm1 (available in both Julia and NumPy) can be much more accurate than exp(x)-1.
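
For example, near zero:

julia> exp(1e-10) - 1   # catastrophic cancellation when subtracting 1
1.000000082740371e-10

julia> expm1(1e-10)     # correct to full precision
1.00000000005e-10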

4 Likes

Thank you! I added a couple of transposes to my original code to correct the scaling.

I think you can use mapslices instead of the loop.

And I’m seconding @gdalle on using StaticArrays.jl for arrays of known dimension:

function step_analytical(A::AbstractMatrix, B::AbstractArray, F::Number, n::Integer)
    # Eigendecomposition: A = V D V_inv
    eigvals, V = eigen(A)
    # Precompute constant term; equivalent to inv(V) * (B .* F)
    BF_transformed = V \ (B .* F) # (4,)
    # Time steps (n,)
    t = 1:n
    # Compute (exp - 1) / lambda; expm1 is more accurate than exp(x) - 1
    diag_terms = expm1.(t' .* eigvals) ./ eigvals # (4, n)
    # Scale V columns by diag_terms
    V_scaled = reshape(V, 4, 4, :) .* reshape(diag_terms, 1, 4, n)
    # Multiply by BF in transformed space
    x = mapslices(V_scaled; dims=(1, 2)) do v # (4, 1, n)
        v * BF_transformed
    end
    return x
end

For static arrays:

using LinearAlgebra
using StaticArrays

function step_analytical(A::SMatrix, B::SArray, F::Number, n::Integer)
    # Eigendecomposition: A = V D V_inv
    eigvals, V = eigen(A)
    # Precompute constant term
    BF_transformed = V \ (B * F) # (4, 1), must be the same as inv(V) * (B .* F)
    # Compute (exp - 1) / lambda
    diag_terms = [expm1.(t .* eigvals) ./ eigvals for t in 1:n] # (4, n) Must be more numerically stable than exp(x) - 1
    # Scale V columns by diag_terms
    V_scaled = V .* Diagonal.(diag_terms)
    # Multiply by BF in transformed space
    x = V_scaled .* BF_transformed
    return x
end
1 Like

Is that really the case? I thought batched matmul functions generally had some optimizations, at least by automatically leveraging parallelism on multiple cores or a GPU. A plain for-loop or broadcasting won’t do that for us.

Obviously it depends. Better parallelism is possible on a larger problem, but you need both that larger problem and a sophisticated implementation to exploit it.

If BLAS has an efficient and standardized batched multiply, then LinearAlgebra (or at least a package) could expose it in Julia as well (and already would, presumably). But otherwise I wouldn’t necessarily expect batched multiplies to do much more than internalize the obvious loop.

A GPU is a different story, but then you’re already into packages as far as Julia is concerned. And I’m not sure that NumPy would use a GPU without additional configuration either. And a GPU implementation needs either pre-loaded GPU data or a rather large matrix to be efficient in any case.

No idea where any ecosystem stands on this. The nice thing about a batched multiply frontend is that you can change it to a faster implementation later even if it’s not clever now.

2 Likes

I think Reactant.jl can optimize batched matmul, @avikpal?
Although I expect StaticArrays.jl plus broadcasting will be much faster than the slicing, and possibly amenable to Reactant.jl optimization too.

Do we need batched matrix multiplication? At least if @mikmoore’s version is correct, the loop is doing repeated matrix-vector multiplication, x[a, j] = V[a,b] * y_j[b] (summing over b), in a loop over j; the batched version of this is a single matrix-matrix multiplication, x[a, j] = V[a,b] * ys[b, j]. For me that’s 20x quicker:

function step_reference3(A::Matrix, B::Vector, F::Real, n::Int)  # my adaptation of @mikmoore's version
    λ, V = eigen(A)
    BF_transformed = V \ (B * F)  # 4-element Vector{Float64}
    ys = expm1.(λ .* (1:n)') ./ λ .* BF_transformed
    x = V * ys  # x[a, j] = V[a,b] * ys[b, j]
end

@btime step_reference($A, $B, $F, $n)  #  2.602 ms (58011 allocations: 3.29 MiB)
@btime step_reference2($A, $B, $F, $n)  # 3.401 ms (45033 allocations: 2.22 MiB)
@btime step_reference3($A, $B, $F, $n)  # 155.708 μs (41 allocations: 326.47 KiB)

While loops are quick in Julia, loops which allocate small vectors are expensive. (The above SMatrix version didn’t run for me, on a quick attempt.)

5 Likes

(I’m less worried about using expm1 here and more worried that you are using diagonalization (eigenvectors) to compute matrix exponentials of a non-Hermitian matrix. This can be catastrophically inaccurate if the matrix happens to be nearly defective.)

Note that StaticArrays.jl doesn’t implement a fast eigen for the non-Hermitian case, but it does have a fast exp. Maybe compute E = exp(A) once and then compute each power E^k by iterative multiplication?

5 Likes

Below is a version that (a) uses exp(A) rather than trying to diagonalize A (which is dangerous), and (b) uses StaticArrays in an allocation-free loop. On my computer, it is 3–4x faster than @mcabbott’s step_reference3 above:

using LinearAlgebra, StaticArrays

function step_sgj(A::SMatrix, B::SArray, F::Real, n::Integer)
    eᴬ = exp(A)
    C = A \ B * F
    result = Vector{typeof(C)}(undef, n)
    eᴬᵏ = eᴬ
    result[1] = (eᴬᵏ - I) * C
    for k = 2:n
        eᴬᵏ *= eᴬ
        result[k] = (eᴬᵏ - I) * C
    end
    return result
end

result = step_sgj(SMatrix{4,4}(A), SVector{4}(B), F, n)
@show stack(result) ≈ step_reference3(A, B, F, n)

@btime step_reference3($A, $B, $F, $n);
@btime step_sgj($(SMatrix{4,4}(A)), $(SVector{4}(B)), $F, $n);

prints:

stack(result) ≈ step_reference3(A, B, F, n) = true

112.666 μs (41 allocations: 326.47 KiB)
 32.958 μs (3 allocations: 160.06 KiB)

(In addition to being faster, I find it a lot easier to decipher what the code is actually doing in this form, rather than the “vectorized” form.)

10 Likes

If one is going to take this implementation seriously, one should also avoid assuming A is nonsingular by doing something like

"""
expm1o(X)

Compute `(exp(X) - I) / X` safely (even when `X` is singular).
"""
function expm1o(X::AbstractMatrix)
    i = size(X, 1)
    return exp([zero(X); zero(X);; one(X); X])[begin:i, begin+i:begin+2i-1]
end
expm1o(X::Number) = only(expm1o([X;;])) # obviously one can do better, just here for completeness

function step_mikm(A, B, F, n)
    BF = B * F
    x = map(j -> j * (expm1o(A*j) * BF), 1:n)
    return x
    # or reduce(hcat, x) if one wants this as a matrix
    # or just build the matrix that way
end
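
If I haven’t messed something up, this should agree with the earlier versions (untested):

x = step_mikm(A, B, F, n)
@show reduce(hcat, x) ≈ step_reference3(A, B, F, n)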

One could make a nicer expm1o that uses fewer allocations (especially given an SMatrix), but that’s an exercise for the reader. And while there exist more efficient algorithms (that avoid using a dimension-doubled matrix), they’re as tedious as writing your own exp function, so may not be worth the effort.

1 Like

I’m guessing that the OP has physical grounds to assume that A is not singular (or close to singular), as this was already assumed by the original code.

I agree that this assumption is not mathematically necessary for the calculation being performed here, but it sure complicates the algorithm (and makes it a lot more expensive, since you now are calling exp or its equivalent n times, unless there is some clever way around this). So it would be good to exploit it if the nonsingular property is present in the underlying problem.

(Whereas diagonalizability for a non-Hermitian matrix seems much harder to know on physical grounds, and in any case calling exp(A) instead of diagonalizing results in a faster and simpler algorithm.)

Still, that’s a nice trick for (e^X - I) / X; is there a reference for this?

I don’t have a reference handy, sorry. EDIT: Yes, it’s (1.1) in this paper. I guess you could say this is a simpler version of some of the stuff done there.

I’ve run into this equation before when I had a linear differential equation with a matrix shaped like that augmented one (e.g., \ddot{x} = A \dot{x}). In my particular case I definitely had to handle singular matrices.

One can recognize it as having the same Taylor series as the exponential, with the zeroth-order coefficient removed and the other coefficients shifted down by one order (see the series below). When I needed it, I implemented it with a calculation basically mirroring the matrix exponential (the doubling step takes a touch more effort). Done that way, it has a very similar cost. I briefly considered making a library for this family of functions but, apart from speed, there isn’t an obvious advantage over the augmented-matrix exponential method. And since I couldn’t find such a library in any language (or even much discussion of this family of functions), it seemed like there wasn’t enough demand to be worth the maintenance.
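
Explicitly, (e^X - I) X^{-1} = \sum_{k=0}^{\infty} X^k / (k+1)! = I + X/2! + X^2/3! + \cdots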

Yes, physically A cannot be singular here for reasonable parameter values. However, as part of a larger optimisation problem, it’s possible that the code could be run with unfortunate inputs such that A is nearly singular. I was using some regularization on eigenvalues of A for this reason (also to keep them from getting too close to each other).

Is this approach of repeatedly multiplying by e^A recommended when computing e^{At} for integer t? I imagine you would get a build-up of rounding errors.

Reactant, or at least the artifact it needs, is currently Linux only.