In-place orthogonal/unitary transforms?

Say I want to perform a unitary/orthogonal transformation of an operator:

U^\dagger A U

Is there a way to do this without allocating a new array?

I see that there is a method for lmul!(A,B) that will do AB and place the elements into B, but there are strict requirements on the shape of A. Specifically, the documentation says:

Diagonal , UpperTriangular or LowerTriangular , or of some orthogonal type, see QR .

In my work this is a very common type of operation so I figured there would be some built-in for it. Maybe there is a work around that I am missing for arbitrary-shaped U.

1 Like

Is it even possible to do this in-place? I’d think you’d need a buffer…

1 Like

You need at least one buffer, but if you’re doing this many times, you can just re-use that buffer each time.


That said, I would look into whether or not your unitary transformation really needs to be represented with matrix multiplication. That’s the most general, but usually least efficient way to do it. For instance, many unitary transformations are really just simple permutations that can be done very efficiently without matrix multiplication.


Thanks for the quick replies. I currently have it implemented with a buffer, I was just wondering if there was some work-around I was missing because of the symmetry of the transform and the unitary property. These replies seem to answer my question for now.

@Mason I see your background is condensed matter. These unitaries are generated via a solution to a time-dependent Schrodinger equation with a time-dependent Hamiltonian, and thus are time-dependent themselves. So unfortunately, they aren’t trivial enough to work out by hand usually.

1 Like

Hm, I see. In that case, I really doubt the point of this exercise. Finding the unitary matrices themselves should absolutely dominate the time it takes to do the matrix multiplications in most cases, and the time to allocate the buffers (instead of re-using them) is also dwarfed by the time it takes to do the multiplications themselves.

julia> using LinearAlgebra, BenchmarkTools

julia> function fmul!(C, D, A, U)
           mul!(D, A, U)
           mul!(C, U', D)

julia> fmul!(C, A, U) = mul!(C, U'A, U);

julia> fmul(A, U) = U'A*U;
julia> foreach((10, 100, 500)) do n
           @show n
           A = randn(n, n)
           A = A + A'
           U = @btime eigvecs($A)
           C = similar(U)
           D = similar(U)
           @btime fmul($A, $U)
           @btime fmul!($C, $A, $U)
           @btime fmul!($C, $D, $A, $U)
n = 10
  24.530 μs (11 allocations: 6.88 KiB)
  729.308 ns (2 allocations: 1.75 KiB)
  694.701 ns (1 allocation: 896 bytes)
  643.359 ns (0 allocations: 0 bytes)

n = 100
  1.378 ms (14 allocations: 272.09 KiB)
  84.021 μs (4 allocations: 156.41 KiB)
  83.221 μs (2 allocations: 78.20 KiB)
  81.621 μs (0 allocations: 0 bytes)

n = 500
  29.352 ms (14 allocations: 5.90 MiB)
  4.042 ms (4 allocations: 3.81 MiB)
  4.051 ms (2 allocations: 1.91 MiB)
  4.008 ms (0 allocations: 0 bytes)

So finding U is taking orders of magnitude longer than the matmuls, and the matmuls are taking orders of magnitude longer than the allocation of the memory buffers. I suspect that optimizing this is therefore not worth your time.

1 Like

If you really want to, you can do it in place if you have your unitary in the form of Householder reflections, i.e. the output of the Q matrix of a QR decomposition. Hence, you could qr! your unitary (which does some extra allocations for work space; you can reduce allocations but sacrifice performance by using the unblocked lapack versions which require less work space). The R should then be a diagonal matrix, so you would need to write your own interface to extract the diagonal elements of R and multiplying A with R and R’ from left and right without allocating; then you could use rmul!(lmul!(Q, A), Q').

But as stated by others, this is likely not going to be more efficient nor worth optimizing.


Representing the propagator as an explicit dense matrix doesn’t sound scalable (in 3d the matrices will be too big to store). You really don’t want to be dealing with dense matrices for PDE problems. Or is this just the propagator projected onto a small subspace of functions?

Can’t you represent the propagator implicitly (i.e. as a black box applicable to any given vector) as the solution to the initial value problem, e.g. solved by the method of lines with DifferentialEquations or using some other time-marching scheme (depending on your spatial basis or discretization)? Or rethink your approach at a higher level, depending on what problem you are ultimately solving.

1 Like

The problem we are solving is one in quantum optimal control and quantum computing. In this situation we need the full unitary operators (propagators) explicitly with very low errors in order to accurately assess the fidelity of the operators implemented. So in this case we really have to treat these unitaries exactly because it is not their action on a particular vector or subset of vectors that is of interest, but the unitaries themselves.

We are not generally considering a spatial quantum mechanical problem, but a finite dimensional one (usually the most relevant subspace of some infinite dimensional space). Specifically, we are considering multiple qubits (or the physical objects that represent qubits) and thus the size of this subspace scales exponentially with the number of qubits.

There really isn’t a general “scalable” approach to these problems (without using a quantum computer). Like Mason said, at this point the transformations are taking less time than the actual SE solver so I won’t worry too much about it for now.

The reason I asked this question was because I was curious if this type of operation was considered standard enough for there to be a base implementation.


Yeah, this is something I’ve thought a little bit about. I had hoped that Tullio.jl plus LoopVectorization.jl could derive a sufficiently clever custom kernel automatically that it’d beat BLAS for this, but that hope unfortunately did not materialize:

julia> using LinearAlgebra, BenchmarkTools, Tullio, LoopVectorization

julia> function fmul!(C, D, A, U)
           mul!(D, A, U)
           mul!(C, U', D)

julia> ftullio!(C, A, U) = @tullio C[i, j] = U'[i, k] * A[k, l] * U[l, j];

julia> function favx!(C, A, U)
           n = size(C, 1)
           @assert (n, n) == size(C) == size(A) == size(U)
           ax = axes(C, 1)
           @avx for i ∈ ax, j ∈ ax
               Cij = zero(eltype(C))
               for k ∈ ax, l ∈ ax
                   Cij += conj(U[k, i]) * A[k, l] * U[l, j]
               C[i, j] = Cij
favx! (generic function with 1 method)
julia> foreach((5, 10, 20, 50, 100, 500)) do n
           @show n
           A = randn(n, n); A = A + A'
           U = eigvecs(A)
           C = similar(U); D = similar(U)
           @btime ftullio!($C, $A, $U)
           @btime favx!($C, $A, $U)
           @btime fmul!($C, $D, $A, $U)
n = 5
  199.142 ns (0 allocations: 0 bytes)
  197.363 ns (0 allocations: 0 bytes)
  335.792 ns (0 allocations: 0 bytes)

n = 10
  1.406 μs (0 allocations: 0 bytes)
  1.409 μs (0 allocations: 0 bytes)
  655.056 ns (0 allocations: 0 bytes)

n = 20
  17.910 μs (0 allocations: 0 bytes)
  17.800 μs (0 allocations: 0 bytes)
  2.322 μs (0 allocations: 0 bytes)

n = 50
  171.148 μs (73 allocations: 5.47 KiB)
  788.392 μs (0 allocations: 0 bytes)
  25.299 μs (0 allocations: 0 bytes)

n = 100
  2.277 ms (73 allocations: 5.47 KiB)
  11.757 ms (0 allocations: 0 bytes)
  78.519 μs (0 allocations: 0 bytes)

n = 500
  1.404 s (75 allocations: 5.53 KiB)
  7.343 s (0 allocations: 0 bytes)
  4.129 ms (0 allocations: 0 bytes)

(Note, due to, you will need to use Tullio version 0.2.8 or earlier to reproduce these times, not the latest version.)

I think in theory, LoopVectorization.jl should be able to create a pretty good microkernel for this, but perhaps the optimal algorithm is really just one matmul after another. In some senses, this makes sense because the naive algorithm involving two temporary buffers is basically trading extra memory usage for better cache locality. But it could also be that this is just a use-case that LoopVectorization.jl is not well tuned for.

I’d be interested to know if anyone knowledgeable about LoopVectortorization.jl and BLAS has any thoughts on this. @Elrod?

1 Like

The two multiplications are O(N^3), while favx! is O(N^4). I assume ftullio! is as well, but at least benefits from multithreading.

I’ll take a look at the generated code to make sure it looks decent, aside from bad algorithmic complexity.
It’d be a cool optimization for LoopVectorization to transform those loops into a sequence of two matmuls. How to implement that is something to think about. Any ideas on how to recognize transformation opportunities like that?
Relevant issue (filed by @mcabbott).


Ah of course, I should have thought more about the algorithmic complexity.

Hm, in this case, to take advantage of this you’d need to allocate a second buffer though, right? Or I guess maybe the user could give permission for you to overwrite U or A, but that sounds scary.

I think this might be a thing where you are forced to choose between N^4 complexity or double the memory usage, but maybe I’m wrong.

I think it is well known / generally accepted, that in contracting a bunch of tensors (multiplying a bunch of matrices is just a special case), the computational efficiency is always minimized by doing pairwise contractions of the tensors. But indeed, this has the downside of a memory overhead; you need to store all the intermediate temporaries.

In the context of TensorOperations.jl for contracting arbitrary tensor networks, I have noticed that Julia’s GC still has serious issues (as in, slowdowns) with allocating large temporaries, even if these are always used in exactly the same manner. That’s why the @tensor macro allocates temporaries in a package wide global cache (using an LRU structure) such that it can recycle existing temporaries rather than constantly allocating new ones, and letting the GC clean up the old ones.

In a simple case as this, where you only need one temporary, it will probably still be more efficient to manage it manually, but if you contract many tensors together, that would be a big annoyance.