Enzyme AD with matrix exponential

Hey,

I am currently trying to use Enzyme’s gradient function to get the gradient of a quantum-mechanical expectation value with respect to some parameters that go into the Hamiltonian.
The problem: AD with Enzyme fails whenever my function uses matrix exponentials.

Maybe some math helps:

My Hamiltonian H(\vec{x}) (a matrix) depends on parameters \vec{x}.
The time evolution is then given by U(\vec{x}) = e^{- i H (\vec{x}) t}
Finally, the expectation value of observables O_j (constant matrices) after time evolution is given by E(\vec{x}) = \text{Tr} [ O_j U(\vec{x}) \rho U^{\dagger}(\vec{x}) ], where \rho is my (constant) density matrix.

In the end, I just want the gradient of the expectation value, \nabla_{\vec{x}} E(\vec{x}), via automatic differentiation with Enzyme.

I tried Base.exp() as well as functions from ExponentialUtilities.jl, KrylovKit.jl, and ExponentialAction.jl, without any success.

MWE for my expectation value:

using Enzyme, LinearAlgebra

function expect(x::Vector{Float64})
    ρ = 1/2*[1 0; 0 1] #density matrix describing my system
    H = [x[1] 0;0 -x[1]] + [0 x[2];-x[2] 0] #Hamiltonian
    U = exp(-1im*H) #time evolution operator, THIS ONE CAUSES ME PROBLEMS
    O = [1 0; 0 -1] #some observable
    ρ_evolved = U*ρ*U'
    return real(tr(O*ρ_evolved)) #expectation value
end 

and simply calling

res = Enzyme.gradient(Reverse, x -> expect(x), rand(2))

causes the error message

┌ Warning: Using fallback BLAS replacements for (["zherk_64_", "zgemm_64_"]), performance may be degraded
└ @ Enzyme.Compiler ~/.julia/packages/GPUCompiler/GnbhK/src/utils.jl:59
ERROR: LoadError:
No augmented forward pass found for ejlstr$zgebal_64_$libblastrampoline.so.5
at context: call void @"ejlstr$zgebal_64_$libblastrampoline.so.5"(i8* nonnull %10, i8* nonnull %13, i64 %94, i8* nonnull %16, i64 %95, i64 %96, i64 %97, i64 %98, i64 1) #244 [ "jl_roots"({} addrspace(10)* null, { i8*, {} addrspace(10)* } %93, {} addrspace(10)* null, {} addrspace(10)* null, {} addrspace(10)* null, { i8*, {} addrspace(10)* } %90, {} addrspace(10)* null, {} addrspace(10)* null) ], !dbg !298
Stacktrace:
[1] gebal!
@ ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/lapack.jl:243
[2] exp!
@ ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/dense.jl:682
[3] exp
@ ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/dense.jl:622
[4] expect
@ ~/Documents/test.jl:5
[5] #1
@ ~/Documents/test.jl:11 [inlined]
[6] diffejulia__1_7124wrap
@ ~/Documents/test.jl:0
[7] macro expansion
@ ~/.julia/packages/Enzyme/RTS5U/src/compiler.jl:8398 [inlined]
[8] enzyme_call
@ ~/.julia/packages/Enzyme/RTS5U/src/compiler.jl:7950 [inlined]
[9] CombinedAdjointThunk
@ ~/.julia/packages/Enzyme/RTS5U/src/compiler.jl:7723 [inlined]
[10] autodiff
@ ~/.julia/packages/Enzyme/RTS5U/src/Enzyme.jl:491 [inlined]
[11] autodiff
@ ~/.julia/packages/Enzyme/RTS5U/src/Enzyme.jl:512 [inlined]
[12] macro expansion
@ ~/.julia/packages/Enzyme/RTS5U/src/Enzyme.jl:1719 [inlined]
[13] gradient(::ReverseMode{false, false, FFIABI, false, false}, ::var"#1#2", ::Vector{Float64})
@ Enzyme ~/.julia/packages/Enzyme/RTS5U/src/Enzyme.jl:1660
[14] top-level scope
@ ~/Documents/test.jl:11

Versions:

  • Julia: v1.11.1
  • LinearAlgebra: v1.11.0
  • Enzyme: v0.13.16

I already have a working implementation with Zygote. Why do I want to switch to Enzyme? To get rid of allocations: since Zygote doesn’t allow mutation, I cannot use preallocated buffers and have to rely on A*B for matrix multiplication instead of mul!(C, A, B).

My background: more or less only physics and maths, and I’d call myself a newbie regarding AD.
Any input is appreciated a lot.

Apologies for my previous post, I panicked as I accidentally posted the issue without correctly formatting code.

I can’t help you with Enzyme specifically, but I do want to point out that calculating the gradient of the matrix exponential with respect to parameters in the Hamiltonian has a nice analytical solution. It’s originally a linear-algebra result, but has been somewhat widely adopted in quantum control.

e^{-i G dt} \begin{pmatrix} 0 \\ \vdots \\ 0 \\ |Ψ⟩ \end{pmatrix} = \begin{pmatrix} \frac{∂}{∂ϵ₁} e^{-i H dt} |Ψ⟩ \\ \vdots \\ \frac{∂}{∂ϵₙ} e^{-i H dt} |Ψ⟩ \\ e^{-i H dt} |Ψ⟩ \end{pmatrix}.

where

G = \begin{pmatrix} H(t) & 0 & \dots & 0 & H₁(t) \\ 0 & H(t) & \dots & 0 & H₂(t) \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \dots & H(t) & Hₙ(t) \\ 0 & 0 & \dots & 0 & H(t) \end{pmatrix}

and Hₙ(t) is the derivative of H with respect to the n-th parameter \epsilon_n.
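
For a single parameter, a minimal (dense, direct) sketch of this construction might look like the following; the function name and arguments are just placeholders, with H1 = \partial H / \partial \epsilon_1:

using LinearAlgebra

# exp(-i*G*dt) applied to the padded vector (0, Ψ) yields
# (∂/∂ϵ₁ e^{-iH dt} Ψ,  e^{-iH dt} Ψ), per the block formula above.
function exp_and_derivative(H, H1, Ψ, dt)
    m = size(H, 1)
    G = [H H1; zeros(m, m) H]              # block upper-triangular auxiliary matrix
    res = exp(-im * dt * G) * vcat(zero(Ψ), Ψ)
    return res[m+1:end], res[1:m]          # (e^{-iH dt} Ψ, its derivative w.r.t. ϵ₁)
end

(As discussed further below, for larger dimensions you would not actually instantiate or exponentiate G; this is just the literal translation of the formula.)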

I’m actually quite sure that this result is equivalent to a specific way of “forward-mode AD”, but I haven’t run through the math to see if that interpretation is correct.

Some literature on this to potentially cite:

[1] C. F. Van Loan, Computing integrals involving the matrix exponential, IEEE Trans. Automat. Contr. 23, 395 (1978).

[2] P. de Fouquières, S. G. Schirmer, S. J. Glaser, and I. Kuprov, Second order gradient ascent pulse engineering, J. Magn. Reson. 212, 412 (2011).

[3] D. L. Goodwin and I. Kuprov, Auxiliary matrix formalism for interaction representation transformations, optimal control, and spin relaxation theories, J. Chem. Phys. 143, 084113 (2015).

[4] S. Machnes, E. Assémat, D. Tannor, and F. K. Wilhelm, Tunable, flexible, and efficient optimization of control pulses for practical qubits, Phys. Rev. Lett. 120, 150401 (2018).

There is an implementation of this concept in QuantumGradientGenerators.jl (see the API docs). It’s not quite ready yet for your exact use case of static parameters in the Hamiltonian: it currently handles derivatives w.r.t. control functions, for use in GRAPE.jl. Ultimately it is planned to be generalized for use in the WIP ParameterizedQuantumControl.jl.

So that code might not help you much right now, but you might be able to adapt it, or just use the underlying mathematical result to solve your problem.


Hi @stone44, it looks like there is no support in Enzyme.jl for matrix exponentials, according to these issues:

This section in the documentation of Enzyme.jl might be of interest. You can also test out and benchmark other AD packages in Julia quickly with DifferentiationInterface.jl.
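
For instance, something along these lines lets you swap backends with minimal changes (a sketch following DifferentiationInterface’s gradient(f, backend, x) calling convention; note that the x::Vector{Float64} annotation in the MWE would have to be dropped for dual-number backends such as ForwardDiff):

using DifferentiationInterface   # provides the AutoXXX backend types
import ForwardDiff               # the actual AD backend; swap for Enzyme, Zygote, ...

# `expect` as in the MWE above, but without the ::Vector{Float64} annotation
grad = DifferentiationInterface.gradient(expect, AutoForwardDiff(), rand(2))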


Like all forward-mode methods, this is an inefficient way to compute the gradient of a scalar function of the output (as is the case here, where in the end you take a trace).

In particular, if your original matrix is m \times m, and you have n parameters, you end up computing an m(n+1) \times m(n+1) matrix exponential (though it’s probably not quite that bad if you exploit the structure of your matrix … even so, I expect the cost will be at least n times the cost of a single m \times m exponential, as is typical for forward-mode methods).

In contrast, if you employ a reverse-mode approach to compute the gradient of a scalar function, then there is a simple formula (Mathias, 1996) involving exponentiating a 2m \times 2m matrix that (when applied in an “adjoint”/reverse-mode fashion) gets you all n derivatives at once.
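
To make this concrete, here is a minimal sketch of such a reverse-mode rule for U = \exp(A) (my own conventions, not the OP’s code): with dJ = \operatorname{Re}\operatorname{tr}(\bar{U}^{\dagger}\, dU), the cotangent of A is \bar{A} = L_{\exp}(A^{\dagger}, \bar{U}), the Fréchet derivative of \exp at A^{\dagger} in the direction \bar{U}, which the 2m \times 2m block formula gives as the upper-right block of a single matrix exponential:

using LinearAlgebra

# Pullback (vector-Jacobian product) for U = exp(A), with A an m×m matrix.
# Given the cotangent Ū of U, return the cotangent Ā of A.
function exp_pullback(A, Ū)
    m = size(A, 1)
    M = [A' Ū; zeros(eltype(A), m, m) A']   # 2m×2m block matrix
    return exp(M)[1:m, m+1:end]             # upper-right block = L_exp(A†, Ū) = Ā
end

The seed Ū comes from differentiating the scalar cost (here, the trace) with respect to U; Ā is then chained onto the parameters through A = -iH(\vec{x})t.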

We assigned this as a pair of homework problems in our Matrix Calculus course at MIT last year: see problem 6 in pset 1 (matrix-function Jacobian operator) and problem 4.2 in pset 2 (applying it efficiently in reverse mode).

(The approach above works for any sufficiently smooth matrix function f(A). For matrix exponentiation in particular, however, there is an even more efficient algorithm that is implemented in ChainRules.jl … see this issue for references and discussion.)

PS. I have to come up with new homework assignments for this January’s Matrix Calculus class — your observable-expectation value example will make a good problem, thanks!


This is a little more subtle, so I’m going to have to get a lot more technical here… I’m not actually recommending the instantiation of any matrices or the use of any matrix-matrix multiplications. I’m certainly not saying that the huge block-sparse matrix G should ever be instantiated, and most definitely, it should not be explicitly exponentiated.

One key is that I was talking about calculating the derivative of the application of the matrix-exponential to a vector (which is sufficient for the OP’s problem), not the derivative of the matrix exponential as another matrix.

To simplify the notation in the OP’s problem a little bit, let’s say you want to calculate \nabla_{\epsilon_n} J for the scalar function J = |\langle \Psi_{tgt} \vert U \vert \Psi \rangle|^2 with U = \exp{\left[-i H(\{\epsilon_n\}) t\right]}.

For anyone unfamiliar with Braket notation of quantum mechanics, \vert \Psi \rangle is just the notation for a complex vector, what you might otherwise write as \vec{\Psi}, the notation \langle \Psi \vert is for the co-state (\vec{\Psi})^\dagger (complex conjugate-transpose), and \langle \Psi_1 \vert \Psi_2\rangle is the inner product between two vectors. The “target” state |\Psi_{tgt}\rangle takes care of the O in the original problem.

So now you parallelize your calculation over the different parameters \epsilon_n. That is, we’ll only look at one \epsilon_n. For each \epsilon_n, you now have the (abstract) block matrix

G = \begin{pmatrix} H & H_n \\ 0 & H \end{pmatrix}

where H_n = \partial H/\partial \epsilon_n. Now you have to calculate the application of the matrix exponential of that G to an abstract vector consisting of two blocks: a |\Psi_n\rangle originally initialized to zero, and your original vector |\Psi\rangle. This evaluation of the exponential you can do in whatever way you would normally evaluate U |\Psi\rangle. Assuming the dimensions aren’t trivially small, you’d typically do this by expanding it into a polynomial series. Think Taylor, except that Taylor happens to converge pretty slowly. If H is Hermitian, the provably fastest-converging polynomial series is a Chebychev series (implemented, e.g., as Cheby in QuantumPropagators.jl). The main point is, you would generally want to evaluate this using matrix-vector products only, and for the case of G and the associated “padded” vector, you would then use that

G \begin{pmatrix}|\Psi_n\rangle \\ |\Psi\rangle\end{pmatrix} = \begin{pmatrix} H |\Psi_n\rangle + H_n |\Psi\rangle \\ H |\Psi\rangle \end{pmatrix}

So this is something you can evaluate without ever instantiating G or the extended vector (you just keep track of the two halves of the extended vector in separate variables).
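
In code, that block action is nothing more than two matrix-vector products and an addition; a minimal sketch (the function name is just a placeholder) of the routine you would hand to a polynomial propagator or any other matvec-based exponential evaluation:

# Apply the (never instantiated) G to the "padded" vector (Ψn, Ψ),
# keeping the two halves in separate variables.
function apply_G(H, Hn, Ψn, Ψ)
    return H * Ψn + Hn * Ψ, H * Ψ
end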

I hope I didn’t screw up anything writing down these equations, but the full details (with a slightly more general notation) are in section 2.3 of one of my papers: Quantum Optimal Control via Semi-Automatic Differentiation, Quantum 6, 871 (2022)

For the rest of J you just follow the chain rule, which makes you end up with Eq. (11) in that paper (where the gradient of \tau is what we just went through, for all the parameters \epsilon_n). The paper skips over the fact that you can process these in parallel, so the formulas there keep all of the \epsilon_n.

This whole procedure keeps working through “time ordering”, i.e., if H has an explicit time dependency and U is no longer just a simple matrix exponential but the entire numerical procedure for simulating the time dynamics of |\Psi(t)\rangle. That’s something you’d probably do with the ODE solvers in DifferentialEquations.jl. These of course also only use matrix-vector multiplication, so you would give them G as the (abstract) matrix and the “padded” vector (0, \Psi) as the state vector, and you get out the “gradient vectors” |\Psi_n\rangle that you can then use via a relatively simple chain rule (or maybe via AD if your J is non-analytical; that’s basically what that paper is about).

It also extends to density matrices in open quantum systems, for what that’s worth (nothing changes, only the notation).

Maybe calling all of this “forward-mode AD” was misleading, although I do think there’s a pretty strong connection. But I’m quite sure that this whole procedure is the numerically most efficient way to handle this kind of problem. If someone has a way to improve on that, we should sit down and write a paper together, which, I think, would make a pretty big splash ;-).

The runtime tends to be somewhere between 1 and 2 times the cost of evaluating U (less than 2 because the derivative H_n is typically sparser than the full H), times the number of parameters \epsilon_n. But those you can parallelize, so it still scales well in terms of wallclock time, especially since the number of parameters in these quantum control problems or variational quantum circuits tends to be relatively small, and the numerical effort is dominated by the size of the Hilbert space and the cost of simulating the dynamics.

As for reverse-mode AD (“full AD”, as we call it in that paper), that’s been tried, of course. Unfortunately, for the problems in quantum control, the memory overhead associated with keeping track of the computational graph makes this scale poorly. You also tend to lose a lot of runtime performance because you have to contort yourself to the limitations of most AD frameworks (no in-place operations, maybe no support for complex vectors). This was observed repeatedly, even by the people who introduced the use of AD to quantum control, and we have some benchmarks to that effect in Section 5.3 of the paper.


In the original problem here, @stone44 is computing a trace. While there are iterative trace-estimation methods (e.g. Hutchinson’s method) that only require (many) matrix–vector products that can be computed iteratively (or even better methods for v^T M v products, à la stochastic Lanczos quadrature, …), they are likely to be slower than dense methods until the matrices become fairly large. I’m not sure how big the actual matrices in question are here?

This scaling with the number of parameters is the problem I was pointing out — it’s the generic difficulty with all forward-mode differentiation algorithms.

With a reverse-mode algorithm, it should be possible to have a cost that is almost independent of the number of parameters. (That’s definitely the case for dense methods using the formulas linked above.) In the case of an iterative exp–vec calculation, no simple formula is occurring to me at the moment, other than the “brute-force reverse-mode” of backpropagating through the steps of the iterative exp–vec solver (which has memory overhead as you noted).

You don’t need to use AD to use reverse mode. You can implement it manually (that’s what most of the people in computational engineering have done for decades now, where they call it an “adjoint” method). When you implement it manually you can often take better advantage of the structure of your problem.

Forward and reverse modes simply refer to the direction in which you apply the chain rule: from inputs to outputs (forward) or from outputs to inputs (reverse) … or even some mixture of the two.


First of all, @stevengj @goerz, I’d like to thank you for replying in such a detailed way and for all the input; I am currently digging through all of it, which is why I hesitated to reply. Thanks to @apkille for the info as well. I am writing this reply to answer the question about the dimension of my matrices: their size ranges from 16 \times 16 to 256 \times 256. I will comment on all the input you provided once I’ve dug through it. Thanks again, you have been of tremendous help.

Glad I could provide inspiration for a homework sheet :smiley:

You’re not wrong. The two issues that come into play in practice are the memory overhead of reverse-mode AD and the general limitation of having to do everything inside of an AD framework.

For the former, as you point out, forward-mode AD basically trades that memory overhead for the additional runtime overhead of having to do the calculation once for every parameter. Unfortunately, when you have large vectors (“large Hilbert space”) and/or deep computational graphs, this overhead can become absolutely prohibitive, and you just can’t use reverse-mode AD anymore. Large Hilbert spaces come up whenever you try to simulate anything that involves spatial motion, in quantum information when you have a large number of qubits (exponential scaling), or just for open quantum systems (additional quadratic scaling). The “deep computational graphs” would be if you have a large time grid. The way the OP described the problem, they’re looking at a single time step, so that’s decidedly not a “deep computational graph”. So they’re more than likely totally fine using reverse-mode AD, and numerical scaling doesn’t matter anyway in that scenario; everything will be running in less than a second of wallclock time.

Which brings us to the second practical issue: having to do the entire simulation inside the AD framework. Originally, AD frameworks were really designed for machine learning, which only uses real vectors and has problems with the complex vectors of quantum mechanics. There’s also the lack of in-place operations, although that only becomes relevant for large Hilbert spaces. I’ve done okay with Zygote, although I have also run into the famous correctness bugs (the gradient of a determinant of a complex matrix was completely and silently wrong at the time), and of course there is this very example of Enzyme not handling matrix exponentials. Things are getting better, but as soon as you’re forced to work inside of AD, there tend to be headaches. If there’s a way to evaluate a gradient quasi-analytically, I’d always prefer that over having to use an AD framework, in particular because this gradient appears throughout quantum control and is thus good to have hard-coded. You can still combine it with AD (any kind of AD) for when the ultimate function J does not have an easy analytic gradient (the “semi”-AD approach).

You don’t need to use AD to use reverse mode. You can implement it manually […] they call it an “adjoint” method

Yeah, you can do that in quantum control as well (and I’ve done it to some extent). It’s a little more complicated to implement, and it still requires a significant amount of memory. Of course, we’re getting more and more memory, so the point at which that really becomes a problem very much depends. It’s probably not an issue in all but the most large-scale problems you would encounter in industry, and definitely not in anything you’d have to deal with in teaching or even research.


the size of my Hamiltonian ranges from 16 × 16 to 256 [× 256]

Yeah, especially if you only have a single time step, you’re not really in a situation where you need to worry too much about the numerical scaling of reverse-mode vs. forward-mode AD, or the efficiency of evaluating the application of the exponential.

Although already if you’re going beyond 100 × 100, I’d avoid explicitly exponentiating any matrices.

But fundamentally, you should be okay with using AD, if you can get it to work. This is just a particular Enzyme problem. Have you tried the more established Zygote instead?

Maybe I should add something: I am minimizing a cost function, for which I need to evaluate expectation values for multiple observables \approx 100-500 times.
Yes, I am currently using Zygote, but since it does not allow mutation, I am looking for another way to reduce allocations and increase performance.

You can still sometimes use reverse-mode differentiation with large state (not necessarily via AD!!), but you have to use specialized tricks. Similar issues arise when trying to backpropagate through very large systems of ODEs (e.g. discretized PDEs). One trick is “checkpointing” where you don’t save all of the state, and instead recompute some of it on the fly, trading off compute cost for memory. Another is to apply lossy compression to the saved state, because often you don’t need gradients to machine precision.
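
To illustrate the checkpointing idea with a toy sketch (not tied to any particular package; step, step_vjp, and loss_grad are hypothetical placeholders for one time step, its vector-Jacobian product, and the gradient of the final cost):

# Checkpointed reverse pass over N time steps: store only every k-th state
# in the forward pass, and recompute intermediate states segment by segment
# during backpropagation (trading extra compute for memory).
function checkpointed_gradient(ψ0, p, N; k=10)
    checkpoints = Dict(0 => ψ0)
    ψ = ψ0
    for i in 1:N                         # forward pass
        ψ = step(ψ, p, i)
        i % k == 0 && (checkpoints[i] = copy(ψ))
    end
    λ = loss_grad(ψ)                     # ∂J/∂ψ(T), seeds the reverse pass
    ∇p = zero(p)
    for i in N:-1:1                      # reverse pass
        i0 = k * div(i - 1, k)           # nearest stored checkpoint ≤ i-1
        ψi = checkpoints[i0]
        for j in (i0 + 1):(i - 1)        # recompute the state entering step i
            ψi = step(ψi, p, j)
        end
        λ, gp = step_vjp(ψi, p, i, λ)    # pull back through step i
        ∇p .+= gp
    end
    return ∇p
end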

You don’t have to use AD! (Indeed, if you are comparing manually implemented and optimized forward-mode algorithms to AD reverse mode, it’s not surprising that the latter looks bad.)

My experience is that, for any sufficiently complicated problem in science or engineering, you will always have to implement at least some portion of the chain rule (i.e. some vector–Jacobian products) manually. Either because the AD fails completely or because it is grossly inefficient (not exploiting some structure of your problem).

If you were to use reverse mode, it would be crucial to start the backpropagation at your scalar cost function, not to differentiate each observable separately and then add the gradients into the cost gradient.

Not if you’re computing a trace. I would be shocked if iterative trace estimation via iterative exponential–vector products were faster than dense-matrix calculations for a 256x256 matrix.

This is especially true if you need gradients with respect to many parameters, since backpropagation/reverse methods for dense matrix exponentials are quite efficient.

Yes, that is one option I’ve played around with

Another is to apply lossy compression to the saved state, because often you don’t need gradients to machine precision.

That’s an option I have not played around with, but actually a very interesting idea!

Yeah, you’re probably right in the context of the trace and AD. I was thinking more generally of the simulation of the dynamics. Some tentative benchmarks I made recently compare calculating exp(i H dt) |Ψ⟩ by first doing the matrix exponential and then multiplying the resulting matrix onto |Ψ⟩ vs. doing it by an iterative expansion in Chebychev polynomials, for dense H. The crossover point is definitely way below a dimension of 100.

On a separate note, it’s better to do U = cis(-H) than to call exp, since cis will exploit the Hermitian structure of H, especially if you wrap it in Hermitian(H).

(There’s an open issue to add analogous support to ExponentialUtilities.jl.)
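
A small illustration, assuming H is Hermitian:

using LinearAlgebra

H = Hermitian([1.0 0.5; 0.5 -1.0])   # toy Hermitian Hamiltonian
U = cis(-H)                          # exp(-im*H), computed via an eigendecomposition
U ≈ exp(-im * Matrix(H))             # true, up to roundoff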


Yeah, that’s just an error message saying that we hit a LAPACK call that we haven’t implemented the derivative for.

There’s no reason not to have it (and we really should add it, or the matrix exponential directly) other than other infrastructure needs coming up and this not being as common in the codes I’ve seen.

Contributions welcome!

That said, you can probably use import_rrule (API reference · Enzyme.jl) to import the matrix-exponential rule into Enzyme from ChainRules (though it may not be as performant as an Enzyme-native rule). It will let the rest of the system be Enzyme and thus support mutation/GPU/other high-performance things.

If you really want to reduce allocations, I’d recommend compiling your code with Reactant.jl (GitHub - EnzymeAD/Reactant.jl), which essentially removes type instabilities and temporary allocations, fuses things together, and is naturally compatible with Enzyme.

I know it’s used in some tensor network stuff (cc @mofeing) but I don’t know if we added the matrix exponential to it yet.

Oh, one other small thing: if your matrices are small like here, you can just use StaticArrays, which doesn’t call LAPACK and works fine here with Enzyme (and ought to be much faster too).

julia> using StaticArrays, Enzyme, LinearAlgebra

julia> function expect(x)
           ρ = 1/2*@SMatrix [1 0; 0 1]  # density matrix describing the system
           H = (@SMatrix [x[1] 0;0 -x[1]]) + (@SMatrix [0 x[2];-x[2] 0]) # Hamiltonian
           U = exp(-1im*H)              # time evolution operator
           O = @SMatrix [1 0; 0 -1]     # some observable
           ρ_evolved = U*ρ*U'
           return real(tr(O*ρ_evolved)) # expectation value
       end
expect (generic function with 1 method)

julia> res = Enzyme.gradient(Reverse, expect, @SVector rand(2))
([3.59619990778441e-18, 0.0],)

We don’t yet support the matrix exponential because that requires an eigendecomposition, and we don’t yet support matrix factorizations (but it’s around the top 3 of my priorities).
