For our SumProductTransformation networks (https://arxiv.org/abs/2005.01297), we have created an invertible version of the “Dense” transformation usual in neural networks. Our version features efficient inversion and efficient calculation of the determinant of the Jacobian (of course only where this operation makes sense, i.e. the map is from R^d → R^d). Our approach (detailed in the paper) relies on representing and optimizing the dense layer in SVD-decomposed form, for which we needed a differentiable parametrization of the group of unitary matrices. We have separated this functionality into a separate repo, which is now registered, and you can freely use it with Flux / Zygote: https://github.com/pevnak/Unitary.jl.
For the implementation of the Dense layer, see https://github.com/pevnak/SumProductTransform.jl/blob/master/src/layers/svddense.jl, which roughly implements the Bijectors.jl interface.
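To give a flavour of the idea, here is a minimal sketch in plain LinearAlgebra (not the actual SumProductTransform.jl code; `forward`, `inverse`, and `logabsdetJ` are just illustrative names, not the Bijectors.jl interface). It shows why the SVD form makes both inversion and the log-determinant of the Jacobian cheap:

```julia
using LinearAlgebra

# Toy SVD-parametrized dense layer y = U * Σ * Vᵀ * x + b.  In Unitary.jl the
# orthogonal factors are built from Givens rotations; here two fixed orthogonal
# matrices stand in for them.
d = 4
U = Matrix(qr(randn(d, d)).Q)       # stand-in orthogonal factor
V = Matrix(qr(randn(d, d)).Q)
σ = exp.(randn(d))                  # positive singular values
b = randn(d)

forward(x)   = U * (σ .* (V' * x)) .+ b
inverse(y)   = V * ((U' * (y .- b)) ./ σ)
logabsdetJ() = sum(log.(σ))         # Jacobian is U * Diagonal(σ) * Vᵀ, so |det| = ∏ σᵢ

x = randn(d)
@assert inverse(forward(x)) ≈ x
@assert logabsdetJ() ≈ first(logabsdet(U * Diagonal(σ) * V'))
```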
I would be happy if someone finds a use for this. We are currently working on a GPU version, but it will take a bit of time. Meanwhile, reach out to me with possible enhancements or questions.
Tomas
This is great! I was just starting to write almost exactly Unitary.jl for a project right now, so perhaps I won’t have to.
Cool package! I know absolutely nothing about the ML context, but just some remarks:
- It’s pretty confusing to call these unitaries if you’re only doing real matrices. Why don’t you call them orthogonal?
- You can get other parametrizations by using exponential mappings. They should probably be better in the sense that they deform the metric less.
- If I understand correctly, you just want to optimize over orthogonal matrices. Optim.jl and Manopt.jl support this. But if I understand correctly, your approach scales only quadratically, while the Riemannian optimization methods are usually cubic. I guess that’s the main point? CC the manifold people @kellertuer @mateuszbaran
You can get other parametrizations by using exponential mappings.
I’m not sure it’s different from what they are doing. The Givens rotation is exp(A), where A is skew-symmetric with only a single nonzero upper-triangular entry, i.e., a basis element of the Lie algebra of SO(n).
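For concreteness, a quick sanity check of that statement in plain LinearAlgebra (`givens_via_exp` is just a made-up name for the illustration):

```julia
using LinearAlgebra

# exp(θ * E) with E skew-symmetric and a single nonzero upper-triangular entry
# is exactly a Givens rotation in the (i, j) plane.
function givens_via_exp(n, i, j, θ)
    E = zeros(n, n)
    E[i, j] = -1.0
    E[j, i] = 1.0
    exp(θ * E)                      # dense matrix exponential
end

θ = 0.7
G = givens_via_exp(4, 1, 3, θ)
@assert G' * G ≈ I                             # orthogonal
@assert G[1, 1] ≈ cos(θ) && G[3, 1] ≈ sin(θ)   # 2×2 rotation block in the (1, 3) plane
```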
Right, I was thinking of the Riemannian optimization setting, where you take the exponential (or some other exponential-like mapping) of a full skew-symmetric matrix (the gradient of the objective function projected onto the tangent space).
That’s interesting. I guess this approach should be faster than the typical manifold gradient-based optimization? That is, using the standard matrix representation of orthogonal matrices, projecting the Euclidean gradient to the Riemannian gradient, and applying a retraction. Though I don’t know; maybe using a QR retraction instead of the exact exp would be faster and accurate enough here.
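For reference, here is a minimal sketch of one such step with a QR retraction on SO(n) in plain LinearAlgebra (this is not Manopt.jl’s API; the objective, step size, and names like `qr_retract` are made up for the illustration):

```julia
using LinearAlgebra

skew(A) = (A - A') / 2

# Project a Euclidean gradient G onto the tangent space of SO(n) at Q (embedded metric).
riemannian_grad(Q, G) = Q * skew(Q' * G)

# QR retraction: Q factor of a thin QR, with signs fixed so the diagonal of R is positive.
function qr_retract(Q, ξ)
    F = qr(Q + ξ)
    Matrix(F.Q) * Diagonal(sign.(diag(F.R)))
end

# Toy objective: f(Q) = ‖Q - A‖² over SO(3) for a random target rotation A.
A = exp(skew(randn(3, 3)))
f(Q) = sum(abs2, Q - A)
egrad(Q) = 2 * (Q - A)              # Euclidean gradient of f

Q = Matrix{Float64}(I, 3, 3)
for _ in 1:200
    global Q = qr_retract(Q, -0.1 * riemannian_grad(Q, egrad(Q)))
end
@assert norm(Q' * Q - I) < 1e-8     # the iterate stays on the manifold
f(Q)                                # should be close to zero after the descent
```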
Would it make sense to try using the matrix in QR-decomposed form instead of SVD?
Actually, we have tried QR as well, but SVD was experimentally better in our application. I believe that was partially caused by the fact that when the angles in the Givens rotations are zero, the matrix reduces to a diagonal matrix, which is nice in machine learning.
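To illustrate that point with a toy check in plain LinearAlgebra (not the package code): with all Givens angles at zero, the orthogonal factors are identities, so the SVD-form matrix is exactly diagonal.

```julia
using LinearAlgebra

θ = 0.0
G = [cos(θ) -sin(θ); sin(θ) cos(θ)]   # a Givens rotation at θ = 0 is the identity
σ = [2.0, 0.5]
W = G * Diagonal(σ) * G'              # the SVD-form matrix U Σ Vᵀ with U = V = G
@assert W ≈ Diagonal(σ)               # collapses to a diagonal matrix
```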
For a unitary (or orthogonal) matrix of dimension d, we use precisely d(d-1)/2 parameters. Multiplying a vector by this matrix requires four times more multiplications and two times more additions (besides the sin and cos evaluations).
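For illustration, a rough sketch of applying a product of Givens rotations to a vector (plain Julia, not the Unitary.jl implementation; `apply_givens!` is just an illustrative name):

```julia
using LinearAlgebra

# Apply a product of Givens rotations (one angle per (i, j) pair) to a vector in place.
# Each rotation touches two coordinates and costs 4 multiplications and 2 additions,
# so the full product needs only d(d-1)/2 angle parameters and O(d²) work.
function apply_givens!(x::AbstractVector, θs::AbstractVector)
    d = length(x)
    k = 0
    for i in 1:d-1, j in i+1:d
        k += 1
        c, s = cos(θs[k]), sin(θs[k])
        xi, xj = x[i], x[j]
        x[i] = c * xi - s * xj
        x[j] = s * xi + c * xj
    end
    x
end

x = randn(5)
θs = randn(5 * 4 ÷ 2)              # d(d-1)/2 = 10 angles for d = 5
y = apply_givens!(copy(x), θs)
@assert norm(y) ≈ norm(x)          # rotations preserve the Euclidean norm
```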
I am sorry for the confusing name. I guess I cannot change it, since the package is registered.