Hi all,
We’ve recently uploaded to ArXiv a theoretical analysis for ADVI (although the title says black-box VI), which I believe is the first formal full convergence proof for any black-box VI-type algorithm that uses SGD:
[2305.15349] Black-Box Variational Inference Converges (arxiv.org) .
What’s interesting is that the analysis reveals some very unexpected properties of the covariance parameterization that we use in practice. In particular, if we use non-linear transformations for the diagonal elements (as done for the mean-field parameterization in Stan and Turing), such as
L_{ii} = \exp(\ell_i),
where \mathbf{L} = \mathrm{diag}\left(L_{11}, \ldots, L_{dd}\right) is the (diagonal) Cholesky factor for the variational approximation, we provably lose speed. One could have achieved an \mathcal{O}(1/T) converge rate for nice posteriors but only gets \mathcal{O}(1/\sqrt{T}) instead. And if one does the same for full-rank parameterizations, as PyMC3, but not Stan, the ELBO might not even be convex even if the posterior is log-concave!
In the context of Julia’s probabilistic programming ecosystem, this means using positive_definite
from ParameterHandling.jl
to construct variational approximations is suboptimal.
In our experiments, we indeed observe that the mean-field parameterization without any exp
or softplus
transformation, for enforcing the scale to be positive, converges the fastest. To me, this is one of those rare occasions where optimization theory precisely tells you what happens in practice, which is not that common, unfortunately.
Please let me know if you have any comments or questions.
10 Likes
Interesting work, thanks for posting that here.
From a quick glance it seems that the diagonal components of the Cholesky factor are linear in the underlying parameter in ParameterHandling.jl, which if I understand the paper correctly is what seems to achieve the fastests convergence. But maybe I’m missing the point. Could you point out what is suboptimal in the code or maybe open an issue?
2 Likes
Oh that’s true. I have no idea why I thought that it enforced positive-definiteness. I think I may have mixed it up with positive
. Thanks for pointing this out!
But, wouldn’t the fact that it does not enforce positive-definiteness be misleading? Maybe a name like lower_triangular
would be more appropriate.
1 Like
I think this is something that could be discussed. Maybe adding an optional second argument positive_definite(X, strict=false)
is a non-breaking way of going about this.
But I’m thinking that given your results would there be any benefit in using a non-linearity for the diagonal elements in practice? In my applications (GPs) we always add some jitter anyway.
Indeed, our paper shows that it shouldn’t be necessary for VI. But I can imagine people trying to use positive_definite
for other applications where PD-ness must be enforced and get confused.
Thanks for the suggestions! Didn’t know Frank-Wolfe could be used to deal with non-smoothness. But would it have theoretical/practical benefits for this application? Excuse my lack of knowledge of Frank-Wolfe.
I don’t think the non-smoothness would be needed when using FW. If I understood your paper correctly (only briefly skimmed through), you are modelling the constraints using a non-smooth objective instead of using domain transformation. FW gives you a third choice where you can ensure the optimiser always remains in the feasible domain without bringing the constraints up in the objective as a non-smooth term. Modelling the problem as a conic-constrained (stochastic) nonlinear optimization problem will allow (stochastic) FW to respect these conic constraints, e.g. non-negativity or positive semi-definite constraints, without introducing non-smooth objectives.
1 Like
Oh, I see where you’re coming from. Yes, your understanding of our paper is spot on. The problem with constraining the domain is that choosing the constraints is not straightforward. One needs to set a lower bound on the variance of the variational approximation, which means we need to have a good guess of the posterior variance.
1 Like
0? Or in the dense case, positive definite.
The smoothness of the entropy term scales as 1/\sigma. So to get a smooth loss function you actually need a strictly positive lower bound. The problem is how small must that positive constant be.