New Theoretical Analysis for ADVI / BBVI

Red-Portal · May 28, 2023, 7:31pm

Hi all,

We’ve recently uploaded to ArXiv a theoretical analysis for ADVI (although the title says black-box VI), which I believe is the first formal full convergence proof for any black-box VI-type algorithm that uses SGD:

[2305.15349] Black-Box Variational Inference Converges (arxiv.org) .

What’s interesting is that the analysis reveals some very unexpected properties of the covariance parameterization that we use in practice. In particular, if we use non-linear transformations for the diagonal elements (as done for the mean-field parameterization in Stan and Turing), such as

L_{ii} = \exp(\ell_i),

where \mathbf{L} = \mathrm{diag}\left(L_{11}, \ldots, L_{dd}\right) is the (diagonal) Cholesky factor for the variational approximation, we provably lose speed. One could have achieved an \mathcal{O}(1/T) converge rate for nice posteriors but only gets \mathcal{O}(1/\sqrt{T}) instead. And if one does the same for full-rank parameterizations, as PyMC3, but not Stan, the ELBO might not even be convex even if the posterior is log-concave!

~~In the context of Julia’s probabilistic programming ecosystem, this means using positive_definite from ParameterHandling.jl to construct variational approximations is suboptimal.~~

In our experiments, we indeed observe that the mean-field parameterization without any exp or softplus transformation, for enforcing the scale to be positive, converges the fastest. To me, this is one of those rare occasions where optimization theory precisely tells you what happens in practice, which is not that common, unfortunately.

Please let me know if you have any comments or questions.

simsurace · May 29, 2023, 8:54pm

Interesting work, thanks for posting that here.

From a quick glance it seems that the diagonal components of the Cholesky factor are linear in the underlying parameter in ParameterHandling.jl, which if I understand the paper correctly is what seems to achieve the fastests convergence. But maybe I’m missing the point. Could you point out what is suboptimal in the code or maybe open an issue?

Red-Portal · May 29, 2023, 9:33pm

Oh that’s true. I have no idea why I thought that it enforced positive-definiteness. I think I may have mixed it up with positive. Thanks for pointing this out!

But, wouldn’t the fact that it does not enforce positive-definiteness be misleading? Maybe a name like lower_triangular would be more appropriate.

simsurace · May 30, 2023, 2:29pm

I think this is something that could be discussed. Maybe adding an optional second argument positive_definite(X, strict=false) is a non-breaking way of going about this.
But I’m thinking that given your results would there be any benefit in using a non-linearity for the diagonal elements in practice? In my applications (GPs) we always add some jitter anyway.

mohamed82008 · May 30, 2023, 8:53pm

Instead of proximal SGD, could be worth trying stochastic Frank Wolfe here GitHub - ZIB-IOL/FrankWolfe.jl: Julia implementation for various Frank-Wolfe and Conditional Gradient variants. CC: @mbesancon

Red-Portal · May 30, 2023, 10:39pm

Indeed, our paper shows that it shouldn’t be necessary for VI. But I can imagine people trying to use positive_definite for other applications where PD-ness must be enforced and get confused.

Red-Portal · May 30, 2023, 10:40pm

Thanks for the suggestions! Didn’t know Frank-Wolfe could be used to deal with non-smoothness. But would it have theoretical/practical benefits for this application? Excuse my lack of knowledge of Frank-Wolfe.

mohamed82008 · May 30, 2023, 10:55pm

I don’t think the non-smoothness would be needed when using FW. If I understood your paper correctly (only briefly skimmed through), you are modelling the constraints using a non-smooth objective instead of using domain transformation. FW gives you a third choice where you can ensure the optimiser always remains in the feasible domain without bringing the constraints up in the objective as a non-smooth term. Modelling the problem as a conic-constrained (stochastic) nonlinear optimization problem will allow (stochastic) FW to respect these conic constraints, e.g. non-negativity or positive semi-definite constraints, without introducing non-smooth objectives.

Red-Portal · May 30, 2023, 11:19pm

Oh, I see where you’re coming from. Yes, your understanding of our paper is spot on. The problem with constraining the domain is that choosing the constraints is not straightforward. One needs to set a lower bound on the variance of the variational approximation, which means we need to have a good guess of the posterior variance.

mohamed82008 · May 31, 2023, 12:30am

0? Or in the dense case, positive definite.

Red-Portal · May 31, 2023, 1:37am

The smoothness of the entropy term scales as 1/\sigma. So to get a smooth loss function you actually need a strictly positive lower bound. The problem is how small must that positive constant be.

Topic		Replies	Views
Variational inference for linear regression General Usage turing	0	505	December 30, 2020
Turing: "no method matching cholenksy" Probabilistic Programming	2	964	September 29, 2019
New package: ApproximateVI.jl Statistics package , announcement , bayesian-inference	5	611	January 28, 2023
Amortized Hierarchical Variational Model Probabilistic Programming flux , turing , machine-learning	4	462	September 17, 2020
User defined ELBO in AdvancedVI or Turing Statistics	1	77	July 19, 2025

New Theoretical Analysis for ADVI / BBVI

Related topics