Conceptual Gaussian Process Questions

Hi! This is more of a theoretical tangent, so forgive me if it's a bit off topic. I recently finished my undergrad in Statistics and have been trying to navigate the US job market, with fairly dismal results. Anyway, I've been trying to pick up some cool new statistical nuggets in the meantime, especially ones I didn't see in school, like Gaussian Processes.

I've been trying to work through Rasmussen and Williams, but I don't feel like I'm getting a great conceptual picture of how Gaussian Processes fit in with the rest of Bayesian hierarchical models. I understand that by using a kernel function to generate a covariance matrix, we don't need to specify the form of our model beforehand, which gives more modeling flexibility. In a previous Bayesian course, I saw that we can compare the effectiveness of multiple models by building one large hierarchical model with an extra level for "model". Is there a connection with GPs here, or am I off base?

I suppose I (perhaps incorrectly) always thought of Bayesian linear regression models and the like as distributions over possible functions, so the emphasis on this in Gaussian Processes is confusing to me. Is this a property unique to GPs, and I've been misinterpreting other models? Or is it just a connection to the larger Bayesian framework?

Last one: a good number of blogs online seem to define the functions over which a GP places its distribution as smooth and continuous. Rasmussen and Williams seem to note that this is not necessarily the case, and that when generating functions from the prior we simply draw a random vector from the GP. To me it seems GPs place a distribution over a discrete collection of points, over a finite domain, and I don't understand how we can make predictions at points not in that domain.

I hope to use Stheno + Soss in an NLP project I'm working on, but I figure I should get a decent understanding before jumping in.

Thanks in advance for reading through my super newbie questions!

P.S. All the PPL and GP presentations from JuliaCon were great this year!


Bayesian linear regression expresses a distribution over parameterized functions via a posterior over the parameters of a linear model. Seen in function space, this can be understood as a distribution over linear functions.
Gaussian processes go a step further and provide a non-parametric solution by directly defining a distribution over functions, rather than over parameters.
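To make that connection concrete, here's a standalone sketch in plain Julia with Distributions.jl (not any PPL's API; all names and values here are mine, purely illustrative). A Gaussian prior over the weights of a linear model already induces a distribution over functions, and its implied covariance is exactly a "linear" kernel:

```julia
using Distributions, LinearAlgebra, Random, Statistics

Random.seed!(0)

# Draw many linear functions f(x) = w1 * x + w0 with weights w ~ N(0, I).
ws = rand(MvNormal(zeros(2), Matrix(1.0I, 2, 2)), 100_000)  # 2 × 100_000
f_at(x) = ws[2, :] .* x .+ ws[1, :]                         # f(x) for every draw

# Empirically, cov(f(x), f(x')) ≈ x * x' + 1, i.e. Bayesian linear regression
# is itself a GP with the linear kernel k(x, x') = x * x' + 1.
c = cov(f_at(2.0), f_at(3.0))   # ≈ 2 * 3 + 1 = 7
```

So the intuition in the original question is right: Bayesian linear regression is already a distribution over functions; the GP view just lets you pick richer kernels than the linear one.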

This is generally the case, but it doesn't have to be: a GP can, for certain kernels, have discontinuous sample paths.

If you draw a sample from a GP, you effectively draw from its marginal distribution at some particular points. In a GP, these finite-dimensional marginals are multivariate Gaussian; that is the defining property of a GP. The GP itself, however, is a stochastic process on a possibly uncountable domain. In general the domain of a GP can be arbitrary.
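To make the "finite marginals, uncountable domain" point concrete, here's a hypothetical standalone sketch in plain Julia (Distributions.jl, not the Stheno API; the kernel and points are my own choices). We only ever *evaluate* the GP at finitely many points, but prediction at a brand-new point is just Gaussian conditioning, so the domain itself is never restricted:

```julia
using Distributions, LinearAlgebra

# EQ kernel; evaluating it pairwise gives the covariance matrix of the
# finite-dimensional marginal at whichever points we choose to look at.
k(a, b; l=1.0) = exp(-(a - b)^2 / (2 * l^2))

x = collect(range(-2.0, 2.0; length=20))
K = Symmetric([k(a, b) for a in x, b in x] + 1e-9I)   # jitter for stability
f = rand(MvNormal(zeros(length(x)), K))               # one "function" sample

# Predicting at a new point x* is just conditioning the joint Gaussian:
xs  = 0.5
kxs = [k(a, xs) for a in x]
μ_star  = kxs' * (K \ f)                   # posterior mean at x*
σ²_star = k(xs, xs) - kxs' * (K \ kxs)     # posterior variance at x*
```

The prior lives on the whole (uncountable) domain; the 20-point vector is just one marginal of it, and conditioning is how predictions at unseen points fall out.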

I don't think you can use Stheno with Soss, but you can use Stheno together with Turing.

1 Like

I don’t think this is right, I believe @willtebbutt got this working.


julia> using Soss, Stheno

julia> m = @model x begin
           log_l ~ Normal(0, 1)
           f = GP(stretch(EQ(), exp(-log_l)), GPC())
           y ~ f(x, 0.01)
       end;

julia> x = randn(10);

julia> y = sin.(x) .+ 0.1 .* randn(10);

julia> rand(m(x=x))
(x = [-0.9103139475986523, 1.817173481775971, 0.8110679477795716, 0.05826537668307032, 0.8098361685429899, -0.5650524302202036, 0.8986258112915756, -1.918865787719517, 0.21018632417084698, 1.041718487568655], f = GP{Stheno.ZeroMean{Float64},Stheno.Stretched{Array{Float64,1},EQ,typeof(identity)}}(Stheno.ZeroMean{Float64}(), Stheno.Stretched{Array{Float64,1},EQ,typeof(identity)}([0.7555174626090574], EQ(), identity), 1, GPC(1)), log_l = 0.2803523835602159, y = [-0.06851178772548096, 1.4405539183483282, 0.12697038831557717, -0.585918914921409, 0.10589557901506293, -0.2760916532195983, 0.246168233887542, -0.6289652826124084, -0.6852695918761627, 0.3596390427783692])

julia> post = dynamicHMC(m(x=x),(y=y,)) |> particles
(log_l = 0.524 ± 0.28,)

Wow, I didn't expect answers from the big guns! Thanks for the clarifications @trappmartin and @cscherrer. I think I have a more solid understanding of how GPs work now, and hopefully using Stheno with Turing or Soss will help me more fully understand how GPs fit in with larger Bayesian modeling. Any more resources or exercises you two would recommend?

It's a shame I never realized in school how much a covariance structure could describe relationships in data. The fact that it can be manipulated for modeling was a bit of an epiphany for me.

There's some good material in Gelman et al.'s Bayesian Data Analysis. There's also a new book called Regression and Other Stories. I haven't seen it yet and don't know if it covers GPs, but it's also Gelman, Hill, and Vehtari, so I expect it's probably a good one for general Bayesian modeling.

EDIT: I just remembered that @willtebbutt also included some nice examples of use with Soss in the Stheno docs:

Epiphany days are the best :slight_smile:

1 Like

Cool, I wasn’t aware of this. Sorry for the mistake from my side.

1 Like

This is generally the case, but it doesn't have to be: a GP can, for certain kernels, have discontinuous sample paths.

Probably better to say that this is "often" the case: the most popular class of kernels – the Exponentiated Quadratic (a.k.a. RBF / Squared Exponential / "Gaussian") – does indeed distribute over smooth, differentiable functions, but there are plenty of kernels that give you different classes of functions. For example, the Exponential / Matern-1/2 (/ OU process) kernel produces sample paths that are continuous everywhere but differentiable nowhere, while the Matern-3/2 kernel produces sample paths that are once-differentiable.
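One way to see this difference numerically, in a self-contained plain-Julia sketch (not Stheno's kernel types, and the roughness statistic is mine, purely for illustration):

```julia
using Distributions, LinearAlgebra, Random

Random.seed!(1)
eq(a, b)  = exp(-(a - b)^2 / 2)   # Exponentiated Quadratic: smooth sample paths
m12(a, b) = exp(-abs(a - b))      # Matern-1/2 / Exponential: nowhere-differentiable

x = collect(range(0.0, 1.0; length=200))
sample_path(k) = rand(MvNormal(zeros(length(x)),
                               Symmetric([k(a, b) for a in x, b in x] + 1e-6I)))

# Root-mean-square of successive increments: rougher paths score much higher.
roughness(f) = sqrt(sum(abs2, diff(f)) / (length(f) - 1))
r_eq, r_m12 = roughness(sample_path(eq)), roughness(sample_path(m12))
```

On a grid this fine, the Matern-1/2 draw's increments come out roughly an order of magnitude larger than the EQ draw's, which is the path-roughness difference described above.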

Any more resources or exercises you two recommend?

This is less about the overall picture and more about generally understanding GPs / Bayesian inference over their hyperparameters, but if you’ve not already seen it, I always recommend this talk.


No GPs except for a simple demonstration without much explanation. ROS was already 550 pages, so GPs were left for the next book in progress.


Hi Aki, welcome to the Julia Discourse!

1 Like

Looking over BDA certainly helped piece things together in the larger modeling framework. I’d read through a few chapters in a previous course, but I must’ve missed the section on non-parametrics.

Also, thanks for replying, Aki! I took the advice in your class notes linked with BDA and did some work through Statistical Rethinking to ease myself in. I really appreciated the connections McElreath draws between GPs and hierarchical modeling. Afterwards, BDA's non-parametric section has been clearer (to me, anyway) than much of the material I'd seen before.


I couldn’t resist signing in as I wanted to click like for the nice Soss, Stheno GP code example!


Just so that Turing gets some air-time in this thread, here's a Turing + Stheno example that I wrote the other day here:


I actually found that post quite helpful, and I used that model! I was also looking back at JuliaCon 2019 and noticed the Stheno + Turing Poisson regression. Did you make any headway with other likelihoods in Turing or Soss? I really like the PPL + GP interface and would love to see this happen. I'd be more than happy to try to help get this going, but I'm not sure how useful I'd be.

There’s nothing really stopping you using other likelihoods at the minute. IIRC I was having some AD-issues at the time, but they’re resolved now.

The question of how efficient NUTS will be on a given problem remains, but in principle you should just be able to compose stuff.

If anything isn’t working, I would very much like to hear about it!


@nucklass There is a specific application of Gaussian Processes to time series (or other 1-D data) in which the GP describes the combination of a correlated "noise" component and a white-noise component. We discuss this pedagogically in detail, and give examples with application to astronomical time series, here: Bayesian inference can be used to infer the parameters of the "mean model" (a deterministic model) in addition to the parameters of the kernel, which describe the covariance matrix. So the GP doesn't describe a model in the usual sense; rather, it describes the correlations between data points with some separation (in our case in time, but this can be generalized to spatial or spectral dimensions as well). Our physically motivated harmonic-oscillator kernels can be used for a wide variety of stationary GPs (in which the kernel parameters are independent of time).
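For readers who want to see the shape of that setup, here's a hypothetical minimal sketch in plain Julia (names, kernel choice, and parameter values are mine, not from the paper): a deterministic mean model, plus a stationary correlated-noise kernel, plus white noise, with the multivariate-Gaussian log-likelihood that Bayesian inference over both sets of parameters would be built on:

```julia
using Distributions, LinearAlgebra, Random

Random.seed!(42)

# Deterministic "mean model" (illustrative: a sinusoid with amplitude A, freq ω).
mean_model(t; A=1.0, ω=2π) = A .* sin.(ω .* t)

# Stationary kernel for the correlated "noise" component (Exponential here for
# simplicity; a harmonic-oscillator kernel would slot in the same way).
k(t1, t2; σ=0.3, λ=0.5) = σ^2 * exp(-abs(t1 - t2) / λ)

t = collect(range(0.0, 2.0; length=50))
σ_white = 0.1
Σ = Symmetric([k(a, b) for a in t, b in t] + σ_white^2 * I)  # correlated + white

y  = rand(MvNormal(mean_model(t), Σ))         # simulate one time series
ll = logpdf(MvNormal(mean_model(t), Σ), y)    # log-likelihood for inference
```

Inference would then vary the mean-model parameters (A, ω) and kernel parameters (σ, λ) jointly under this likelihood.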