How to use ParetoSmooth.jl

It looks like ParetoSmooth.jl is fairly new so the documentation is lacking examples right now. I’m wondering if someone with experience using the package can assist. Here’s a simple MWE:

using LinearAlgebra
using ParetoSmooth
using StatsBase
using Turing

X = rand(1.0:100.0, 30)
Y = @. 3.0 + 2.0 * X + rand(Normal(0, 10))

@model function test(X, Y, σᵦ)
    σ ~ Exponential(1)
    α ~ Normal(0, 1)
    β ~ Normal(0, σᵦ)
    μ = @. α + β * X
    Y ~ MvNormal(μ, σ^2 * I)  # isotropic covariance; the scalar-σ MvNormal constructor is deprecated
end

Xₜ = standardize(ZScoreTransform, X)
Yₜ = standardize(ZScoreTransform, Y)

ch = sample(test(Xₜ, Yₜ, 1), IS(), 1_000) # Assuming I need to use IS() here

So I have my MCMC chain stored as ch. Now what? I’d like to calculate, for example, the leave-one-out cross validation score via psis_loo (documentation for which is here). How do I get the log_likelihood matrix?


LOO is completely independent of the inference method, so no need to use IS() here. In fact, IS() seems to have some problems, so let’s use NUTS() instead.

You can retrieve the log-likelihood matrix either with Turing.pointwise_loglikelihoods or with ParetoSmooth.pointwise_log_likelihoods, which formats the matrix as ParetoSmooth needs it. But there’s also a convenience function that computes the log-likelihoods for you:

julia> ch = sample(test(Xₜ, Yₜ, 1), NUTS(), MCMCThreads(), 1_000, 4);
┌ Info: Found initial step size
└   ϵ = 0.21250000000000002
┌ Info: Found initial step size
└   ϵ = 0.30000000000000004
┌ Info: Found initial step size
└   ϵ = 0.225
┌ Info: Found initial step size
└   ϵ = 0.0125
Sampling (4 threads) 100%|█████████████████████████████████████████████████████████████████| Time: 0:00:00

julia> psis_loo(test(Xₜ, Yₜ, 1), ch)
[ Info: No source provided for samples; variables are assumed to be from a Markov Chain. If the samples are independent, specify this with keyword argument `source=:other`.
┌ Warning: Some Pareto k values are extremely high (>1). PSIS will not produce consistent estimates.
└ @ ParetoSmooth ~/.julia/packages/ParetoSmooth/A6x5U/src/InternalHelpers.jl:43
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 1 data points. Total Monte Carlo SE of 0.37.
┌───────────┬───────┬──────────┬───────┬─────────┐
│           │ total │ se_total │  mean │ se_mean │
├───────────┼───────┼──────────┼───────┼─────────┤
│   cv_elpd │  5.65 │      NaN │  5.65 │     NaN │
│ naive_lpd │  8.19 │      NaN │  8.19 │     NaN │
│     p_eff │  2.53 │      NaN │  2.53 │     NaN │
└───────────┴───────┴──────────┴───────┴─────────┘
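If you do want the explicit log-likelihood route, here is a minimal sketch. It assumes (per the ParetoSmooth docs) that `pointwise_log_likelihoods` returns a 3-D array laid out as (data points, draws, chains) and that `psis_loo` accepts such an array directly; the toy array below only illustrates the shape:

```julia
# Toy stand-in for the array that
# ParetoSmooth.pointwise_log_likelihoods(test(Xₜ, Yₜ, 1), ch)
# would produce: dimensions are (data points, draws, chains).
n_data, n_draws, n_chains = 30, 1000, 4
ll = randn(n_data, n_draws, n_chains) .- 2.0  # fake pointwise log-likelihoods

size(ll)  # (30, 1000, 4)

# With a real log-likelihood array you would then call psis_loo(ll) directly.
```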

Hi Seth,

Thanks so much for taking the time to help me out with this - I really appreciate it! :pray: :smiley:

I still have a doubt about the values in the result, though. Since we are estimating model error, I would expect smaller values to be better, but the output of the following example is the opposite of what I expect:

using DataFrames
using LinearAlgebra
using MLBase
using ParetoSmooth
using StatsBase
using StatsPlots
using Turing

X = rand(1.0:100.0, 30)
# True data-generating model is second-order polynomial!
Y = @. 3.0 + 2.0 * X + 0.05 * X^2 + rand(Normal(0, 40))
Xₜ = StatsBase.standardize(ZScoreTransform, X)
Yₜ = StatsBase.standardize(ZScoreTransform, Y)

@model function test(X, Y, o) # 'o' is for 'order'
    σ ~ Exponential(1)
    α ~ Normal(0, 1)
    β ~ MvNormal(zeros(o), I)  # o standard-normal coefficients; MvNormal(o, 1) is deprecated
    μ = α .+ sum(β .* [X.^i for i in 1:o])
    Y ~ MvNormal(μ, σ^2 * I)   # isotropic covariance; scalar-σ MvNormal is deprecated
end

# PSIS_LOO
score1 = psis_loo(test(Xₜ, Yₜ, 1), sample(test(Xₜ, Yₜ, 1), NUTS(), 1_000))
score2 = psis_loo(test(Xₜ, Yₜ, 2), sample(test(Xₜ, Yₜ, 2), NUTS(), 1_000))
score3 = psis_loo(test(Xₜ, Yₜ, 3), sample(test(Xₜ, Yₜ, 3), NUTS(), 1_000))
score4 = psis_loo(test(Xₜ, Yₜ, 4), sample(test(Xₜ, Yₜ, 4), NUTS(), 1_000))

scores = [score1, score2, score3, score4]

plot(
    1:4,
    # column 1: the :cv_elpd totals; column 2: the :naive_lpd totals
    [[s.estimates[1, 1] for s in scores] [s.estimates[2, 1] for s in scores]],
    labels=["Out-of-Sample" "In-Sample"],
    legend=:topleft,
    marker=:o
)

Output:

[plot: psis_example1]

If I roll my own LOOCV function and then plot, it looks like this (result is what I expect for this example):

data = DataFrame(X=Xₜ,Y=Yₜ)

# leave-one-out cross validation using MLBase.LOOCV
function loocv(o)
    N = size(data, 1)
    SE = Float64[]
    for s in LOOCV(N)  # s = training indices; one point is held out each time
        chain = DataFrame(sample(test(data.X[s], data.Y[s], o), NUTS(), 1_000))
        x = data.X[setdiff(1:N, s)][1]  # the held-out point
        y = data.Y[setdiff(1:N, s)][1]
        ŷ = mean([row.α + sum([row[Symbol("β[$i]")] * x^i for i in 1:o]) for row in eachrow(chain)])
        push!(SE, (ŷ - y)^2)
    end
    return sum(SE)
end

order = 4
err = [loocv(o) for o in 1:order]
plot(1:order, err, legend=false, marker=:o)

Output:

[plot: loocv_example]

Finally, if I multiply the results from PsisLoo by negative 1, I get something that looks like what I would expect:

scores = [score1, score2, score3, score4]

plot(
    1:4,
    # negate so that smaller is better, matching the hand-rolled LOOCV plot
    [(-1 .* [s.estimates[1, 1] for s in scores]) (-1 .* [s.estimates[2, 1] for s in scores])],
    labels=["Out-of-Sample" "In-Sample"],
    legend=:topleft,
    marker=:o
)

Output:

[plot: psis_example2]

So, this was just a really long way to ask if (relative to other models) bigger values in :cv_elpd and :naive_lpd are better :slightly_smiling_face:

Ah, I see the source of confusion. ParetoSmooth gives the following descriptions:

These are somewhat misleading, because ELPD is a measure of predictive accuracy, not error, with a specific definition of accuracy: “lpd” stands for “log predictive density”, and “elpd” for “expected log predictive density”. Generally we are interested in the ELPD: if a data point i is left out, its ELPD is the log-likelihood of that point averaged over the posterior we would get by conditioning only on the other data points. This is what LOO estimates, and we want it to be maximized.

So, higher ELPD is better. You can check yourself that LOO estimates this by actually fitting the LOO posteriors, holding out each point i in turn and computing its expected log-likelihood.
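As a concrete illustration (a Base-only sketch with made-up numbers, not ParetoSmooth’s implementation): given a matrix of pointwise log-likelihoods with one row per data point and one column per posterior draw, the naive lpd is the sum over points of the log of the average likelihood across draws:

```julia
# Toy log-likelihood matrix: 2 data points × 3 posterior draws (made-up values).
ll = [-1.2 -0.9 -1.1;
      -2.0 -1.8 -2.2]

# lpd_i = log( (1/S) * Σ_s exp(ll[i, s]) ): log of the posterior-averaged likelihood.
lpd(row) = log(sum(exp, row) / length(row))

naive_lpd = sum(lpd(ll[i, :]) for i in axes(ll, 1))
# For LOO you would instead average over draws from the posterior that excludes
# point i; PSIS approximates this by reweighting the full-posterior draws.
```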

You should also look at p_eff (the effective number of parameters), which in general should not exceed the actual number of parameters in the model (if it does, that can indicate misspecification). So while the ELPD difference between the 3rd-order and 2nd-order polynomials is small (a difference of less than 4 is generally considered small), p_eff increases from under 3 to over 4.


Fantastic. This is amazing, thanks again!

You’re welcome!

A bit late to the party here, but how would you get a meaningful comparison between the LOO ELPD scores of two different models fit to the same data, especially when no standard error is reported?

Section 5 of [1507.04544] Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC will likely be helpful for you. As that paper explains, the standard error of the ELPD is computed by treating the ELPD as a sum across data points. So in cases like this, where the likelihood is a single multivariate term (i.e. each log-density evaluation contributes one data point), the standard error is 0 (ParetoSmooth returns it as NaN). With that in mind, you can compare the two models’ ELPDs directly, using the principles in Cross-validation FAQ • loo.


Oh perfect, thanks! That paper was exactly what I needed.

I’ll probably try to work out my own version of a LOO ELPD to make sure I completely understand it, but I have one initial question after a quick read of that paper. In this example our data has 30 independent observations. If the elpd_loo returned from psis_loo is defined as in the paper, then is that value not the sum of the \hat{elpd}_{loo,i} calculated for each i of those 30 data points? Should there not be a standard error like se(\hat{elpd}_{loo}) = \sqrt{30 \, V_{i=1}^{30} \, \hat{elpd}_{loo,i}}?
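That formula can be sketched in Base Julia with hypothetical pointwise values: treating elpd_loo as a sum of n pointwise terms, its standard error is sqrt(n) times the standard deviation of those terms.

```julia
# Toy pointwise elpd values (made up), one per held-out data point.
elpd_i = [-0.8, -1.1, -0.9, -1.3]
n = length(elpd_i)

m = sum(elpd_i) / n                            # mean of the pointwise terms
v = sum((x - m)^2 for x in elpd_i) / (n - 1)   # sample variance V_{i=1}^n
se = sqrt(n * v)                               # se(elpd_loo) = sqrt(n * V)

elpd_loo = sum(elpd_i)                         # the total that psis_loo reports
```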

Yes, I believe that’s correct. Are you seeing something different in what ParetoSmooth returns?