Multi-level varying slopes with two clusters, cross classification

Hello, I am working through the Statistical Rethinking book, specifically the multi-level varying-slopes model with two clusters (cross-classification).


I ran the Stan model, which finished sampling in about two minutes.

To build the model with Turing.jl, I referred to StatisticalRethinkingJulia/TuringModels.jl; however, the sampler shows an ETA of 5 days, so I am not sure which part I got wrong. I also followed the Turing Performance Tips to make sure the types are stable, but an ETA of 5 days compared to Stan's 2 minutes indicates something is wrong with my model. I spent a few hours tweaking the model specification, but it did not help at all.

Stan code saved in “varyingSlopes.stan”

data{
    int L[504];
    int block_id[504];
    int actor[504];
    int tid[504];
}
parameters{
    vector[4] alpha[7];
    vector[4] beta[6];
    vector[4] g;
    corr_matrix[4] Rho_actor;
    corr_matrix[4] Rho_block;
    vector<lower=0>[4] sigma_actor;
    vector<lower=0>[4] sigma_block;
}
model{
    vector[504] p;
    sigma_block ~ exponential( 1 );
    sigma_actor ~ exponential( 1 );
    Rho_block ~ lkj_corr( 4 );
    Rho_actor ~ lkj_corr( 4 );
    g ~ normal( 0 , 1 );
    beta ~ multi_normal( rep_vector(0,4) , quad_form_diag(Rho_block , sigma_block) );
    alpha ~ multi_normal( rep_vector(0,4) , quad_form_diag(Rho_actor , sigma_actor) );
    for ( i in 1:504 ) {
        p[i] = g[tid[i]] + alpha[actor[i], tid[i]] + beta[block_id[i], tid[i]];
        p[i] = inv_logit(p[i]);
    }
    L ~ binomial( 1 , p );
}

To sample the model in R

library(rethinking)
data(chimpanzees)
d = chimpanzees
d$block_id = d$block
d$treatment = 1L + d$prosoc_left + 2L*d$condition

dat = list(
    L = d$pulled_left,
    tid = d$treatment,
    actor = d$actor,
    block_id = as.integer(d$block_id) )

varyingSlopes = stan_model("varyingSlopes.stan")

fit = sampling(varyingSlopes, data = dat, seed = 803214053, control = list(adapt_delta = 0.9))

The model finished sampling in about 2 minutes.

The Turing model, adapted from StatisticalRethinkingJulia/TuringModels.jl:

# ## Data

using Pkg;
Pkg.activate(".");

import CSV;
import TuringModels;

using DataFrames;

data_path = joinpath(TuringModels.project_root, "data", "chimpanzees.csv");
df = CSV.read(data_path, DataFrame; delim=';');

df.block_id = df.block;
df.treatment = 1 .+ df.prosoc_left .+ 2*df.condition;

# ## Model

using Turing;

@model function gdemo(actor, block, treatment, pulled_left, ::Type{TV}=Matrix{Float64}) where {TV}
    ## fixed priors
    gamma ~ filldist(Normal(0, 1),4)
    Rho_block ~ LKJ(4,2)
    Rho_actor ~ LKJ(4,2)
    sigma_block ~ filldist(truncated(Cauchy(0,2),0,Inf), 4)
    sigma_actor ~ filldist(truncated(Cauchy(0,2),0,Inf), 4)

    ## adaptive priors
    Sigma_block = sigma_block .* Rho_block .* sigma_block'
    Sigma_block = (Sigma_block'+Sigma_block)/2
    Sigma_actor = sigma_actor .* Rho_actor .* sigma_actor'
    Sigma_actor = (Sigma_actor'+Sigma_actor)/2

    β = TV(undef, (4, 6))
    for i in 1:size(β, 2)
        β[:, i] ~ MvNormal(zeros(4), Sigma_block)
    end
    α = TV(undef, (4, 7))
    for i in 1:size(α, 2)
        α[:, i] ~ MvNormal(zeros(4), Sigma_actor)
    end

    logit_p = gamma[treatment] .+ α[treatment,actor] .+ β[treatment,block]
    pulled_left .~ BinomialLogit.(1, logit_p)
end;

## Output
julia> chns = sample(
           gdemo(df.actor, df.block_id, df.treatment, df.pulled_left),
           NUTS(0.95),
           1000
       )
┌ Info: Found initial step size
└   ϵ = 0.0125

Sampling   1%|▍                                 |  ETA: 5 days, 12:21:51

Hmm, I do see two for-loops that might not be strictly necessary, but at the same time they only iterate a handful of times.
Regardless, since each iteration just draws from the same distribution (there is no per-index computation in the loop body), filldist should work here.
Something like:

α ~ filldist(MvNormal(zeros(4), Sigma_actor), 7)

Additionally, I'm not quite sure, but is logit_p really supposed to be a 504×504 matrix? If not, that might be an issue.
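A quick way to check, outside the model (the sizes just mirror the chimpanzees data, so treat this as a sketch):

α = randn(4, 7)
treatment, actor = rand(1:4, 504), rand(1:7, 504)
# indexing a matrix with two length-504 vectors returns a 504×504 matrix,
# one entry per index pair, rather than 504 elementwise lookups
size(α[treatment, actor])  # (504, 504)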


With so many parameters you should use ReverseDiff like so:

using ReverseDiff, Memoization
Turing.setadbackend(:reversediff)
Turing.setrdcache(true)

A few things. First, your Stan and Turing models aren’t the same. It looks like you’re using a different prior for the sigmas and the rhos than the Stan model.

Then, as pointed out by @michielver, the loops can be eliminated by using filldist. Additionally, the computation of logit_p allocates 4 vectors, but this could be done with just one allocation.

Lastly, you’re passing a higher adapt_delta of 0.95 to the Turing model; higher adapt_deltas cause slower exploration. adapt_deltas that high are really conservative and usually only necessary when diagnosing sampling problems. Turing defaults to 0.65, which may provide faster exploration but at the risk of more divergences, while Stan defaults to 0.8. Probably anything between 0.7 and 0.8 is fine.
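For concreteness, this is the knob in question; 0.8 below is only an illustrative value, and in the calls above it shows up as NUTS(0.95) and NUTS(1_000, 0.9):

sampler = NUTS(1_000, 0.8)  # 1_000 adaptation steps, target acceptance (adapt_delta) 0.8
# in Stan/rstan this is the control = list(adapt_delta = 0.8) argument to sampling()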

I tested a few variants, and the below example, which includes the above suggestions, runs on my old-ish machine in ~15 minutes. This is slower than Stan for this example, but within the same order of magnitude.

using CSV, TuringModels, DataFrames, Turing, LinearAlgebra, Random, ReverseDiff, Memoization
Turing.setadbackend(:reversediff)
Turing.setrdcache(true)

data_path = joinpath(TuringModels.project_root, "data", "chimpanzees.csv");
df = CSV.read(data_path, DataFrame; delim=';');

df.block_id = df.block;
df.treatment = 1 .+ df.prosoc_left .+ 2*df.condition;

quad_form_diag(M, v) = Symmetric((v .* v') .* (M .+ M') ./ 2)

@model function gdemo(actor, block, treatment, pulled_left)
    σ_block ~ filldist(Exponential(1), 4)
    σ_actor ~ filldist(Exponential(1), 4)
    ρ_block ~ LKJ(4, 4)
    ρ_actor ~ LKJ(4, 4)
    g ~ MvNormal(4, 1)
    Σ_block = quad_form_diag(ρ_block, σ_block)
    Σ_actor = quad_form_diag(ρ_actor, σ_actor)
    β ~ filldist(MvNormal(Σ_block), 6)
    α ~ filldist(MvNormal(Σ_actor), 7)
    logit_p = broadcast(treatment, actor, block) do t, a, b
        return @inbounds g[t] + α[t, a] + β[t, b]
    end
    pulled_left .~ BinomialLogit.(1, logit_p)
end

rng = MersenneTwister(3702)
@time chns = sample(
    rng,
    gdemo(df.actor, df.block_id, df.treatment, df.pulled_left),
    NUTS(1_000, 0.9),
    1_000
);

It would be interesting to know why Stan is faster. They have certainly had a lot of time to perfect the sampler. As far as I understand, Stan's NUTS has accumulated so many improvements and tweaks that it can hardly be compared with vanilla NUTS anymore. I don't really know the Turing implementation, but I assume it at least started from vanilla NUTS? This could be one of the major reasons for the slowdown.


Perhaps a Turing dev could give a more specific answer, but I can see a few possible reasons. First, the Stan model may be completely non-allocating. e.g. their quad_form_diag may allow for computation of the multivariate normal density without computing the covariance matrices. Second, Stan and Turing (via AdvancedHMC.jl) should be using algorithmically equivalent adaptive HMC (aka NUTS) implementations, but implementation details can have a big impact, and each seed will yield different runtimes. But third, and perhaps more importantly, there are design trade-offs between Turing and Stan. Turing is designed so that arbitrary Julia code can be used in the model, and, while there are some Distributions-specific automatic differentiation (AD) rules, note that Distributions and ReverseDiff are general-purpose packages designed to work across the Julia ecosystem, not just for Turing. By having a smaller ecosystem, Stan developers are able to highly optimize every function and their AD rules, so in general, I would expect a model written in Stan to fit faster than an exactly equivalent Turing model. This can of course change; performance improvements to the HMC implementations (e.g. making the sampling almost non-allocating) and to the AD rules are almost certainly possible.

I completely agree that it would be good to know specific reasons why equivalent models are slower in Turing and work on improving the runtimes. 15 minutes to fit a model is a very different experience than 2 minutes.


Thanks @sethaxen @michielver @bdeonovic. I spent the whole weekend tweaking something like 15 variants of the model but nothing worked, and I was stuck on this part of the chapter for many hours. Finally I got my model working! I get a similar sampling time of about 12 minutes, compared to Stan's roughly 2 minutes, so at least the efficiency is now within the same order of magnitude. I learned many things from you all! In particular,

using ReverseDiff, Memoization
Turing.setadbackend(:reversediff)
Turing.setrdcache(true)
α ~ filldist(MvNormal(zeros(4), Sigma_actor), 7)
# alternatively
α ~ filldist(MvNormal(Sigma_actor), 7)

Thanks @sethaxen for the comprehensive explanation. I noted all of your comments, including the priors and adapt_delta. I did try Turing's default adapt_delta = 0.65 and it speeds things up by about 2 minutes. Furthermore, it seems

quad_form_diag(M, v) = Symmetric((v .* v') .* (M .+ M') ./ 2)

also helps to speed things up a lot, though I have no idea what this line does. I just copied and pasted it. And based on your advice

computation of logit_p allocates 4 vectors, but this could be done with just one allocation.

I see @inbounds in there, but I am not sure what this line does either. Again, I copied and pasted it into my model and it works. I need to read up on the Julia docs.

logit_p = broadcast(treatment, actor, block) do t, a, b
    return @inbounds g[t] + α[t, a] + β[t, b]
end

I am also constructing a non-centered version of the model above, using a Cholesky factorisation to build up the variance-covariance matrix. Below is a summary of the data and the key parts of the model I am trying to implement; only some parts of the full model are shown.

# Summary of key parts of the Turing model; it cannot be run as-is

n_treatment, n_actor, n_block = 4, 7, 6
# adaptive NON-CENTERED priors
ρ_actor ~ LKJ(n_treatment, 2)
σ_actor ~ filldist(Exponential(1), n_treatment)

# I have no idea what this line does. I just copied it from the TuringModels.jl model above
ρ_actor = (ρ_actor' + ρ_actor) / 2

# Cholesky factor
L_actor = cholesky(ρ_actor).L

# Standardised prior
z_actor ~ filldist(Normal(0, 1), n_treatment, n_actor)

α = σ_actor .* L_actor * z_actor # matrix 4 x 7

# The linear model
α[t,a] + # more terms not shown

# likelihood
pulled_left .~ BinomialLogit.(1, logit_p)

The Stan model, sampling 4 chains, finished in about 6 seconds per chain (total = warm-up + sampling). Summarized elapsed times:

Chain 1:  Elapsed Time: 3.27789 seconds (Warm-up)
Chain 1:                2.63135 seconds (Sampling)
Chain 1:                5.90924 seconds (Total)

Chain 2:  Elapsed Time: 3.27674 seconds (Warm-up)
Chain 2:                2.77611 seconds (Sampling)
Chain 2:                6.05285 seconds (Total)

Chain 3:  Elapsed Time: 3.44 seconds (Warm-up)
Chain 3:                2.66887 seconds (Sampling)
Chain 3:                6.10887 seconds (Total)

Chain 4:  Elapsed Time: 3.4264 seconds (Warm-up)
Chain 4:                2.63844 seconds (Sampling)
Chain 4:                6.06484 seconds (Total)

Stan model

Stan code saved in “nonCentered.stan”

data{
    int L[504];
    int block_id[504];
    int actor[504];
    int tid[504];
}
parameters{
    matrix[4,7] z_actor;
    matrix[4,6] z_block;
    vector[4] g;
    cholesky_factor_corr[4] L_Rho_actor;
    cholesky_factor_corr[4] L_Rho_block;
    vector<lower=0>[4] sigma_actor;
    vector<lower=0>[4] sigma_block;
}
transformed parameters{
    matrix[7,4] alpha;
    matrix[6,4] beta;
    beta = (diag_pre_multiply(sigma_block, L_Rho_block) * z_block)';
    alpha = (diag_pre_multiply(sigma_actor, L_Rho_actor) * z_actor)';
}
model{
    vector[504] p;
    sigma_block ~ exponential( 1 );
    sigma_actor ~ exponential( 1 );
    L_Rho_block ~ lkj_corr_cholesky( 2 );
    L_Rho_actor ~ lkj_corr_cholesky( 2 );
    g ~ normal( 0 , 1 );
    to_vector( z_block ) ~ normal( 0 , 1 );
    to_vector( z_actor ) ~ normal( 0 , 1 );
    for ( i in 1:504 ) {
        p[i] = g[tid[i]] + alpha[actor[i], tid[i]] + beta[block_id[i], tid[i]];
        p[i] = inv_logit(p[i]);
    }
    L ~ binomial( 1 , p );
}
generated quantities{
    vector[504] log_lik;
    vector[504] p;
    matrix[4,4] Rho_actor;
    matrix[4,4] Rho_block;
    Rho_block = multiply_lower_tri_self_transpose(L_Rho_block);
    Rho_actor = multiply_lower_tri_self_transpose(L_Rho_actor);
    for ( i in 1:504 ) {
        p[i] = g[tid[i]] + alpha[actor[i], tid[i]] + beta[block_id[i], tid[i]];
        p[i] = inv_logit(p[i]);
    }
    for ( i in 1:504 ) log_lik[i] = binomial_lpmf( L[i] | 1 , p[i] );
}

Sampling Stan model

library(rethinking)
data(chimpanzees)
d = chimpanzees
d$block_id = d$block
d$treatment = 1L + d$prosoc_left + 2L*d$condition

dat = list(
    L = d$pulled_left,
    tid = d$treatment,
    actor = d$actor,
    block_id = as.integer(d$block_id) )

nonCentered = stan_model("nonCentered.stan")

fit = sampling(nonCentered, data = dat, seed = 803214053, control = list(adapt_delta = 0.9))




SAMPLING FOR MODEL 'nonCentered' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.000133 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.33 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1:
Chain 1:  Elapsed Time: 3.27789 seconds (Warm-up)
Chain 1:                2.63135 seconds (Sampling)
Chain 1:                5.90924 seconds (Total)
Chain 1:

SAMPLING FOR MODEL 'nonCentered' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 9.7e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.97 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2:
Chain 2:  Elapsed Time: 3.27674 seconds (Warm-up)
Chain 2:                2.77611 seconds (Sampling)
Chain 2:                6.05285 seconds (Total)
Chain 2:

SAMPLING FOR MODEL 'nonCentered' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 9.7e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.97 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3:
Chain 3:  Elapsed Time: 3.44 seconds (Warm-up)
Chain 3:                2.66887 seconds (Sampling)
Chain 3:                6.10887 seconds (Total)
Chain 3:

SAMPLING FOR MODEL 'nonCentered' NOW (CHAIN 4).
Chain 4:
Chain 4: Gradient evaluation took 0.000111 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 1.11 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4:
Chain 4:  Elapsed Time: 3.4264 seconds (Warm-up)
Chain 4:                2.63844 seconds (Sampling)
Chain 4:                6.06484 seconds (Total)
Chain 4:

Turing’s reference model

To implement it in Turing, I found a similar model here StatisticalRethinkingJulia/TuringModels.jl - non-centered chimpanzees.md.

using TuringModels, CSV, DataFrames
using LinearAlgebra

# This script requires latest LKJ bijectors support.
# `] add Bijectors#master` to get latest Bijectors.

data_path = joinpath(@__DIR__, "..", "..", "data", "chimpanzees.csv")
delim = ";"
d = CSV.read(data_path, DataFrame; delim)

d.block_id = d.block

# m13.6nc1 is equivalent to m13.6nc in the following Turing model

@model m13_6_nc(actor, block_id, condition, prosoc_left, pulled_left) = begin
    # fixed priors
    Rho_block ~ LKJ(3, 4.)
    Rho_actor ~ LKJ(3, 4.)
    sigma_block ~ filldist(truncated(Cauchy(0, 2), 0, Inf), 3)
    sigma_actor ~ filldist(truncated(Cauchy(0, 2), 0, Inf), 3)
    a ~ Normal(0, 1)
    bp ~ Normal(0, 1)
    bpc ~ Normal(0, 1)

    # adaptive NON-CENTERED priors
    Rho_block = (Rho_block' + Rho_block) / 2
    Rho_actor = (Rho_actor' + Rho_actor) / 2

    L_Rho_block = cholesky(Rho_block).L
    L_Rho_actor = cholesky(Rho_actor).L

    z_N_block ~ filldist(Normal(0, 1), 3, 6)
    z_N_actor ~ filldist(Normal(0, 1), 3, 7)

    # @show size(L_Rho_block) size(sigma_block) size(z_N_block_id)

    a_block_bp_block_bpc_block = sigma_block .* L_Rho_block * z_N_block
    a_actor_bp_actor_bpc_actor = sigma_actor .* L_Rho_actor * z_N_actor

    a_block = a_block_bp_block_bpc_block[1, :]
    bp_block = a_block_bp_block_bpc_block[2, :]
    bpc_block = a_block_bp_block_bpc_block[3, :]
    a_actor = a_actor_bp_actor_bpc_actor[1, :]
    bp_actor = a_actor_bp_actor_bpc_actor[2, :]
    bpc_actor = a_actor_bp_actor_bpc_actor[3, :]

    # linear models
    BPC = bpc .+ bpc_actor[actor] + bpc_block[block_id]
    BP = bp .+ bp_actor[actor] + bp_block[block_id]
    A = a .+ a_actor[actor] + a_block[block_id]
    logit_p = A + (BP + BPC .* condition) .* prosoc_left

    # likelihood
    pulled_left .~ BinomialLogit.(1, logit_p)
end

chns = sample(
    m13_6_nc(d.actor, d.block_id, d.condition, d.prosoc_left, d.pulled_left),
    Turing.NUTS(0.95),
    1000
)

The model I implemented to replicate the Stan model above

To implement the Stan model, I also referred to @sethaxen 's implementation above and the TuringModels.jl model. Again, I coded up more than 10 variants, but ran into the same issues.

using TuringModels, CSV, DataFrames, Turing, LinearAlgebra, Random, ReverseDiff, Memoization

Turing.setadbackend(:reversediff)
Turing.setrdcache(true)

data_path = joinpath(TuringModels.project_root, "data", "chimpanzees.csv");
df = CSV.read(data_path, DataFrame; delim=';');

df.block_id = df.block;
df.treatment = 1 .+ df.prosoc_left .+ 2*df.condition;

@model function nonCentered(actor, block, treatment, pulled_left)
    n_treatment, n_actor, n_block = 4, 7, 6

    # fixed priors
    g ~ MvNormal(n_treatment, 1)
    ρ_block ~ LKJ(n_treatment, 2)
    ρ_actor ~ LKJ(n_treatment, 2)
    σ_block ~ filldist(Exponential(1), n_treatment)
    σ_actor ~ filldist(Exponential(1), n_treatment)

    # adaptive NON-CENTERED priors
    # I have no idea what these two lines do. I just copied them from the TuringModels.jl model above
    ρ_block = (ρ_block' + ρ_block) / 2
    ρ_actor = (ρ_actor' + ρ_actor) / 2

    # Cholesky factor
    L_block = cholesky(ρ_block).L
    L_actor = cholesky(ρ_actor).L

    # Standardised prior
    z_block ~ filldist(Normal(0, 1), n_treatment, n_block)
    z_actor ~ filldist(Normal(0, 1), n_treatment, n_actor)

    # Again, copied from the model of TuringModels.jl above
    β = σ_block .* L_block * z_block # matrix 4 x 6
    α = σ_actor .* L_actor * z_actor # matrix 4 x 7

    # Use the code from @sethaxen
    logit_p = broadcast(treatment, actor, block) do t, a, b
        return @inbounds g[t] + α[t,a] + β[t,b]
    end
    pulled_left .~ BinomialLogit.(1, logit_p)
end

rng = MersenneTwister(3702)
@time chns = sample(
    rng,
    nonCentered(df.actor, df.block_id, df.treatment, df.pulled_left),
    NUTS(),  # Use Turing's default settings
    1_000
);

julia> @time chns = sample(
           rng,
           nonCentered(df.actor, df.block_id, df.treatment, df.pulled_left),
           NUTS(),  # Use Turing's default settings
           1_000
       );
┌ Info: Found initial step size
└   ϵ = 0.2
┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC ~/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47
┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC ~/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47
Sampling 100%|██████████████████████████████████| Time: 0:01:30
ERROR: DomainError with -5.831779503751022e-11:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
  [1] sqrt
    @ ./math.jl:582 [inlined]
  [2] sqrt
    @ ~/.julia/packages/ForwardDiff/QOqCN/src/dual.jl:203 [inlined]
  [3] derivative!
    @ ~/.julia/packages/ForwardDiff/QOqCN/src/derivative.jl:46 [inlined]
  [4] unary_scalar_forward_exec!(f::typeof(sqrt), output::ReverseDiff.TrackedReal{Float64, Float64, Nothing}, input::ReverseDiff.TrackedReal{Float64, Float64, Nothing}, cache::Base.RefValue{Float64})
    @ ReverseDiff ~/.julia/packages/ReverseDiff/E4Tzn/src/derivatives/scalars.jl:91
  [5] scalar_forward_exec!(instruction::ReverseDiff.ScalarInstruction{typeof(sqrt), ReverseDiff.TrackedReal{Float64, Float64, Nothing}, ReverseDiff.TrackedReal{Float64, Float64, Nothing}, Base.RefValue{Float64}})
    @ ReverseDiff ~/.julia/packages/ReverseDiff/E4Tzn/src/derivatives/scalars.jl:81
  [6] forward_exec!(instruction::ReverseDiff.ScalarInstruction{typeof(sqrt), ReverseDiff.TrackedReal{Float64, Float64, Nothing}, ReverseDiff.TrackedReal{Float64, Float64, Nothing}, Base.RefValue{Float64}})
    @ ReverseDiff ~/.julia/packages/ReverseDiff/E4Tzn/src/tape.jl:82
  [7] ForwardExecutor
    @ ~/.julia/packages/ReverseDiff/E4Tzn/src/api/tape.jl:76 [inlined]
  [8] (::FunctionWrappers.CallWrapper{Nothing})(f::ReverseDiff.ForwardExecutor{ReverseDiff.ScalarInstruction{typeof(sqrt), ReverseDiff.TrackedReal{Float64, Float64, Nothing}, ReverseDiff.TrackedReal{Float64, Float64, Nothing}, Base.RefValue{Float64}}})
    @ FunctionWrappers ~/.julia/packages/FunctionWrappers/8xdVB/src/FunctionWrappers.jl:58
  [9] macro expansion
    @ ~/.julia/packages/FunctionWrappers/8xdVB/src/FunctionWrappers.jl:130 [inlined]
 [10] do_ccall
    @ ~/.julia/packages/FunctionWrappers/8xdVB/src/FunctionWrappers.jl:118 [inlined]
 [11] FunctionWrapper
    @ ~/.julia/packages/FunctionWrappers/8xdVB/src/FunctionWrappers.jl:137 [inlined]
 [12] forward_pass!(compiled_tape::ReverseDiff.CompiledTape{ReverseDiff.GradientTape{Turing.Core.var"#f#26"{DynamicPPL.TypedVarInfo{NamedTuple{(:g, :ρ_block, :ρ_actor, :σ_block, :σ_actor, :z_block, :z_actor), Tuple{DynamicPPL.Metadata{Dict{AbstractPPL.VarName{:g, Tuple{}}, Int64}, Vector{


Comparing Stan with my Turing implementation, I noted a few things that might be causing the issue.

Stan has lkj_corr_cholesky

Stan’s implementation of lkj_corr_cholesky

As per 24 Correlation Matrix Distributions | Stan Functions Reference

The correlation matrix distributions have support on the (Cholesky factors of) correlation matrices. A Cholesky factor L for a K × K correlation matrix Σ of dimension K has rows of unit length so that the diagonal of L L ⊤ is the unit K -vector. Even though models are usually conceptualized in terms of correlation matrices, it is better to operationalize them in terms of their Cholesky factors.

As per 24.1 LKJ Correlation Distribution | Stan Functions Reference, with reference to LKJ distribution (without cholesky factorisation),

However, it is much better computationally to work directly with the Cholesky factor of Σ, so this distribution should never be explicitly used in practice.

As per 24.2 Cholesky LKJ Correlation Distribution | Stan Functions Reference

Stan provides an implicit parameterization of the LKJ correlation matrix density in terms of its Cholesky factor, which you should use rather than the explicit parameterization in the previous section. For example, if L is a Cholesky factor of a correlation matrix, then
L ~ lkj_corr_cholesky(2.0); # implies L * L' ~ lkj_corr(2.0);

parameters{
    .............
    cholesky_factor_corr[4] L_Rho_block;
}
model{
    ...............
    L_Rho_actor ~ lkj_corr_cholesky( 2 );
}

Compared to Turing’s

ρ_actor ~ LKJ(n_treatment, 2)

Building the covariance matrix

I am not sure if my implementation is correct. This is Stan's version:

transformed parameters{
    ............
    alpha = (diag_pre_multiply(sigma_actor, L_Rho_actor) * z_actor)';
}

Compared to my Turing version:

# I have no idea what this line does. I just copied it from the TuringModels.jl model above
ρ_actor = (ρ_actor' + ρ_actor) / 2

# Cholesky factor
L_actor = cholesky(ρ_actor).L

# Standardised prior
z_actor ~ filldist(Normal(0, 1), n_treatment, n_actor)

α = σ_actor .* L_actor * z_actor # matrix 4 x 7

Final attempt before I made this post

Given the points above, I suspect lkj_corr_cholesky is the issue. Is there really no such thing in Julia or Turing?

So I bumped up the LKJ prior to be very restrictive, even though it seems like the wrong thing to do. I changed the LKJ priors from these

ρ_block ~ LKJ(n_treatment, 2)
ρ_actor ~ LKJ(n_treatment, 2)

to these

ρ_block ~ LKJ(n_treatment, 1_000)
ρ_actor ~ LKJ(n_treatment, 1_000)

then sample again

julia> @time chns = sample(
           rng,
           nonCentered(df.actor, df.block_id, df.treatment, df.pulled_left),
           NUTS(),  # Use Turing's default settings
           1_000
       );
┌ Info: Found initial step size
└   ϵ = 0.025
┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC ~/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47
┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC ~/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47
Sampling 100%|██████████████████████████████████| Time: 0:07:48
482.639091 seconds (161.88 M allocations: 30.543 GiB, 1.65% gc time, 4.60% compilation time)

Then the sampling works, but I have not validated the inferences. Changing η from 2 to 1000 just does not seem like the right thing to do. And compared to Stan's efficiency (about 6 seconds per chain), sampling 1 chain here takes about 7 minutes (not directly comparable, but still orders of magnitude slower). I also have no idea why changing η makes the sampling work. Does anybody have any idea?

LKJ Cholesky

Does Julia or Turing have an implementation of this? I could not find one. Both Stan and PyMC3 have implementations:

Stan 24.2 Cholesky LKJ Correlation Distribution | Stan Functions Reference

PyMC3: LKJ Cholesky Covariance Priors for Multivariate Normal Models — PyMC3 3.10.0 documentation

I see many similar Turing issues might be related to this for example

Tracker does not work with multilevel model · Issue #1549 · TuringLang/Turing.jl

(Generalized) Linear Mixed Model Tutorial · Issue #1518 · TuringLang/Turing.jl

How to model Covariance matrices? · Discussion #1526 · TuringLang/Turing.jl

Without an LKJ Cholesky it seems very hard to model covariance matrices for multivariate distributions in Turing.

So while waiting for Julia or Turing to implement something like LKJ_Cholesky, is there any workaround to sample the model above? Can I call Stan's lkj_corr_cholesky from Turing?

Edited: added some comments to the section "The model I implemented to replicate the Stan model above".

I am looking into defining a custom distribution in Turing as per Advanced Usage - How to Define a Customized Distribution. With reference to Stan's implementation of lkj_corr_cholesky, I found these:

stan-dev/math - lkj_corr_cholesky_log.hpp

stan-dev/math - lkj_corr_cholesky_rng.hpp

stan-dev/math - lkj_corr_cholesky_lpdf.hpp

But everything is written in C/C++, which I have not learned at all. It seems like I am stuck.

More information about centered parameterization

This is equivalent to

symmetrize(A) = Symmetric((A .+ A') ./ 2)
quad_form_diag(M, v) = symmetrize(Diagonal(v) * M * Diagonal(v))

but instead of 4 allocations, it does it all in one allocation. The symmetrize operation projects the matrix onto the nearest exactly symmetric matrix. This is only necessary because one of the functions within MvNormal (maybe cholesky) performs a very (perhaps overly) strict check for exact symmetry of the covariance matrix. If M is a correlation matrix and v is a vector of standard deviations, then quad_form_diag produces the covariance matrix. It may not be the best way to do it numerically.
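As a quick sanity check (with made-up values), the one-allocation version used in the model gives the same matrix as the definition above; this assumes symmetrize and quad_form_diag from that snippet are already defined, and quad_form_diag_fast is just a name for this sketch:

using LinearAlgebra
quad_form_diag_fast(M, v) = Symmetric((v .* v') .* (M .+ M') ./ 2)  # the version from the model
M = [1.0 0.3 0.1; 0.3 1.0 0.2; 0.1 0.2 1.0]  # a correlation matrix
v = [0.5, 1.0, 2.0]                          # standard deviations
quad_form_diag_fast(M, v) ≈ quad_form_diag(M, v)  # true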

If it helps, these two lines are more-or-less equivalent:

logit_p = broadcast(treatment, actor, block) do t, a, b
    return @inbounds g[t] + α[t, a] + β[t, b]
end
logit_p = @inbounds getindex.(Ref(g), treatment) .+ getindex.(Ref(α), treatment, actor) .+ getindex.(Ref(β), treatment, block)

but in my view the first is much easier to read. We could also do this with map or a for loop. I didn’t check map, but the broadcasted version was faster than the for loop. Which is fastest will be AD-specific.
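If you want to compare the variants outside the model, something like the sketch below works; the data are made up with the same shapes, and the helper names (bcast, looped) are just for this example. Treat it only as a starting point, since in-model timings also depend on the AD backend.

using BenchmarkTools
g, α, β = randn(4), randn(4, 7), randn(4, 6)
treatment, actor, block = rand(1:4, 504), rand(1:7, 504), rand(1:6, 504)

# broadcast version, as used in the model above
bcast(g, α, β, t, a, b) = broadcast((ti, ai, bi) -> g[ti] + α[ti, ai] + β[ti, bi], t, a, b)

# plain for-loop version
function looped(g, α, β, t, a, b)
    out = similar(g, length(t))
    for i in eachindex(t)
        out[i] = g[t[i]] + α[t[i], a[i]] + β[t[i], b[i]]
    end
    return out
end

@btime bcast($g, $α, $β, $treatment, $actor, $block);
@btime looped($g, $α, $β, $treatment, $actor, $block);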

LKJ cholesky parameterization

You’re absolutely right that something like Stan’s lkj_corr_cholesky should speed things up a lot. This is because MvNormal’s constructor IIRC performs a cholesky decomposition, which is expensive, plus the LKJ density computes the determinant of the correlation matrix, which can be done cheaply from a cholesky factor. Unfortunately Distributions.jl doesn’t have such a distribution, so we can’t use it in Turing. I took a stab at creating a custom distribution, but I’m bad at using Bijectors, so I wasn’t successful. Perhaps one of the other folks here will have better luck. We definitely should have something like this, and I’d advise opening an issue on the Turing.jl repo to discuss it. We can get something working for Soss.jl though.
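To sketch the determinant point with a toy correlation matrix:

using LinearAlgebra
ρ = [1.0 0.3 0.1; 0.3 1.0 0.2; 0.1 0.2 1.0]
L = cholesky(ρ).L
logdet(ρ) ≈ 2 * sum(log, diag(L))  # true; only the diagonal of the Cholesky factor is needed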

LKJ cholesky model with Soss

MeasureTheory.jl has an LKJL distribution for the lower triangular cholesky factor of a correlation matrix drawn from the LKJ distribution. It has some issues, but we can try working around them. Soss.jl works with MeasureTheory, so we’ll use Soss.

# using Pkg
# Pkg.add(["Soss", "MeasureTheory", "TuringModels", "CSV", "DataFrames", "DynamicHMC", "LogDensityProblems"])

using Soss, MeasureTheory, LinearAlgebra, Random
using Soss: TransformVariables

# work around LKJL issues, see https://github.com/cscherrer/MeasureTheory.jl/issues/100
# note that we're making LKJL behave like an LKJU
Soss.xform(d::LKJL, _data::NamedTuple=NamedTuple()) = TransformVariables.as(d);
function Base.rand(rng::AbstractRNG, ::Type, d::LKJL{k}) where {k}
    return cholesky(rand(rng, MeasureTheory.Dists.LKJ(k, d.η))).U
end

# get data
using TuringModels, CSV, DataFrames
data_path = joinpath(TuringModels.project_root, "data", "chimpanzees.csv");
df = CSV.read(data_path, DataFrame; delim=';');
df.block_id = df.block;
df.treatment = 1 .+ df.prosoc_left .+ 2 * df.condition;

# build the model
model = @model actor, block, treatment, n_treatment, n_actor, n_block begin
    σ_block ~ Exponential(1) |> iid(n_treatment)
    σ_actor ~ Exponential(1) |> iid(n_treatment)
    U_ρ_block ~ LKJL(n_treatment, 2)
    U_ρ_actor ~ LKJL(n_treatment, 2)
    g ~ Normal(0, 1) |> iid(n_treatment)
    z_block ~ Normal(0, 1) |> iid(n_treatment, n_block)
    z_actor ~ Normal(0, 1) |> iid(n_treatment, n_actor)
    β = σ_block .* (U_ρ_block' * z_block)
    α = σ_actor .* (U_ρ_actor' * z_actor)
    p = broadcast(treatment, actor, block) do t, a, b
        return @inbounds logistic(g[t] + α[t, a] + β[t, b])
    end
    pulled_left .~ Binomial.(1, p)
end

# condition the model on our data
params = (
    actor=df.actor,
    block=df.block,
    treatment=df.treatment,
    n_treatment=4,
    n_actor=7,
    n_block=6,
)
model_cond = model(params) | (pulled_left=df.pulled_left,)

# get ready to run DynamicHMC
using DynamicHMC, LogDensityProblems
ad_backend = Val(:ForwardDiff)
ℓ(x) = Soss.logdensity(model_cond, x)
trans = Soss.xform(model_cond)
P = LogDensityProblems.TransformedLogDensity(trans, ℓ)
∇P = LogDensityProblems.ADgradient(ad_backend, P)

rng = MersenneTwister(3702)
nchains = 4
results = map(1:nchains) do _
    @time mcmc_with_warmup(rng, ∇P, 1_000; reporter=ProgressMeterReport())
end;
# get posterior samples
posterior = map(r -> P.transformation.(r.chain), results)

Note that we're using ForwardDiff here, so we expect it to be slow, but we still see major speed-ups, running in ~3 minutes/chain. This PR adds ReverseDiff support to LogDensityProblems and brings the runtime down to ~24 s/chain (just swap in ad_backend = Val(:ReverseDiff)). Not quite Stan's ~6 s/chain on my machine, but not bad.

EDIT: MeasureTheory.Binomial has a parameterization in terms of logitp; if we use that instead of calling logistic, the runtime with ReverseDiff drops to ~16 s/chain.

You probably also want to check diagnostics and posterior predictions. Here’s how to get everything you need to use ArviZ.jl:

# extract useful statistics
# adapted from https://github.com/arviz-devs/ArviZ.jl/pull/131
sample_stats = map(results) do r
    map(r.chain, r.tree_statistics) do chn, stat
        term = stat.termination
        return (
            lp=LogDensityProblems.logdensity(P, chn),
            energy=stat.π,
            tree_depth=stat.depth,
            acceptance_rate=stat.acceptance_rate,
            n_steps=stat.steps,
            diverging=term.left == term.right,
            turning=term.left < term.right,
        )
    end
end

# collect posterior predictions and log likelihood
varkeys = keys(posterior[1][1])
model_pred = predictive(model, varkeys...)
posterior_predictive_log_likelihood = map(posterior) do post
    map(post) do draw
        pred = rand(rng, model_pred(merge(params, draw)))
        return (
            pulled_left=pred.pulled_left,
            log_likelihood=logdensity.(Binomial.(1, pred.p), df.pulled_left),
        )
    end
end
posterior_predictive = map(posterior_predictive_log_likelihood) do chn
    map(chn) do draw
        (pulled_left=draw.pulled_left,)
    end
end
log_likelihood = map(posterior_predictive_log_likelihood) do chn
    map(chn) do draw
        (pulled_left=draw.log_likelihood,)
    end
end

# gather samples together for analysis with ArviZ
using ArviZ
idata = from_namedtuple(
    posterior;
    sample_stats=sample_stats,
    posterior_predictive=posterior_predictive,
    log_likelihood=log_likelihood,
    observed_data=(pulled_left=df.pulled_left,),
    dims=Dict(
        :σ_block => [:treatment],
        :σ_actor => [:treatment],
        :g => [:treatment],
        :U_ρ_block => [:treatment, :treatment2],
        :U_ρ_actor => [:treatment, :treatment2],
        :z_block => [:treatment, :block],
        :z_actor => [:treatment, :actor],
    ),
    coords=Dict(:treatment => 1:4, :treatment2 => 1:4, :actor => 1:7, :block => 1:6),
)
summarystats(idata)


Many thanks @sethaxen for helping me out, trying so many variants, and finding the workaround in Soss.jl. Soss seems like an interesting package that I should play around with too, and MeasureTheory.jl looks interesting as well! Yes, Stan seems to have heavily optimized their package already; their performance is top notch. After all, Stan is still the gold standard among PPLs, not just for performance but also for its community, documentation, many numerically stable math functions, and deep ecosystem in R. It's a great role model for other PPLs to learn from.

This numerical issue is quite nasty and I learned many things from you. Many thanks again for the help.

I am closing this topic here to avoid double posting. I have opened an issue on Turing.jl to request LKJCholesky (or a similar name), since its absence seems to be the root cause.


This is really great, thanks @sethaxen ! It’s exciting to see Soss and MeasureTheory getting performance like this in the wild.


@cscherrer Soss.jl and MeasureTheory.jl seem like great libraries that I should explore, and the syntax similarity between Soss and Turing is another reason to do so.

I have no background in measure theory. Reading from MeasureTheory.jl:

A distribution (as in Distributions.jl) is also called a probability measure , and carries with it the constraint of adding (or integrating) to one. Statistical work usually requires this “at the end of the day”, but enforcing it at each step of a computation can have considerable overhead.

As a generalization of the concept of volume, measures also have applications outside of probability theory.

Measure theory seems like a better fit for numerical methods like MCMC since it does not carry the distribution constraint of adding (or integrating) to one? And in Turing, is that constraint handled with Bijectors.jl?

Do you have recommended resources for learning measure theory? Preferably ones that use Soss.jl and MeasureTheory.jl, so I can explore the two packages at the same time. I think not many people are familiar with measure theory, so it might also be good to add some recommended resources to the MeasureTheory.jl README.md to help people get started.

Do you also have recommended resources for someone familiar with Turing and basic Bayesian statistics (e.g. from Statistical Rethinking) to get started with Soss.jl and MeasureTheory.jl?

Thanks in advance

Same here. I'm intrigued by Soss, but it seems to draw on concepts I'm less familiar with and maybe solve problems that I didn't even know were problems.

Turing seems perfect for most of my use cases, though I've used Gen for a complex model that required RJMCMC and Langevin MC. And I'm really impressed with the speed of Soss in the current example!

I think it would be great to have some informal challenges/competitions here on Discourse. For example, a user posts an example of a statistical problem that they think is best handled by Soss/Gen/Turing, along with their code, benchmark, and posterior. Then we see if others can improve on it using the same or a different PPL. Improvement is slightly subjective, of course, as some may prefer cleaner syntax over a slight speed-up. Either way, I think this would be a great learning resource for picking up all the little (undocumented) tricks people use to supercharge their models. It would also showcase where different parts of the Julia PPL landscape shine.

It’s funny, in Bayesian stats you’ll see a lot of references to the logpdf of the posterior, but that’s almost never correct. The normalization factor for the posterior (the “evidence”) takes some work to get, and we almost never have it at the beginning. So the “pdf” of the posterior as we have it available doesn’t integrate to one. Unless we calculate the normalization factor first, the thing we’re working with is exactly what we’re doing in MeasureTheory. We’re just making it explicit.
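In symbols, this is just the usual Bayes identity (nothing MeasureTheory-specific):

$$
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}, \qquad p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta,
$$

and it is the unnormalized numerator we actually evaluate, because the evidence p(y) is not available.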

This is not quite it. Turing works in terms of Distributions.jl, which forces everything to be normalized. But we know the posterior won’t be normalized, so many of the normalizations in Distributions.jl are just wasted.

There’s a little more to it than this. A lot of the work in MCMC comes from computing gradients. In that case, the normalization goes away, and the Turing team’s DistributionsAD can compute things efficiently. So there’s something else going on here.

Not really. There’s material, but most of what I’ve seen is very abstract. I think especially for this case I just need to write some more. But starting out, it shouldn’t be hard. Just think of it as “non-normalized distributions”.

This is a great idea! My only concern on the Soss/MeasureTheory side is having the time to contribute much. Soss/MeasureTheory doesn't have a team in the way Turing and Gen do, though I'm hoping we can build things up and get more community involvement.


Thanks @cscherrer


I noticed when trying your model that the logit_p calculation might have some type instability, because when I run @code_warntype it shows up in red.

I have encountered this issue multiple times and always found it really annoying that there is no good way to deal with this kind of multiple-array indexing, which comes up whenever several categorical indexes co-occur.

But then I came across CartesianIndex. It seems too good to be true, but benchmarking it (outside the model at least) looked significantly better in every respect: min/median/max/mean runtime, memory estimate, and allocation estimate.
It really does seem too good to be true, yet it also does not appear as type unstable in the model.

The way I did it then is this:

logit_p = g[treatment] .+ α[CartesianIndex.(treatment, actor)] .+ β[CartesianIndex.(treatment, block)]

I just feel like there must be a catch; can anyone confirm?