Turing.jl coin flipping example with large number of data points

Please forgive my naïveté (I don’t have any formal training in probabilistic programming/Bayesian statistics) but I’m trying to understand something about the coin flipping example in the Turing.jl docs.

Here’s a bare-bones version of their example:

using Distributions
using StatsPlots
using Turing
using MCMCChains
using Random

Random.seed!(12)
data = rand(Bernoulli(0.5), 100)  # 100 simulated flips of a fair coin

@model coinflip(y) = begin
    # Prior on the probability of heads
    p ~ Beta(1, 1)

    # Each observed flip is Bernoulli(p)
    N = length(y)
    for n in 1:N
        y[n] ~ Bernoulli(p)
    end
end

iterations = 1000
ϵ = 0.05  # leapfrog step size
τ = 10    # number of leapfrog steps

chain = sample(coinflip(data), HMC(ϵ, τ), iterations)

This yields the following:

julia> p_summary = chain[:p]
Object of type Chains, with data of type 1000×1×1 Array{Union{Missing, Real},3}

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = p

2-element Array{ChainDataFrame,1}

Summary Statistics

│ Row │ parameters │ mean     │ std       │ naive_se   │ mcse        │ ess     │ r_hat │
│     │ Symbol     │ Float64  │ Float64   │ Float64    │ Float64     │ Any     │ Any   │
├─────┼────────────┼──────────┼───────────┼────────────┼─────────────┼─────────┼───────┤
│ 1   │ p          │ 0.530459 │ 0.0554448 │ 0.00175332 │ 0.000617709 │ 1316.33 │ 0.999 │

Quantiles

│ Row │ parameters │ 2.5%     │ 25.0%   │ 50.0%    │ 75.0%    │ 97.5%    │
│     │ Symbol     │ Float64  │ Float64 │ Float64  │ Float64  │ Float64  │
├─────┼────────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ p          │ 0.422654 │ 0.49629 │ 0.530401 │ 0.566413 │ 0.635665 │

And the plot looks like this:

p = plot(p_summary, seriestype = :density, xlim = (0,1), legend = :best, w = 2, c = :blue)

[density plot of the posterior samples of p]

This is all well and good, and I understand what's going on. However, if I start to increase the length of the data array, things start to go haywire (in other words, I stop understanding what's going on :wink:).

Changing the length of the data array from 100 to 100,000 (i.e. data = rand(Bernoulli(0.5), 100) becomes data = rand(Bernoulli(0.5), 100_000)) yields the following:

julia> p_summary = chain[:p]
⋮
Summary Statistics

│ Row │ parameters │ mean     │ std         │ naive_se   │ mcse    │ ess     │ r_hat   │
│     │ Symbol     │ Float64  │ Float64     │ Float64    │ Float64 │ Any     │ Any     │
├─────┼────────────┼──────────┼─────────────┼────────────┼─────────┼─────────┼─────────┤
│ 1   │ p          │ 0.187243 │ 5.55389e-17 │ 1.7563e-18 │ 0.0     │ 4.01606 │ 1.00289 │

Quantiles

│ Row │ parameters │ 2.5%     │ 25.0%    │ 50.0%    │ 75.0%    │ 97.5%    │
│     │ Symbol     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼────────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ p          │ 0.187243 │ 0.187243 │ 0.187243 │ 0.187243 │ 0.187243 │

And the plot is just blank since the value of p is constant. Furthermore, if I run this same code several times, I get wildly different results for p.

What’s going on here? I thought that as the amount of data increased, the posterior distribution would simply become taller/skinnier around the true value of 0.5. However, the model does a much better job of estimating p with 100 data points than it does with 100,000. What am I missing?
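For what it’s worth, my expectation can be checked analytically: with a Beta(1, 1) prior and Bernoulli observations the posterior is conjugate, Beta(1 + heads, 1 + tails), so it really should concentrate around 0.5 as the data grows. A quick sanity check, separate from the sampler:

using Distributions

data = rand(Bernoulli(0.5), 100_000)
heads = count(data)                          # data is a Vector{Bool}, count() counts the trues
exact_posterior = Beta(1 + heads, 1 + length(data) - heads)

mean(exact_posterior), std(exact_posterior)  # ≈ (0.5, 0.0016)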


The problem is HMC’s (time) step size. When you increase the number of data points, the norm of the gradient of the log joint probability (logp) goes up, because the likelihood term is now larger. Under the hood, HMC simulates Hamiltonian dynamics with a numerical integrator to generate the next proposal, and the gradient of logp plays the role of the “force” in those dynamics. If the simulation were exact, the acceptance rate (reported in chain) would be 1. If the step size is too large, however, the integration error lowers the acceptance rate, and that error scales with the norm of the gradient of logp. To bring it back down, you need a smaller step size. Imagine simulating a falling rock: if gravity is very strong, the rock falls very fast, so you need a small time step to simulate the system accurately. Lowering the step size in your example makes it work fine.
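For example, something along these lines should behave much better with 100,000 observations (the value of ϵ here is only an illustrative guess, not a tuned one):

# Smaller leapfrog step size to compensate for the larger gradient norm;
# ϵ = 0.001 is just an illustrative value and may still need tuning
chain = sample(coinflip(data), HMC(0.001, 10), iterations)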

For more on HMC, Neal’s “MCMC using Hamiltonian dynamics” (https://arxiv.org/pdf/1206.1901.pdf) is a good read.


Thank you :slightly_smiling_face: I know this is a bit of a subjective question, but do most people use the NUTS sampler to avoid having to tweak the parameters of the HMC sampler? When I run the code with the NUTS sampler, I can just keep the same target acceptance rate and get good results no matter the length of the data array, e.g. with the call below.
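For concreteness, this is roughly what I’m running (assuming a Turing version where NUTS takes the target acceptance rate as its single argument; 0.65 is the usual default):

# NUTS adapts the step size automatically during warm-up;
# 0.65 is the target acceptance rate
chain = sample(coinflip(data), NUTS(0.65), iterations)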


That, and bailing out of the trajectory early when it starts to make a U-turn.


I love the discussion in this thread. Just for fun, I ran the code with the NUTS() algorithm and got essentially perfect results, so I agree that the NUTS sampler is indeed saving the day for HMC here.

Note: I used a coin-flip simulation of size 100,000 and kept the number of sampler iterations the same.

Thanks for the nice discussion.
Sourish

chain = sample(coinflip(data), NUTS(), iterations)
Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat   ess_per_sec 
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64       Float64 

           p    0.4976    0.0016     0.0000    0.0001   417.2540    1.0058       86.4775

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 

           p    0.4945    0.4966    0.4977    0.4986    0.5006