Extracting summary info from multiple MCMCChains programmatically

#1

I am running multiple simulations with Turing and would like to extract some of the summary statistics from each chain. Below, I have created a simple example of what I am trying to achieve. For some reason, I cannot get the summary statistics for each chain. Sometimes it fails on the first chain. Other times it fails on the 9th or 10th. It seems like the summary information is not being computed consistently.

function getSummary(ch)
    chain = ch[1001:end,:,:]
    return chain.info.hashedsummary.x[2].summaries[1].value
end

using Turing 
N=50
θ = .6
k = rand(Binomial(N,θ))
@model model(N,k) = begin
    θ ~ Beta(5,5)
    k~ Binomial(N,θ)
end
Nsamples = 2000
Nadapt = 1000
δ = .85
specs = NUTS(Nsamples,Nadapt,δ)
chain = map(x->sample(model(N,k),specs),1:20)

x = []
for (i,v) in enumerate(chain)
    println(i)
    push!(x,getSummary(v))
end

Error message:

1
2
3
4
5
6
7
8
9
BoundsError: attempt to access 0-element Array{MCMCChains.ChainSummary,1} at index [1]
getindex at array.jl:729 [inlined]
getSummary(::Chains{Union{Missing, Float64},Float64,NamedTuple{(:internals, :parameters),Tuple{Array{String,1},Array{String,1}}},NamedTuple{(:hashedsummary,),Tuple{Base.RefValue{Tuple{UInt64,MCMCChains.ChainSummaries}}}}}) at untitled-30c8ef2a1c6a84d0135df67b323151eb:3
top-level scope at untitled-30c8ef2a1c6a84d0135df67b323151eb:23 [inlined]
top-level scope at none:0

However, the following works:

 chain[9]
  Object of type Chains, with data of type 2000×7×1 Array{Union{Missing, Float64},3}

Log evidence      = 0.0
Iterations        = 1:2000
Thinning interval = 1
Chains            = 1
Samples per chain = 2000
internals         = elapsed, epsilon, eval_num, lf_eps, lf_num, lp
parameters        = θ

parameters
   Mean    SD   Naive SE  MCSE     ESS  
θ 0.6497 0.0635   0.0014 0.0026 603.6478
0 Likes

#2

One workaround involves printing each chain before extracting the summary.

chain = map(x->sample(model(N,k),specs),1:20)
map(x->println(x),chain)
x = []
for (i,v) in enumerate(chain)
    println(i)
    push!(x,getSummary(v))
end

Is there a way to do this without printing values? Do you happen to know @goedman ?

0 Likes

#3

Chris, you mean you simply don’t want to see the printouts?

Chain summaries are cached I believe so they have to be created at least once. But it seems getSummary(v) doesn’t correctly figure this out in your example.

If your answer to above question is yes (don’t show the summaries but make sure they are cached), as a workaround, can you just redirect the print output? E.g.:

TmpDir = mktempdir()
io = open("$TmpDir/myfile.txt", "a");
map(x->print(io, x),chain)
close(io)
rm("$TmpDir/myfile.txt")

x = []
for (i,v) in enumerate(chain)
    println(i)
    push!(x,getSummary(v))
end
0 Likes

#4

Thanks, Rob. My main concern is performance of printing to the console or the hard drive. Its an expensive operation. Right now this is not causing a major issue, but I could see how this might be problematic with larger simulations.

I suppose one solution on my side might be to create my own summary function based on the MCMCChains code and remove the printing. Another might be to create optional printing on the MCMCChains side with a default value of true. I’m not sure if there would be much demand for that functionality.

0 Likes

#5

Let’s see what Cameron thinks. You could write it to an IOBuffer, but it still is just a workaround.

There are also other reasons to have the chain summary programmatically accessible, e.g. for testing and comparisons. Cameron and I have spoken about options such as mean(chain, [:alpha, :sigma]) to make it easier to compute the summary fields, but not about storing and later on retrieving those results.

0 Likes

#6

I like your idea. Something like summary(chain) seems to be a simple solution.

0 Likes

#7

Sorry I missed this when it came out! Would you mind putting stuff like this as issues on the repo? That’ll help me track these better and help other people out when you find stuff like this.

The big problem here is that you’re using the hashedsummary field, as Rob pointed out. Right now the hashsummary is only updated and used by the show function, since it’s an expensive operation. It also has nothing in it to begin with — it’s only created to have something to store in a reference. The only thing I’m curious about is why it only happens every once in a while – this should work either always or never.

I think what you want is to change your function to generate the summary stats on demand (using MCMCChains.summarystats). Your new function should look like

function getSummary(ch)
    chain = ch[1001:end,:,:]
    return MCMCChains.summarystats(chain).summaries[1].value
end

You can also import StatsBase and call the function directly, since summarystats is an overload on top of StatsBase.summarystats:

using StatsBase

function getSummary(ch)
    chain = ch[1001:end,:,:]
    return summarystats(chain).summaries[1].value
end

Honestly, I’m not a big fan of all this ChainSummary stuff. It’s overly difficult to get numbers from, as illustrated here. I wonder if it might be better to overhaul this so you don’t have to do all this stupid indexing (MCMCChains.summarystats(chain).summaries[1].value).

0 Likes

#8

Thanks. What you have proposed is a feasible stopgap solution and exposes my ignorance of the inner workings of MCMCChains. I do agree with your assessment that there might be an easier way to work with summaries. I opened an issue as you requested. Thanks!

1 Like