Working with posteriors

question

#1

Finding that I am writing a lot of repetitive code, I am trying to rethink the way I work with posterior distributions (eg from MCMC).

I can typically reduce most operations on posteriors to transformations, which operate on a particular draw from the posterior, and elementwise operations, eg calculating means and quantiles for various, potentially nested containers, eg an upper-triangular matrix

For transformations, the best container type is an AbstractVector{T} or similar, where T can be a struct or a NamedTuple of various scalars, arrays, etc; transforming to something similar (or a Tuple) one draw at a time. This is easy to program and iterate. Typical examples would be posterior predictive checks, transformed parameters, and summary statistics (eg RMSE and equivalents).

On the other hand, elementwise operations typically work best on something AbstractMatrix-like, where one can map columns using eachcol or mapslices.

One typically wants to convert from one representation to another multiple times.

I am curious how other Julia users approach this, and what existing libraries would be recommended. Packages with somewhat similar (but not identical) goals that I am aware of include



#2

Maybe Queryverse? As I understand it abstracts this concept and aims to easily connect a wide variety of data sources and sinks


#3

Maybe I was not clear (my bad). To make things concrete, consider the mock example

posterior_draws = [(a = i, b = fill(i, 3), c = fill(i, 3, 3)) for i in 1:10]
_flatten(d) = vcat(d.a, d.b, vec(d.c))
posterior_matrix = collect(permutedims(mapreduce(_flatten, hcat, posterior_draws)))

Sometimes you want to work with draws, eg

function do_stuff(d) # operate on posterior draw
    (statistic = d.a + sum(d.c), )
end
map(do_stuff, posterior_draws)

Then sometimes you prefer elementwise:

map(mean, eachcol(posterior_matrix))

My goal is to go back and forth between the two representations in some organized manner. This probably involves defining some kind of schema for the matrix representation’s column layout. I can do this, but wanted to ask before making yet another package.


#4

Is the current problem you’re hitting mostly trouble with performance, or ease of use? Where do you find the packages you linked to missing the mark?


#5

I am going for ease of use. Mosly, I have <10k posterior draws, so programming them is the costly part, not the runtime.

The packages I linked do something similar conceptually, but not what I describe above (or maybe I am using them wrong — in which case rewriting the examples above would be helpful).


#6

I don’t know that this is the best approach, but I’ve usually tried to keep things as an iterable over observations, and then do things like getfield.(posterior_draw, :a) when I need a column.

There seem to be quite a few packages trying to do what you describe or very similar things. DataStreams.jl is another worth checking out, as is IndexedTables.jl that’s used by JuliaDB.

Seems like one potential difficulty is that your columns have some structure - hierarchical, even if only one level of hierarchy.


#7

IndexedTables is “powered” by StructArrays (an IndexedTable is mainly a wrapped StructArray with information on which columns are sorted). Other than the packages you mention (StructArrays and RecursiveArrayTools), I think there is a simple idea that may be worth exploring:
StructArrays accepts arbitrary abstract arrays as columns, so you can combine it with a custom array types that is stored as an N-dimensional array but is an AbstractVector:

using StructArrays
using StructArrays: fieldarrays

struct FakeVector{T, N, M} <: AbstractVector{Array{T, M}}
    array::Array{T, N}
    function FakeVector(array::Array{T, N}) where {T, N}
        new{T, N, N-1}(array)
    end
end

eltypelength(f::FakeVector{T, N, M}) where {T, N, M} = M
Base.size(f::FakeVector) = (size(f.array, 1),)
Base.getindex(f::FakeVector, i) = f.array[i, (Colon() for x in 1:eltypelength(f))...] 
function Base.setindex!(f::FakeVector, val, i)
    f.array[i, (Colon() for _ in 1:eltypelength(f))...] = val
end
reshape2D(v::AbstractArray) = reshape(v, size(v, 1), :)
reshape2D(f::FakeVector) = reshape2D(f.array)

# usage
N = 1000
vals = rand(N)
gradients = FakeVector(rand(N, 3))
hessians = FakeVector(rand(N, 3, 3))
draws = StructArray(vals = vals, gradients = gradients, hessians = hessians)
draws_matrix = mapreduce(reshape2D, hcat, fieldarrays(draws))

# accessing individual arrays:
draws.gradients.array

In short I think that something like the FakeVector type above + StructArrays would solve it and I actually would like a FakeVector type like that. I may go as far as putting it in StructArrays or IndexedTables as it really helps by allowing to use tabular data machinery when working with arrays of more than 1 dimension. Does a package implementing it exist already?


#8

After a bit of experimentation, I found that I can mostly work with vectors of things (NamedTuples, matrices, etc) and map, occasionally manipulating the arrays with @andyferris’s excellent

and using @mauro3’s

to make working with NamedTuple very convenient. A worked example is below.

using SplitApplyCombine, Parameters, Statistics, Distributions, PGFPlotsX

###
### generate data (with fatter tails) and toy example posterior.
###
### (The posterior does not come from any valid inference, to make
### the example self-contained. It is not valid statistical methodology).

data = rand(TDist(3), 100);     # fat tails, will be detected by PPC
posterior = [(μ = randn()/10, σ = rand(Uniform(1.5, 2))) for _ in 1:1000];

# simulate data from posterior
sim = let N = length(data)
    map(posterior) do p
        @unpack μ, σ = p
        rand(Normal(μ, σ), N)
    end
end

function stats(d, p)
    @unpack μ, σ = p
    q5, q25, q75, q95 = quantile(@.((d - μ)/σ), [0.05, 0.25, 0.75, 0.95])
    [q75 - q25, q95 - q5]
end

data_stats = combinedims(map(p -> stats(data, p), posterior))
sim_stats = combinedims(map(stats, sim, posterior))

function makeplot(varlabel, data_stat, sim_stat)
    ab = [extrema(vcat(data_stat, sim_stat))...]
    @pgf Axis({ xlabel = "data " * varlabel, ylabel = "simulated " * varlabel },
              Plot({ blue }, Table(ab, ab)), # 45° line
              Plot({ only_marks, mark = "o", opacity = 0.5 }, Table(data_stat, sim_stat)))
end

makeplot(raw"standardized $q_{75} - q_{25}$", data_stats[1, :], sim_stats[1, :])
makeplot(raw"standardized $q_{95} - q_{5}$", data_stats[2, :], sim_stats[2, :])

These very modular tools make PPC very convenient in Julia. The only missing piece is getting the NamedTuples from the flat matrix data, for which I am working on an experimental package.