Transducers for sufficient statistics

For independent observations from some distribution, we end up doing a lot of

sum(logpdf.(myDist, x))

But in some cases the algebra simplifies a lot, and we can just track sufficient statistics. A common example is the normal distribution, where we only need (Σx, Σx²). This generalizes quite a bit: there’s a whole collection of exponential families that allow this kind of optimization.
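For the normal case, the whole log-likelihood can be recovered from (n, Σx, Σx²) alone. A quick sketch (the helper name here is made up):

using Distributions

# Total Normal log-likelihood from sufficient statistics: as a function of the
# data, sum(logpdf.(Normal(μ, σ), x)) only depends on n = length(x), Σx, and Σx².
normal_loglik_from_stats(μ, σ, n, sx, sx2) =
    -n * log(σ * sqrt(2π)) - (sx2 - 2μ * sx + n * μ^2) / (2σ^2)

x = randn(1000)
sum(logpdf.(Normal(0.3, 1.2), x)) ≈
    normal_loglik_from_stats(0.3, 1.2, length(x), sum(x), sum(abs2, x))  # true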

I’ve been looking into this a little bit; current progress is here. But I’m a bit stuck on

function iid(d::NatExpFamDist{P,X}) where {P,X}
    # log h for an array of observations: the base measures multiply, so the logs sum.
    logh(xs::AbstractArray{<:X}) = sum(d.fam.logh.(xs))
    # Sufficient statistic for an array: elementwise sum of the per-observation statistics.
    t(xs::AbstractArray{<:X}) = reduce((a, b) -> a .+ b, d.fam.t.(xs))
    # Log-partition function, unchanged in terms of the natural parameter.
    a(η::P) = d.fam.a(η)

    fam = NaturalExponentialFamily{P,AbstractArray{X}}(logh, t, a)
    NatExpFamDist(fam, d.η)
end

I already need to do some weird reduce tricks, and I also traverse the data twice (a sum and a reduce), which is not great.
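A plain mapreduce can fuse the two traversals into one pass, roughly like this (just a sketch, reusing the logh and t fields from above), but it still feels ad hoc:

# Single-pass sketch: accumulate the pair (Σ logh(x), Σ t(x)) together so the
# data is only traversed once.
function iid_stats(fam, xs)
    combine(a, b) = (a[1] + b[1], a[2] .+ b[2])
    mapreduce(x -> (fam.logh(x), fam.t(x)), combine, xs)
end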

This seems like a natural place to use @tkf’s Transducers, but I’m not familiar enough with it yet to see how to go about it. Any ideas?

Even with these issues and without being too careful yet about the typing, the performance is pretty good. So I think there could be a big advantage once it’s cleaned up :slight_smile:


Making some progress; I think it will be something like this:

julia> f(a,b) = (a[1]+b[1], a[2]+b[2])
f (generic function with 1 method)

julia> mapfoldl(Zip(Map(identity),Map(x -> x^2)), Completing(f), 1:10, init=(0.0,0.0))
(55.0, 385.0)
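Applied to the exponential-family case, maybe something like this (rough sketch, assuming fam exposes logh and t as in the iid function above):

# Zip pairs the logh stream with the sufficient-statistic stream, and a single
# fold accumulates both at once.
xf = Zip(Map(fam.logh), Map(fam.t))
combine((lh1, t1), (lh2, t2)) = (lh1 + lh2, t1 .+ t2)
# init just supplies a zero of the right shape for the sufficient statistic.
mapfoldl(xf, Completing(combine), xs, init = (0.0, zero.(fam.t(first(xs)))))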

It’s great that you find Transducers.jl useful! But I wonder if OnlineStats.jl already does the job out-of-the-box.
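For the normal example, something like this might already be enough:

using OnlineStats

# One pass over the data; Mean and Variance together carry the same
# information as (n, Σx, Σx²) for the normal case.
o = fit!(Series(Mean(), Variance()), randn(10_000))
value(o)  # (mean, variance)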

You can also convert an OnlineStat to a transducer. This could be useful if you want to compose pre-/post-processing steps that are easier to express as transducers.

FYI it can also be written as

foldl(right, Zip(Scan(+), Map(x -> x^2) |> Scan(+)), 1:10)

For the Normal case, yes, but I think I need finer control over the computation, and more composability. But OnlineStats looks great; I think it will be useful for other things anyway :slight_smile:

I don’t understand this yet. Is one of them more easily made parallel? I had wondered about rewriting it using mapreduce to see if that could help performance.

Right, Scan does not support parallelism (yet). Supporting Scan in parallel reduce is rather tricky (https://www.youtube.com/watch?v=JCvT9Rnhyvk) but maybe special-casing when the bottom reducing function is right is a good solution. It’s relatively easy to implement and can be done without allocations.
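The Zip-of-Maps version, on the other hand, already has the shape the parallel reduce wants, since its step is just an associative tuple sum. An untested sketch:

using Transducers

# Scan-free form: each element maps to (x, x²) and an associative step merges
# partial (Σx, Σx²) results, so the work can be split across threads.
add2((s1, s2), (x1, x2)) = (s1 + x1, s2 + x2)
reduce(Completing(add2), Zip(Map(identity), Map(x -> x^2)), 1:10, init = (0, 0))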
