Transducers for sufficient statistics

For independent observations from some distribution, we end up doing a lot of

sum(logpdf.(myDist, x))

But in some cases the algebra simplifies a lot, and we can just track sufficient statistics. A common example is the normal distribution, where we only need (Σx, Σx²). This generalizes quite a bit: there’s a whole collection of exponential families that allow this kind of optimization.
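For the normal case, the whole log-likelihood can be recovered from (n, Σx, Σx²) alone. A quick sketch (the helper name here is made up):

using Distributions

# Total Normal log-likelihood from sufficient statistics: as a function of the
# data, sum(logpdf.(Normal(μ, σ), x)) only depends on n = length(x), Σx, and Σx².
normal_loglik_from_stats(μ, σ, n, sx, sx2) =
    -n * log(σ * sqrt(2π)) - (sx2 - 2μ * sx + n * μ^2) / (2σ^2)

x = randn(1000)
sum(logpdf.(Normal(0.3, 1.2), x)) ≈
    normal_loglik_from_stats(0.3, 1.2, length(x), sum(x), sum(abs2, x))  # true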

I’ve been looking into this a little bit; current progress is here. But I’m a bit stuck on

function iid(d::NatExpFamDist{P,X}) where {P,X}
    # log h for an array of observations: the base measures multiply, so the logs sum.
    logh(xs::AbstractArray{<:X}) = sum(d.fam.logh.(xs))
    # Sufficient statistic for an array: elementwise sum of the per-observation statistics.
    t(xs::AbstractArray{<:X}) = reduce((a, b) -> a .+ b, d.fam.t.(xs))
    # Log-partition function, unchanged in terms of the natural parameter.
    a(η::P) = d.fam.a(η)

    fam = NaturalExponentialFamily{P,AbstractArray{X}}(logh, t, a)
    NatExpFamDist(fam, d.η)
end

I already need to do some weird reduce tricks, and I also traverse the data twice (a sum and a reduce), which is not great.
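A plain mapreduce can fuse the two traversals into one pass, roughly like this (just a sketch, reusing the logh and t fields from above), but it still feels ad hoc:

# Single-pass sketch: accumulate the pair (Σ logh(x), Σ t(x)) together so the
# data is only traversed once.
function iid_stats(fam, xs)
    combine(a, b) = (a[1] + b[1], a[2] .+ b[2])
    mapreduce(x -> (fam.logh(x), fam.t(x)), combine, xs)
end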

This seems like a natural place to use @tkf’s Transducers, but I’m not familiar enough with it yet to see how to go about it. Any ideas?

Even with these issues and without being too careful yet about the typing, the performance is pretty good. So I think there could be a big advantage once it’s cleaned up :slight_smile:


Making some progress; I think it will be something like this:

julia> f(a,b) = (a[1]+b[1], a[2]+b[2])
f (generic function with 1 method)

julia> mapfoldl(Zip(Map(identity),Map(x -> x^2)), Completing(f), 1:10, init=(0.0,0.0))
(55.0, 385.0)
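Applied to the exponential-family case, maybe something like this (rough sketch, assuming fam exposes logh and t as in the iid function above):

# Zip pairs the logh stream with the sufficient-statistic stream, and a single
# fold accumulates both at once.
xf = Zip(Map(fam.logh), Map(fam.t))
combine((lh1, t1), (lh2, t2)) = (lh1 + lh2, t1 .+ t2)
# init just supplies a zero of the right shape for the sufficient statistic.
mapfoldl(xf, Completing(combine), xs, init = (0.0, zero.(fam.t(first(xs)))))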

It’s great that you find Transducers.jl useful! But I wonder if OnlineStats.jl already does the job out-of-the-box.
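For the normal example, something like this might already be enough:

using OnlineStats

# One pass over the data; Mean and Variance together carry the same
# information as (n, Σx, Σx²) for the normal case.
o = fit!(Series(Mean(), Variance()), randn(10_000))
value(o)  # (mean, variance)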

You can also convert an OnlineStat to a transducer. This could be useful if you want to compose pre-/post-processing steps that are easier to express as transducers.

FYI it can also be written as

foldl(right, Zip(Scan(+), Map(x -> x^2) |> Scan(+)), 1:10)

For the Normal case, yes, but I think I need finer control over the computation, and more composability. But OnlineStats looks great; I think it will be useful for other things anyway :slight_smile:

I don’t understand this yet. Is one of them more easily made parallel? I had wondered about rewriting it using mapreduce to see if that could help performance.

Right, Scan does not support parallelism (yet). Supporting Scan in parallel reduce is rather tricky (https://www.youtube.com/watch?v=JCvT9Rnhyvk) but maybe special-casing when the bottom reducing function is right is a good solution. It’s relatively easy to implement and can be done without allocations.
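The Zip-of-Maps version, on the other hand, already has the shape the parallel reduce wants, since its step is just an associative tuple sum. An untested sketch:

using Transducers

# Scan-free form: each element maps to (x, x²) and an associative step merges
# partial (Σx, Σx²) results, so the work can be split across threads.
add2((s1, s2), (x1, x2)) = (s1 + x1, s2 + x2)
reduce(Completing(add2), Zip(Map(identity), Map(x -> x^2)), 1:10, init = (0, 0))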
