I was recently introduced to Turing. In all the example code I see, the model is given a fixed set of observations and sampled for a specific number of iterations. I am wondering: is it possible to use a different subset of the input data for each iteration of sampling?
I think it’s important to separate the abstract sampling algorithm from the particulars of implementation, and how it’s called from a given PPL.
It sounds like you want to do something like a bootstrap, is that right? Those typically use maximum likelihood estimation. Or do you want an MCMC version of this? Do you have any specifics not particular to Turing?
Thank you for your quick response. I want to work on a Bayesian neural network using MCMC. Since the input data is quite large, I would like to split it into minibatches and draw a few (maybe only one) MCMC samples using each minibatch. It would be trivial to implement a naive version of this manually, but I feel the performance could be significantly better if Turing could do this automatically.
Thank you for the suggestion of ZigZagBoomerang.jl; I will also look into it.
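For concreteness, here is what I mean by the naive manual version — a short chain per minibatch, warm-started from the previous chain's last draw. This is my own sketch, not an official Turing pattern: `make_model` is a hypothetical model constructor, and I'm assuming a recent Turing.jl where `sample` accepts an `initial_params` keyword.

```julia
using Turing, Random

# Hypothetical toy model taking a minibatch (xb, yb) as data;
# any Turing model that takes its data as arguments works the same way.
@model function make_model(xb, yb)
    θ ~ Normal(0, 1)
    for i in eachindex(yb)
        yb[i] ~ Normal(θ * xb[i], 1.0)
    end
end

# Naive minibatch loop: a few NUTS steps per minibatch, each chain
# initialized at the last draw of the previous one.
function minibatch_sample(x, y; batchsize = 100, nsteps = 10)
    idx = shuffle(eachindex(y))
    θ0 = nothing            # warm-start parameters, none for the first batch
    chains = []
    for batch in Iterators.partition(idx, batchsize)
        model = make_model(x[batch], y[batch])
        chain = θ0 === nothing ?
            sample(model, NUTS(), nsteps) :
            sample(model, NUTS(), nsteps; initial_params = θ0)
        θ0 = vec(Array(chain)[end, :])   # last draw → init for next batch
        push!(chains, chain)
    end
    return chains
end
```

Note that this is not a valid MCMC scheme for the full posterior (the target distribution changes with each minibatch); it only illustrates the mechanics I'm asking about.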
It’s only a bit tricky: you need to know in advance how many samples there are in total, and then you can estimate your likelihood unbiasedly in various ways, including going through the data by subsets in random order and repeating. Better ping @SebaGraz
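The unbiased estimate here is just inverse-probability weighting: with `N` data points in total and a uniformly random subset `S` of size `m`, the quantity `(N/m) * Σ_{i∈S} log p(y_i | θ)` has the full-data log-likelihood as its expectation. A minimal sketch of that idea (my own notation, nothing Turing-specific; `loglik_i(θ, i)` is a hypothetical per-observation log-likelihood):

```julia
using Random

# Unbiased subsampled estimate of the full-data log-likelihood:
# E[(N/m) * Σ_{i∈S} loglik_i(θ, i)] = Σ_{i=1}^{N} loglik_i(θ, i)
# when S is a uniformly random size-m subset of 1:N.
function subsampled_loglik(loglik_i, θ, N, m; rng = Random.default_rng())
    S = randperm(rng, N)[1:m]          # random subset of indices
    return (N / m) * sum(i -> loglik_i(θ, i), S)
end
```

This is why the total number of samples `N` must be known in advance: it enters the reweighting factor.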
What makes you think that? Typically hand-rolled performance is better, and we use PPLs more for convenience. The only exceptions I know of are in Soss.jl ^* where we can sometimes use symbolic simplifications to transform the log-density into a more efficient form.
^* The only exceptions in Julia; there are examples like Hakaru and Rainier in other languages.
Thanks, @mschauer. It might be worth adding that the subsampling scheme for piecewise deterministic Monte Carlo methods (which are implemented in ZigZagBoomerang.jl) is efficient after preprocessing the data (something like Newton steps) to find a mode of the posterior and centring the sampler there. So, somehow, even when subsampling, you must look at all your data once. P.S. Rethinking what I just said: you might get close to a posterior mode even if you subsample the gradient in Newton’s method, so it may not be strictly necessary to look at all your data.
If the ask is for a performant way of iterating over minibatches, why not use DataLoaders.jl (https://lorenzoh.github.io/DataLoaders.jl/dev), a parallel iterator for large machine-learning datasets that don’t fit into memory, inspired by PyTorch’s `DataLoader` class?
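Roughly like this — a sketch assuming DataLoaders.jl’s `DataLoader(data, batchsize)` constructor, which treats the last array dimension as the observation dimension (check the package docs for the exact API):

```julia
using DataLoaders

x = rand(10, 1_000)   # 1000 observations with 10 features each
y = rand(1_000)

# Iterate (features, targets) in parallel-loaded minibatches of 64.
for (xb, yb) in DataLoader((x, y), 64)
    # xb is 10×64, yb has length 64; run a few MCMC steps on this batch
end
```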