Making Turing Fast with large numbers of parameters?

I’ve got a Turing model where there are a few parameters for each of a couple hundred objects, plus some structural parameters… so let’s call it close to 500 parameters.

I have a friend who has a simpler model (similar number of parameters, expected to be less accurate but easier to fit) which fits in Stan, and he can get 100,000 samples in less than a couple of minutes. (I told him it was crazy overkill, but he figured why not; he’s thinning so he only stores 2000 or so, but it’s like an effective sample size of 2000.) Turing with ReverseDiff using NUTS on my computer wants 2 hours to get 600 samples… I enabled the cache and now it wants 5-10 minutes to get 600 samples… this is still hundreds of times slower than Stan.

Are there general recommendations for getting decent speed in Julia/Turing?

My model loops through a dataset and is essentially doing a nonlinear regression.

for i in 1:Nrows
    outcome[i] ~ Normal(somefunction(parameter1[data1[i]], parameter2[data2[i]], someextraparams), somescale)
end

somefunction is nonlinear, and its shape is determined by someextraparams.

parameter1 and parameter2 are vectors of a few hundred parameters and the data picks out which ones to use.
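To make that concrete, here’s a hypothetical model in this shape (the nonlinear function and all the names are made up):

using Turing

@model function nlreg(outcome, data1, data2, N1, N2)
    parameter1 ~ filldist(Normal(0, 1), N1)
    parameter2 ~ filldist(Normal(0, 1), N2)
    shape ~ Exponential(1.0)        # stand-in for someextraparams
    somescale ~ Exponential(1.0)
    for i in eachindex(outcome)
        # some nonlinear function of the two picked-out parameters
        mu = parameter1[data1[i]] * exp(-shape * parameter2[data2[i]])
        outcome[i] ~ Normal(mu, somescale)
    end
end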

The vast majority of things I want to do Bayesian inference on fit into this kind of framework: loop over a bunch of data, and learn a couple hundred to a couple thousand parameters…

The VAST majority of Turing example models I’ve seen are under 10 parameters. Is there some resource for those of us who are dealing with “BIG” models in Turing?

Are you able to compile the reverse tape?

I have no idea what that actually means. What should I be trying?

Here’s what I’m doing:


using Turing, Optim, ReverseDiff, Memoization  # Memoization is required for setrdcache

Turing.setadbackend(:reversediff)
Turing.setrdcache(true)

op = optimize(mymodel,
              MAP(),
              Optim.ParticleSwarm(n_particles=10),
              Optim.Options(iterations=2000))

ch = sample(mymodel,
            NUTS(300, 0.65; max_depth=4), 600, init_theta=op.values.array)


It’s finding an initial step size of 2e-5 or so, and after 10 mins and 600 samples I get 1 effective sample… most parameters essentially don’t move from their initial position. Yikes!

A simple MH() type sample takes tens of seconds to do 6000 samples and gives an ESS of around 10-20.
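For the record, after some searching: as far as I can tell, “compiling the tape” in plain ReverseDiff (outside of Turing) means recording the operations once and then reusing a compiled, type-specialized version of that recording; presumably the Turing.setrdcache(true) call above, with Memoization loaded, is what caches compiled tapes. A minimal sketch with a toy function, not my actual model:

using ReverseDiff

f(x) = sum(abs2, x .- 1.0)   # toy stand-in for a log-density

x = rand(10)
tape  = ReverseDiff.GradientTape(f, x)   # record the operations once
ctape = ReverseDiff.compile(tape)        # compile the recorded tape
g = similar(x)
ReverseDiff.gradient!(g, ctape, x)       # cheap repeated gradient evaluations

# Caveat: a compiled tape freezes control flow, so this is only valid when
# branches in f don't depend on the input values.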

OK, here are some serious comparisons. I implemented exactly the same model as my friend, literally identical in every way: the priors, the likelihood, everything.

He runs his model in Stan; it has 400+ parameters and runs in:

Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 8.23303 seconds (Warm-up)
Chain 1:                2.91513 seconds (Sampling)
Chain 1:                11.1482 seconds (Total)

and produces n_eff ranging from 700 to 1900 (out of 2000 requested).

His step size was on the order of 0.17

Now I run this in Turing:



using LinearAlgebra  # for diagm

Turing.setadbackend(:reversediff)
Turing.setrdcache(true)

startdench = sample(denmodel,
                    MH(diagm([0.05 for i in 1:npars])), 1200;
                    thinning=30, init_theta=op.values.array)

It chooses a step size of 0.00039

It has currently sampled 4% of the 1200 samples I asked for, so that’s 48 samples… and it’s estimating it will take 2:20:30 (two hours and twenty minutes).

I can’t show the model, but it’s basically something in a loop like:

data[i] ~ Normal(paramA[dataval[i,1]] - paramB[dataval[i,2]], size)

and there are several hundred unique possible dataval values, so there are 400-something parameters, exactly as in his model.

Stan is more than 760x faster. Now, I could understand if Stan were 2 to 10x faster… but 760 is crazy.

My question is: is this because of inefficiency in compilation, inefficiency in autodifferentiation, a poor choice of step size, or the initialization methodology? Stan warms up in 8 seconds and finishes all 2000 samples in 11 seconds total! WTH is Turing doing with those 2 hours of computation it wants?

So I installed StatProfilerHTML and did a profile on the first 30 seconds or so of the computation… and it says 80% of the time is spent in poptask and 6% is spent inside the Turing model (macro expansion).

WTH?


Subroutines, sorted by exclusive sample count
Exclusive 	Inclusive 	Subroutine
283456 (80.31%) 	283456 (80.31%) 	poptask
21154 (5.99%) 	61551 (17.44%) 	macro expansion
9270 (2.63%) 	9270 (2.63%) 	+
7486 (2.12%) 	7677 (2.18%) 	pull_deriv!
6787 (1.92%) 	6787 (1.92%) 	*
4429 (1.25%) 	4558 (1.29%) 	hasorigin
3939 (1.12%) 	3939 (1.12%) 	setproperty!
2291 (0.65%) 	2291 (0.65%) 	RefValue
2044 (0.58%) 	13008 (3.69%) 	scalar_forward_exec!
1926 (0.55%) 	1926 (0.55%) 	collect
1641 (0.46%) 	1641 (0.46%) 	==
1638 (0.46%) 	39235 (11.12%) 	CallWrapper
1079 (0.31%) 	1079 (0.31%) 	indexed_iterate
769 (0.22%) 	769 (0.22%) 	getproperty
600 (0.17%) 	7464 (2.11%) 	binary_scalar_forward_exec!
523 (0.15%) 	13107 (3.71%) 	forward_exec!
447 (0.13%) 	447 (0.13%) 	TrackedReal
433 (0.12%) 	433 (0.12%) 	/
404 (0.11%) 	2617 (0.74%) 	#662
380 (0.11%) 	380 (0.11%) 	getindex

I imagine it will be tricky to resolve without an MWE of code that’s very slow in Turing but whose equivalent is fast in Stan.

Another instance of: “Profiling code reveals that most time is used calling 3 functions in task.jl - #15 by n8xm”?

@EvoArt, yeah, I’m going to try to make some minimal thing that will fit fast in stan and slow in Turing. I can’t use this actual model.

I’m guessing something as simple as fitting 300 independent normal distributions to 4-5 samples each. Will try to post this weekend.

@ToucheSir, thanks, I will try it with threads=1 and see what happens.

OK, I reran with one thread and got the following:



Subroutines, sorted by exclusive sample count 
Exclusive 	Inclusive 	Subroutine
28422 (79.45%) 	28422 (79.45%) 	_methods_by_ftype
1043 (2.92%) 	13677 (38.23%) 	#step#17
905 (2.53%) 	4156 (11.62%) 	GradientTape
768 (2.15%) 	14862 (41.54%) 	mcmcsample##kw
552 (1.54%) 	552 (1.54%) 	widenconst
449 (1.26%) 	452 (1.26%) 	specialize_method
386 (1.08%) 	386 (1.08%) 	Array
361 (1.01%) 	362 (1.01%) 	_growend!
287 (0.80%) 	13965 (39.04%) 	with_logstate
149 (0.42%) 	2906 (8.12%) 	assume
86 (0.24%) 	729 (2.04%) 	GeneratedFunctionStub
77 (0.22%) 	35770 (99.99%) 	eval
74 (0.21%) 	74 (0.21%) 	maybe_compress_codeinfo
56 (0.16%) 	56 (0.16%) 	getindex
48 (0.13%) 	60 (0.17%) 	cache_lookup
46 (0.13%) 	2376 (6.64%) 	copy
46 (0.13%) 	47 (0.13%) 	retrieve_code_info
45 (0.13%) 	775 (2.17%) 	get_staged
42 (0.12%) 	94 (0.26%) 	InliningTodo
42 (0.12%) 	60 (0.17%) 	iterate

So it looks like 80% of the time is spent in method dispatch, which probably means I’m doing something stupid with type instability. How do you debug the type stability of a Turing model?
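One low-tech first check, before reaching for @code_warntype: look at the element types of everything passed into the model, since an abstract or Union element type forces dynamic dispatch inside the model body. For example:

eltype(df.val)    # want Float64, not Union{Missing,Float64} or Any
eltype(df.group)  # want Int64, not Union{Missing,Int64}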

OK, I discovered that after reading my friend’s CSV file, it has some missing values in some columns… Even though I’m subsetting to the non-missing values in a new DataFrame before passing in the data, the TYPE of the column is still Union{Missing,Int64}. Since I’m modeling those columns as normally distributed, I converted to Float64 and tried running… at least now it’s not spending all its time in _methods_by_ftype.
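For reference, the conversion was along these lines (:x is a hypothetical column name):

using DataFrames

df2 = dropmissing(df, :x)   # drops rows with missings and (by default in recent DataFrames) narrows Union{Missing,Int64} to Int64
df2.x = Float64.(df2.x)     # concrete Float64 column to pass into the model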

Score 1 for Julia: profiling it is possible, whereas Stan, although it’s quite fast, is harder to profile. So when Stan is slow, it’s quite difficult to figure out why.

I’m also now passing the DataFrame into a function, where I’m creating the Turing model and then doing the sampling, so I’m not relying on any globals…
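Roughly this pattern (a sketch; makemodel and the column names are made up):

function run_fit(df)
    vals  = Float64.(df.val)    # concrete types in
    group = Int.(df.group)
    model = makemodel(vals, group)
    sample(model, NUTS(1000, 0.65), 1000)
end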

So now, I’m running a profile after converting the input type to Float64


Subroutines, sorted by exclusive sample count 
Exclusive 	Inclusive 	Subroutine
18147 (22.17%) 	66730 (81.51%) 	macro expansion
11756 (14.36%) 	12078 (14.75%) 	pull_deriv!
9492 (11.59%) 	9492 (11.59%) 	+
8982 (10.97%) 	8982 (10.97%) 	==
5717 (6.98%) 	5882 (7.18%) 	hasorigin
5634 (6.88%) 	5634 (6.88%) 	setproperty!
4867 (5.95%) 	4867 (5.95%) 	*
3202 (3.91%) 	18040 (22.04%) 	scalar_forward_exec!
1966 (2.40%) 	1966 (2.40%) 	indexed_iterate
1050 (1.28%) 	1050 (1.28%) 	getproperty
886 (1.08%) 	81105 (99.07%) 	mcmcsample##kw
699 (0.85%) 	699 (0.85%) 	/
689 (0.84%) 	18180 (22.21%) 	forward_exec!
668 (0.82%) 	668 (0.82%) 	getindex
654 (0.80%) 	9570 (11.69%) 	binary_scalar_forward_exec!
650 (0.79%) 	14383 (17.57%) 	#step#17
583 (0.71%) 	5347 (6.53%) 	unary_scalar_forward_exec!
452 (0.55%) 	893 (1.09%) 	#662
447 (0.55%) 	447 (0.55%) 	RefValue
411 (0.50%) 	411 (0.50%) 	-

At least it seems it’s spending all its time on something reasonable… It STILL thinks it wants to take 40 minutes to do the work that Stan does in 11 seconds.


Stan dev here and I’m quite interested in Turing and what Julia can do. Looking forward to seeing how this gets resolved.

Stan has had profiling capabilities since 2.26. It’s not quite as easy as the above, though. See Profiling Stan programs with CmdStanR • cmdstanr


👍

Glad the Stan devs are over here, and glad to hear you can profile Stan these days 🙂

Julia has a lot of promise, and while I love what Stan can do, the two-language problem is real, and the ability of Turing to do mixed NUTS-within-Gibbs sampling and discrete parameters is exciting.

I thought perhaps I could come up with an MWE by just trying to fit a large number of simultaneous normal distributions… So I wrote the code below, but nope, it fit within a few tens of seconds.

Actually, I don’t know what I was thinking… this code takes a variable amount of time… the last time I tried it, it was at 2% sampled and estimating 35 minutes… but after a while it sped up and took 157 seconds overall… Now I guess I gotta try it in Stan.

UGH I just realized I don’t have a working Stan at the moment… perhaps @spinkney could time these two?

using Turing, Distributions, StatsPlots, DataFrames, CSV, Random

Random.seed!(1234)

df = DataFrame(val=randn(5),group=ones(Int,5))

for i in 2:300
    mean = rand(Uniform(0,50))
    vals = rand(Normal(mean,1.0),5)
    indexes = [i for j in 1:5]
    for j in 1:5
        push!(df,[vals[j],indexes[j]])
    end
end

CSV.write("foo.csv",df)



@model function fit300norms(norms,index)

    sd ~ Gamma(2.0,1.0)
    means ~ filldist(Uniform(0,50),300)
    for i in 1:300
        norms[index[i]] ~ Normal(means[index[i]],sd)
    end
end

mymodel = fit300norms(df.val,df.group)
samps = sample(mymodel,NUTS(1000,.65),2000)


Stan version (now freshly debugged):

data {
  real vals[1500];
  int  ids[1500];
}

parameters {
  real<lower=0,upper=50> means[300];
  real<lower=0> scale;
}

model {
  scale ~ gamma(2.0, 1.0);
  for (i in 1:1500) {
    vals[i] ~ normal(means[ids[i]], 1.0);
  }
}

Actually I discovered I’ve got a working rstan… so I’ll try it myself.

Nope, no such luck… after writing the Julia dataset to foo.csv, using this:

library(rstan)
library(readr)

data = read_csv("foo.csv")

standata = list(vals=data$val,ids=data$group)
res = stan(file="stanmodel.stan",data=standata)

gives:


 *** caught segfault ***
address (nil), cause 'unknown'

Traceback:
 1: dyn.load(libLFile)
 2: cxxfunction(sig = sig, body = body, plugin = plugin, includes = includes,     settings = settings, ..., verbose = verbose)
 3: pkgbuild::with_build_tools(cxxfunction(sig = sig, body = body,     plugin = plugin, includes = includes, settings = settings,     ..., verbose = verbose), required = rstan_options("required") &&     !identical(Sys.getenv("WINDOWS"), "TRUE") && !identical(Sys.getenv("R_PACKAGE_SOURCE"),     ""))
 4: cxxfunctionplus(signature(), body = paste(" return Rcpp::wrap(\"",     model_name, "\");", sep = ""), includes = inc, plugin = "rstan",     save_dso = save_dso | auto_write, module_name = paste("stan_fit4",         model_cppname, "_mod", sep = ""), verbose = verbose)
 5: stan_model(file, model_name = model_name, model_code = model_code,     stanc_ret = NULL, boost_lib = boost_lib, eigen_lib = eigen_lib,     save_dso = save_dso, verbose = verbose)
 6: stan(file = "stanmodel.stan", data = standata)
 7: eval(ei, envir)
 8: eval(ei, envir)
 9: withVisible(eval(ei, envir))
10: source("rstan.R")

OK… after a bunch of reinstalling rstan…

Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 3.90344 seconds (Warm-up)
Chain 3:                3.59507 seconds (Sampling)
Chain 3:                7.4985 seconds (Total)
Chain 3: 

That was the longest-running chain of the 4.

So 157 seconds for Turing, 7.5 for Stan… that seems to be a reasonable benchmark.

EDIT: I’ve been finding and fixing errors/bugs, so still working on getting something that’s a proper benchmark…

Not near a computer but you can remove the loop in the Stan version. It will probably be a few seconds faster.

data {
  real vals[1500];
  int  ids[1500];
}

parameters {
  real<lower=0,upper=50> means[300];
  real<lower=0> scale;
}

model {
  scale ~ gamma(2.0, 1.0);
  vals ~ normal(means[ids], 1.0);
}

Hmmm… I’ll give it a try; wasn’t aware you could index an array with an array of indexes 🙂
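Julia does the same thing, for what it’s worth:

means = collect(1.0:5.0)
ids   = [1, 1, 2, 5]
means[ids]    # returns [1.0, 1.0, 2.0, 5.0]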

At the moment I’m fooling with the Julia version, because I realized I didn’t use the ReverseDiff AD backend, so I’m enabling that, as well as the cache, which should help.

Also, it seems like Turing and Stan mean different things by total samples: if you ask for 1000 warmup and 2000 samples in Stan, it does 1000 warmup iterations and then another 1000 samples, whereas in Turing it’s 1000 warmup and then 2000 kept.
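In other words, as I understand the defaults:

# Stan: iter=2000, warmup=1000 means 1000 warmup iterations + 1000 kept draws
# Turing: the call below runs 1000 adaptation steps, then keeps all 2000 requested draws
sample(model, NUTS(1000, 0.65), 2000)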

Yeah, I get a big speed-up with

using Turing, DataFrames, Random, ReverseDiff
using Memoization

Turing.setadbackend(:reversediff)
Turing.setrdcache(true)

Yep, and also when you fix a bug in the code 🙂

this took 25 seconds or so:

using Pkg
Pkg.activate(".")

using Turing, Distributions, StatsPlots, DataFrames, CSV, Random, ReverseDiff, Memoization

Random.seed!(1234)

df = DataFrame(val=randn(5),group=ones(Int,5))

for i in 2:300
    mean = rand(Uniform(0,50))
    vals = rand(Normal(mean,1.0),5)
    indexes = [i for j in 1:5]
    for j in 1:5
        push!(df,[vals[j],indexes[j]])
    end
end

CSV.write("foo.csv",df)

Turing.setadbackend(:reversediff)
Turing.setrdcache(true)

@model function fit300norms(norms,index)

    sd ~ Gamma(2.0,1.0)
    means ~ filldist(Uniform(0,50),300)
    for i in 1:length(norms)
        norms[i] ~ Normal(means[index[i]],sd)
    end
end

mymodel = fit300norms(df.val,df.group)
@time samps = sample(mymodel, NUTS(1000, 0.65), 1000)

So I guess not a good benchmark.


On my machine, this sped it up a little as well,

@model function fit300norms(norms, index)
    sd ~ Gamma(2.0, 1.0)
    means ~ filldist(Uniform(0, 50), 300)
    norms ~ MvNormal(view(means, index), sd)
end

Yep, on my machine that’s 14 seconds, so at least it’s clear that you can make Turing come within a factor of 2 or so of Stan on this kind of high-dimensional model… so whatever is wrong with the real model is probably… something else.

Are you using the same size data set? In the Stan model, it looks like it expects 1500 data points, not 300?

Also, the target acceptance rate is set to 0.65 for Turing. Is rstan still at its default (0.8), or did you lower it to match?

Yeah, it was a bug; it’s fixed in the version posted above.
