Additionally, you can use the compute duration reported by the MCMCChains object. This is similar to what you are doing with Stan. We are not counting compilation time for the Stan model, so we probably shouldn't for Turing either.
Good point. But the goal was to find an example problem I could share where the difference was, say, a factor of 700x… unfortunately no luck yet.
Ha! Sorry, I was getting carried away!
Turing has improved, but in the past I encountered problems with hierarchical models. That might be something to investigate if you want to find a large performance gap.
Also, have you compared Turing and Stan as a function of the number of parameters? The difference might be 2x with 300 parameters, but that does not reveal much about how each scales. What happens if you use 3,000 parameters?
So, this was the key for my “real” model… By expressing the likelihood as something like:
a = view(thing,which1) .- view(thing,which2) + ...
data ~ MvNormal(a,errsize)
Instead of:
for i in 1:obs
    data[i] ~ Normal(thing[which1[i]] - thing[which2[i]], errsize)
end
It went from wanting hours to compute to seconds.
I suspect that the “tape” created by ReverseDiff was where all the time was being spent; with the vectorized form above, it generates a vastly simpler gradient.
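A minimal sketch of the two formulations, assuming `thing` is the latent parameter vector and `which1`/`which2` are integer index vectors; the priors, sizes, and model names below are placeholders, not the poster's real model:

using Turing, LinearAlgebra

# Vectorized form: a single MvNormal over all observations gives ReverseDiff a
# short, simple expression to differentiate.
@model function vectorized(data, which1, which2, errsize, nparams)
    thing ~ filldist(Normal(0, 1), nparams)        # placeholder prior
    a = view(thing, which1) .- view(thing, which2)
    data ~ MvNormal(a, errsize^2 * I)              # same intent as MvNormal(a, errsize) above
end

# Scalar-loop form: one univariate Normal per observation builds a much larger tape.
@model function looped(data, which1, which2, errsize, nparams)
    thing ~ filldist(Normal(0, 1), nparams)
    for i in eachindex(data)
        data[i] ~ Normal(thing[which1[i]] - thing[which2[i]], errsize)
    end
end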
Thanks to everyone who helped. And for @spinkney: this is basically O(1) compared to the Stan time, so I’ve identified the slowdown.
In Julia loops are normally fast, but when you have to calculate a gradient using autodiff, it seems that loops are really NOT fast compared to vectorizing that computation.
Great news. I think the issue with loops being slow only applies to reverse-mode autodiff.
How does one improve the speed of a model without a multivariate Gaussian likelihood? I have a hierarchical multinomial processing tree model that also has many parameters. Here are two files containing the code: the first contains supporting functions and the second is a run script.
Is there any alternative to this for loop?
Good question. As an aside:
Turing.setrdcache(true)
Turing.setadbackend(:reversediff)
you might want to do these in the opposite order. I’m not sure it really matters, but the ReverseDiff cache presumably only applies when you’re using ReverseDiff, so you’d want to select that backend before turning on the cache. Or maybe it doesn’t matter, but I don’t think it could hurt.
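For concreteness, the reordered calls would just be:

# Select the ReverseDiff backend first, then enable its tape cache.
Turing.setadbackend(:reversediff)
Turing.setrdcache(true)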
When it comes to your model, I’m not at all clear why Model takes a whole vector of ns?
You might get away with something like
Turing.@addlogprob! sum(logpdf(Model(ns,r[i],g[i]),data[i]) for i in 1:n)
in which case, the autodiff might realize it’s differentiating a sum rather than something more complicated?
It’s also possible the syntax
data ~ [Model(ns,r[i],g[i]) for i in 1:n]
would work, as described in the Advanced Usage docs.
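A rough sketch of how the @addlogprob! version might sit inside a @model block; the priors and the exact signatures of Model and the data are assumptions here, since the real definitions live in the poster's files:

using Turing

@model function mpt(data, ns, n)
    # placeholder priors; the real ones are defined in the poster's run script
    r ~ filldist(Beta(1, 1), n)
    g ~ filldist(Beta(1, 1), n)
    # add the whole likelihood in one go, so the AD sees a single sum
    Turing.@addlogprob! sum(logpdf(Model(ns, r[i], g[i]), data[i]) for i in 1:n)
end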
Thank you for your response. The order of caching and setting the AD backend did not appear to make a difference. The list comprehension was about 2 seconds slower, but the @addlogprob! statement reduced the run time from about 6.5 seconds to 4.5. According to the documentation, reverse-mode AD works best with a multivariate likelihood function, but I don’t think that is applicable here. Moreover, that may not matter so much as how it is implemented under the hood.
This particular model has a product multinomial likelihood function (with the binomial being a special case). Each element in ns is the number of trials for a different multinomial likelihood. Although I arbitrarily fixed both elements to 100, they could be different in practice, and there could be more than 2 elements in the vector. In the actual model I am building, the length of ns is 12.
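For reference, a hypothetical sketch of what the Model type might look like, inferred from the logpdf and rand methods shown below; the real definition is in the poster's supporting-functions file:

using Distributions

# Assumed layout: ns holds the per-component trial counts, and r and g are the MPT
# parameters that compute_probs turns into per-component success probabilities.
struct Model{T<:Real} <: DiscreteMultivariateDistribution
    ns::Vector{Int}
    r::T
    g::T
end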
A few potential improvements:
# non-allocating sum
function logpdf(dist::Model, data::Vector{Int})
    θ = compute_probs(dist.r, dist.g)
    return sum(eachindex(dist.ns, data)) do i
        return @inbounds logpdf(Binomial(dist.ns[i], θ), data[i])
    end
end

# make rand reproducible
function rand(rng::AbstractRNG, dist::Model)
    θ = compute_probs(dist.r, dist.g)
    return @. rand(rng, Binomial(dist.ns, θ))
end

# in the model, try
data .~ Model.(Ref(ns), r, g)
Can we unpack this a little? I don’t know what it means… In his model the “broadcast” needs to be over data, r, and g, which are all three the same dimension, with ns staying constant. Is that why you use Ref(ns): so that it’s a single non-broadcastable thing and the broadcasting doesn’t get confused?
I believe that you are correct. Ref prevents broadcasting from operating on ns. I suspect that it also prevents an allocation, whereas (ns,) does not.
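A toy demonstration of the Ref behavior (not from the thread), using tuple in place of the Model constructor:

ns = [100, 100]
r  = [0.2, 0.5, 0.8]
g  = [0.4, 0.4, 0.4]

# Ref(ns) is treated as a single element, so broadcasting iterates only r and g:
tuple.(Ref(ns), r, g)    # 3-element Vector; each entry holds the full ns vector plus one (r, g) pair
# tuple.((ns,), r, g) behaves the same way; without either wrapper, broadcasting
# would try to match ns's length (2) against r and g's length (3) and throw a DimensionMismatch.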
And were you able to try that code, and did it work? How was the timing?
I don’t understand why the timing would be very different from, for example, a for loop:
for i in 1:length(data)
    data[i] ~ Model(ns, r[i], g[i])
end
Except that the autodiff machinery may see these as completely different computations, with the broadcasted version hitting a special case that’s fast and the loop doing something general and slow.
Thanks. Your recommended changes produced a speedup similar to using @addlogprob!.
I had to modify logpdf slightly in order for it to work:
function logpdf(dist::Model, data::Vector{Int})
    θ = compute_probs(dist.r, dist.g)
    return sum(eachindex(dist.ns, data)) do i
        @inbounds logpdf(Binomial(dist.ns[i], θ[i]), data[i])
    end
end
Just for giggles I tried using Zygote on the version with MvNormal that samples in 5-10 seconds.
It failed with a backtrace:
ERROR: LoadError: MethodError: no method matching (::ChainRulesCore.ProjectTo{Float64, NamedTuple{(), Tuple{}}})(::NamedTuple{(:x,), Tuple{Int64}})
Closest candidates are:
(::ChainRulesCore.ProjectTo{T})(::AbstractFloat) where T<:AbstractFloat at /home/dlakelan/.julia/packages/ChainRulesCore/8vlYQ/src/projection.jl:167
(::ChainRulesCore.ProjectTo)(::ChainRulesCore.Thunk) at /home/dlakelan/.julia/packages/ChainRulesCore/8vlYQ/src/projection.jl:124
(::ChainRulesCore.ProjectTo{<:Number})(::ChainRulesCore.Tangent{<:Number}) at /home/dlakelan/.julia/packages/ChainRulesCore/8vlYQ/src/projection.jl:185
...
Stacktrace:
[1] (::ChainRulesCore.ProjectTo{Ref, NamedTuple{(:type, :x), Tuple{DataType, ChainRulesCore.ProjectTo{Float64, NamedTuple{(), Tuple{}}}}}})(dx::Base.RefValue{Any})
@ ChainRulesCore ~/.julia/packages/ChainRulesCore/8vlYQ/src/projection.jl:275
[2] #56
@ ~/.julia/packages/ChainRulesCore/8vlYQ/src/projection.jl:230 [inlined]
[3] #4
@ ./generator.jl:36 [inlined]
[4] iterate
@ ./generator.jl:47 [inlined]
[5] collect(itr::Base.Generator{Base.Iterators.Zip{Tuple{Vector{ChainRulesCore.ProjectTo{Ref, NamedTuple{(:type, :x), Tuple{DataType, ChainRulesCore.ProjectTo{Float64, NamedTuple{(), Tuple{}}}}}}}, Vector{Base.RefValue{Any}}}}, Base.var"#4#5"{ChainRulesCore.var"#56#57"}})
@ Base ./array.jl:710
[6] map
@ ./abstractarray.jl:2948 [inlined]
That looks like a bug somewhere in the Zygote/ChainRulesCore integration. Can you open an issue on Zygote.jl?
Great to hear! None of this makes any sense to me…yet!
Does Turing implement derivatives for any of the distributions, or does it rely on autodiff for everything? If it’s the latter, then you should compare against a distribution in Stan that doesn’t have analytic derivatives, or write your own normal in the functions block (fwiw, multi_normal does not have derivatives, multi_normal_cholesky does).
Turing currently uses the general-purpose Distributions.jl package. Distributions isn’t always the most AD-friendly package, so the Turing devs maintain DistributionsAD.jl, which (by committing type piracy) tries to smooth over some of the rough spots between Distributions and the AD packages.
I don’t think it implements any specific rules for the distributions used here.
And it’s only slow if you don’t compile the tape. If you do tape compilation, then ReverseDiff on scalar loops goes a few orders of magnitude faster, hence the question at the start (mostly to the Turing.jl devs as to whether they have exposed it).
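As a sketch of what tape compilation means in plain ReverseDiff, outside Turing (this uses the generic ReverseDiff API, not a Turing-specific setting, and the objective is a toy stand-in):

using ReverseDiff

f(x) = sum(abs2, x)                                   # toy objective
const f_tape     = ReverseDiff.GradientTape(f, rand(100))
const f_compiled = ReverseDiff.compile(f_tape)        # compile the tape once...

grad = zeros(100)
ReverseDiff.gradient!(grad, f_compiled, rand(100))    # ...then reuse it for fast gradient calls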
I didn’t see anything in the manual about tape compilation. When I profiled, I saw a lot of “interpreter”-related calls.