Iβve been playing around with Gen.jl for a couple of weeks and one of the things I did was compare its performance to a hand-coded model in order to measure the overhead of the underlying trace data structure, the dynamic graph stuff and whatnot.

The example I chose to post here, for no particular reason other than simplicity, is a random-walk MH algorithm where the target is a standard normal distribution and the proposal is a uniform distribution centered on the current state: x_t ~ Uniform(x_{t-1} - d, x_{t-1} + d), with d = 0.25.

no Gen
```julia
logpdf = x -> -.5 * x^2

@benchmark begin
    nowAt = 0.1
    M = 10
    trace = Array{Float64, 1}(undef, M)
    trace[1] = nowAt
    for i = 2:M
        nextMaybe = nowAt + (rand() - .5) / 2
        if logpdf(nextMaybe) - logpdf(nextMaybe) > log(rand())
            nowAt = nextMaybe
        end
        trace[i] = nowAt
    end
end
```

Benchmarking:

```
BenchmarkTools.Trial: 10000 samples with 63 evaluations.
 Range (min … max):  908.111 ns … 86.429 μs  ┊ GC (min … max): 0.00% … 98.50%
 Time  (median):     977.913 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.156 μs ±  2.299 μs  ┊ GC (mean ± σ):  5.59% ± 2.79%

 908 ns        Histogram: log(frequency) by time      2.01 μs <

 Memory estimate: 1.00 KiB, allocs estimate: 55.
```
yes Gen
```julia
@gen function normalModel()
    x ~ normal(0, 1)
end;

@gen function proposal(nowAt, d)
    x ~ uniform(nowAt[:x] - d, nowAt[:x] + d)
end;

initTrace, _ = generate(normalModel, ());
@benchmark let nowAt = initTrace
    M = 10
    trace = Array{Float64, 1}(undef, M)
    trace[1] = nowAt[:x]
    for i = 2:M
        nowAt, _ = mh(nowAt, proposal, (.25,))
        trace[i] = nowAt[:x]
    end
end
```

Benchmarking:

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  56.838 μs …   7.063 ms  ┊ GC (min … max): 0.00% … 98.10%
 Time  (median):     59.847 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   72.540 μs ± 186.232 μs  ┊ GC (mean ± σ):  7.61% ± 2.95%

 56.8 μs       Histogram: log(frequency) by time       132 μs <

 Memory estimate: 75.45 KiB, allocs estimate: 993.
```

I get that Gen wasn't made to be competitive speed-wise, nor to be used in such a simple algorithm, but is this much of a slowdown to be expected? I tried profiling but can't properly interpret the results.

How is this overhead related to the complexity of the inference algorithm? Does it increase when using block updates of various kinds, in transdimensional algorithms, etc.?

Hi there! Thanks for trying out Gen.

Genβs dynamic DSL is useful for prototyping, but if you need the best performance, you should use Genβs static DSL (Built-in Modeling Language Β· Gen):

```julia
@gen (static) function normalModel()
    x ~ normal(0, 1)
end;

@gen (static) function proposal(nowAt, d)
    current = nowAt[:x]
    x ~ uniform(current - d, current + d)
end;

initTrace, _ = generate(normalModel, ());
@benchmark let nowAt = initTrace
    M = 10
    for i = 2:M
        nowAt, _ = mh(nowAt, proposal, (.25,))
    end
end
```

Results:

```
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.185 μs … 776.407 μs  ┊ GC (min … max):  0.00% … 99.41%
 Time  (median):     2.310 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.747 μs ±  17.114 μs  ┊ GC (mean ± σ):  13.91% ± 2.22%

 2.19 μs         Histogram: frequency by time        3.14 μs <

 Memory estimate: 4.78 KiB, allocs estimate: 126.
```

Machines vary, so for comparison, here's what happens when I run the "no Gen" code:

```julia
mylogpdf = x -> -.5 * x^2

@benchmark begin
    nowAt = 0.1
    M = 10
    for i = 2:M
        nextMaybe = nowAt + (rand() - .5) / 2
        if mylogpdf(nextMaybe) - mylogpdf(nextMaybe) > log(rand())
            nowAt = nextMaybe
        end
    end
end
```

Results:

```
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.142 μs … 415.492 μs  ┊ GC (min … max): 0.00% … 99.66%
 Time  (median):     1.179 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.244 μs ±   4.144 μs  ┊ GC (mean ± σ):  3.33% ± 1.00%

 1.14 μs      Histogram: log(frequency) by time      1.51 μs <

 Memory estimate: 864 bytes, allocs estimate: 54.
```

So, thereβs still a bit of overhead, but itβs manageable β around 2x.

For more complex models, Gen's static modeling language automatically applies optimizations that should make it faster than naive hand-coded implementations. Check out, e.g., Table 2 in our paper: https://dl.acm.org/doi/pdf/10.1145/3314221.3314642

As for your question on how the overhead scales with the algorithm you're implementing: I expect Gen's dynamic DSL to have constant-factor overhead compared to naive hand-coded implementations. So the high overhead you're seeing will still be there, but at least the relative overhead will stay constant as the model grows, and shouldn't blow up. By "naive hand-coded implementation," I mean, e.g., that at each iteration of MH you evaluate the entire logpdf function, and don't do any optimizations to quickly evaluate the change in logpdf. (For Gen to automate those optimizations, you need to use the static DSL.)
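To illustrate what that optimization buys (a minimal sketch with made-up helper names, not Gen API): for a target with independent terms, a single-site update only changes one term of the log-density, so the naive version does O(N) work per accept/reject while the incremental one does O(1):

```julia
# Sketch: single-site random-walk MH over a vector z with iid N(0,1) target.
logp_full(z) = -0.5 * sum(abs2, z)   # full joint log-density, up to a constant

# Naive: re-evaluate the entire log-density at every proposal (O(N) per update).
function mh_naive!(z, d, iters)
    for _ in 1:iters, i in eachindex(z)
        znew = copy(z)
        znew[i] += (rand() - 0.5) * 2d
        if logp_full(znew) - logp_full(z) > log(rand())
            z .= znew
        end
    end
    return z
end

# Incremental: only the i-th term of the log-density changes (O(1) per update).
function mh_incremental!(z, d, iters)
    for _ in 1:iters, i in eachindex(z)
        zi_new = z[i] + (rand() - 0.5) * 2d
        Δ = -0.5 * zi_new^2 - (-0.5 * z[i]^2)
        if Δ > log(rand())
            z[i] = zi_new
        end
    end
    return z
end
```

Both kernels target the same distribution; they differ only in how much of the log-density they recompute per move.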


Doesnβt `logpdf(nextMaybe) - logpdf(nextMaybe) > log(rand())` always accept?

@Alex_Lew / @ptotolo It seems like, in both posts, the second `nextMaybe` should be `nowAt`.

Second, shouldnβt the handcoded (no Gen) implementation include correction terms for the proposal? Otherwise, I donβt think the comparison is strictly valid β one has to churn through more instructions than the other.

Edit: the sampling seems fine; I still think the issue is with evaluating accept/reject.

Edit 2: Oh yes, the `1 / 2d` proposal density cancels in the kernel term on both sides…

Right, so maybe just change the second `nextMaybe` to `nowAt` to correctly compute the `P / P` term.
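Putting the thread's conclusion together, a corrected version of the hand-coded kernel could read as follows (a sketch; `rwmh` and `target_logpdf` are just illustrative names). Since the uniform proposal is symmetric, its `1/(2d)` density cancels on both sides of the ratio and no Hastings correction term is needed:

```julia
target_logpdf(x) = -0.5 * x^2   # standard-normal log-density, up to a constant

function rwmh(x0, d, M)
    trace = Vector{Float64}(undef, M)
    nowAt = x0
    trace[1] = nowAt
    for i in 2:M
        nextMaybe = nowAt + (rand() - 0.5) * 2d   # ~ Uniform(nowAt - d, nowAt + d)
        # Corrected acceptance: compare the proposed state against the *current* one.
        # The symmetric proposal's 1/(2d) density cancels, so no Hastings term.
        if target_logpdf(nextMaybe) - target_logpdf(nowAt) > log(rand())
            nowAt = nextMaybe
        end
        trace[i] = nowAt
    end
    return trace
end
```

With this fix, the chain actually rejects some moves, so the hand-coded timings above would change slightly (rejected moves skip the state update, but the per-iteration cost is essentially the same).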