Understanding Gen.jl overhead

I’ve been playing around with Gen.jl for a couple of weeks and one of the things I did was compare its performance to a hand-coded model in order to measure the overhead of the underlying trace data structure, the dynamic graph stuff and whatnot.

The example I chose to post here for no particular reason other than simplicity is a Random Walk MH algorithm where our target is a standard normal distribution and the proposal is a scaled uniform distribution around the current state: x_t ~ Uniform(x_{t-1} - d, x_{t-1} + d), with d = 0.25.

no Gen
logpdf = x -> -.5 * x^2

@benchmark begin
nowAt = 0.1
M = 10
trace = Array{Float64, 1}(undef, M)
trace[1] = nowAt
for i = 2:M
    nextMaybe = nowAt + (rand()-.5)/2 
    if logpdf(nextMaybe) - logpdf(nextMaybe) > log(rand())
        nowAt = nextMaybe
    end
    trace[i] = nowAt
end
end

Benchmarking:

BenchmarkTools.Trial: 10000 samples with 63 evaluations.
 Range (min … max):  908.111 ns … 86.429 μs  ┊ GC (min … max): 0.00% … 98.50%
 Time  (median):     977.913 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.156 μs ±  2.299 μs  ┊ GC (mean ± σ):  5.59% ±  2.79%

  ▅▇█▅▄▃▂▁    ▅▆▅▅▄▃▃▂▂▁ ▁               ▁                     ▂
  ██████████▇▇██████████████████▇▇▆▇▇▇▇▆███▇▇▆▅▇▆▆▅▆▆▄▆▇▇▆▄▄▄▃ █
  908 ns        Histogram: log(frequency) by time      2.01 μs <

 Memory estimate: 1.00 KiB, allocs estimate: 55.
yes Gen
@gen function normalModel()
    x ~ normal(0,1)
end;

@gen function proposal(nowAt, d)
    x ~ uniform(nowAt[:x] - d, nowAt[:x] + d)
end;

initTrace, _ = generate(normalModel, ());
@benchmark let nowAt = initTrace
M = 10
trace = Array{Float64, 1}(undef, M)
trace[1] = nowAt[:x]
for i = 2:M
    nowAt, _ = mh(nowAt, proposal, (.25,))
    trace[i] = nowAt[:x]
end
end

Benchmarking:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  56.838 μs …   7.063 ms  ┊ GC (min … max): 0.00% … 98.10%
 Time  (median):     59.847 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   72.540 μs ± 186.232 μs  ┊ GC (mean ± σ):  7.61% ±  2.95%

  ▆█▆▅▄▃▃▃▂▂▂▂▄▃▃▂▂▁▁▁▁                                        ▂
  ████████████████████████▇██▇█▇████▇████▇▇▇▇█▇▇▆▇▅▆▆▅▅▅▄▅▅▄▅▅ █
  56.8 μs       Histogram: log(frequency) by time       132 μs <

 Memory estimate: 75.45 KiB, allocs estimate: 993.

I get that Gen wasn’t made to be competitive speed wise neither to be used in such a simple algorithm, but is this much of a slowdown to be expected? I tried profiling but can’t properly interpret the results.

How is this overhead related to the complexity of the inference algorithm? Does it increase when using block updates of various kinds, in transdimensional algorithms, etc.?