Understanding Gen.jl overhead

I’ve been playing around with Gen.jl for a couple of weeks, and one of the things I did was compare its performance to a hand-coded model in order to measure the overhead of the underlying trace data structure, the dynamic computation graph, and so on.

The example I chose to post here, for no particular reason other than simplicity, is a random-walk MH algorithm where the target is a standard normal distribution and the proposal is a scaled uniform distribution around the current state: x_t ~ Uniform(x_{t-1} - d, x_{t-1} + d), with d = 0.25.
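For context (my notation, not from the original post): because the uniform proposal is symmetric, the MH acceptance rule reduces to a comparison of target log-densities, and the Gaussian normalizing constant cancels in the difference, which is why the logpdf below drops it:

\log\alpha = \log\pi(x') - \log\pi(x) = -\tfrac{1}{2}(x')^2 + \tfrac{1}{2}x^2, \qquad \text{accept iff } \log u < \log\alpha, \; u \sim \mathrm{Uniform}(0, 1)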

no Gen
logpdf = x -> -.5 * x^2  # standard-normal log-density, up to an additive constant

@benchmark begin
    nowAt = 0.1
    M = 10
    trace = Array{Float64, 1}(undef, M)
    trace[1] = nowAt
    for i = 2:M
        # propose from Uniform(nowAt - d, nowAt + d) with d = 0.25
        nextMaybe = nowAt + (rand() - .5) / 2
        if logpdf(nextMaybe) - logpdf(nextMaybe) > log(rand())
            nowAt = nextMaybe
        end
        trace[i] = nowAt
    end
end

Benchmarking:

BenchmarkTools.Trial: 10000 samples with 63 evaluations.
 Range (min … max):  908.111 ns … 86.429 μs  ┊ GC (min … max): 0.00% … 98.50%
 Time  (median):     977.913 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.156 μs ±  2.299 μs  ┊ GC (mean ± σ):  5.59% ± 2.79%

  ▅▇█▅▄▃▂▁    ▅▆▅▅▄▃▃▂▂▁ ▁               ▁                     ▂
  ██████████▇▇██████████████████▇▇▆▇▇▇▇▆███▇▇▆▅▇▆▆▅▆▆▄▆▇▇▆▄▄▄▃ █
  908 ns        Histogram: log(frequency) by time      2.01 μs <

 Memory estimate: 1.00 KiB, allocs estimate: 55.

yes Gen
@gen function normalModel()
    x ~ normal(0,1)
end;

@gen function proposal(nowAt, d)
    x ~ uniform(nowAt[:x] - d, nowAt[:x] + d)
end;

initTrace, _ = generate(normalModel, ());
@benchmark let nowAt = initTrace
    M = 10
    trace = Array{Float64, 1}(undef, M)
    trace[1] = nowAt[:x]
    for i = 2:M
        nowAt, _ = mh(nowAt, proposal, (.25,))
        trace[i] = nowAt[:x]
    end
end

Benchmarking:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  56.838 μs …   7.063 ms  ┊ GC (min … max): 0.00% … 98.10%
 Time  (median):     59.847 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   72.540 μs ± 186.232 μs  ┊ GC (mean ± σ):  7.61% ± 2.95%

  ▆█▆▅▄▃▃▃▂▂▂▂▄▃▃▂▂▁▁▁▁                                        ▂
  ████████████████████████▇██▇█▇████▇████▇▇▇▇█▇▇▆▇▅▆▆▅▅▅▄▅▅▄▅▅ █
  56.8 μs       Histogram: log(frequency) by time       132 μs <

 Memory estimate: 75.45 KiB, allocs estimate: 993.

I get that Gen wasn’t made to be competitive speed-wise, nor to be used in such a simple algorithm, but is this much of a slowdown to be expected? I tried profiling, but I can’t properly interpret the results.

How is this overhead related to the complexity of the inference algorithm? Does it increase when using block updates of various kinds, in transdimensional algorithms, etc.?

Hi there! Thanks for trying out Gen.

Gen’s dynamic DSL is useful for prototyping, but if you need the best performance, you should use Gen’s static DSL (Built-in Modeling Language · Gen):

@gen (static) function normalModel()
    x ~ normal(0, 1)
end;

@gen (static) function proposal(nowAt, d)
    current = nowAt[:x]
    x ~ uniform(current - d, current + d)
end;

Gen.@load_generated_functions

initTrace, _ = generate(normalModel, ());
@benchmark let nowAt = initTrace
    M = 10
    for i = 2:M
        nowAt, _ = mh(nowAt, proposal, (.25,))
    end
end

Results:

BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.185 μs … 776.407 μs  ┊ GC (min … max):  0.00% … 99.41%
 Time  (median):     2.310 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.747 μs ±  17.114 μs  ┊ GC (mean ± σ):  13.91% ± 2.22%

      ▃▄█▃▂
  ▂▂▄▆█████▆▆▄▅▄▅▄▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂ ▃
  2.19 μs         Histogram: frequency by time        3.14 μs <

 Memory estimate: 4.78 KiB, allocs estimate: 126.

Machines vary, so for comparison, here’s what happens when I run the “no Gen” code:

mylogpdf = x -> -.5 * x^2

@benchmark begin
    nowAt = 0.1
    M = 10
    for i = 2:M
        nextMaybe = nowAt + (rand() - .5) / 2
        if mylogpdf(nextMaybe) - mylogpdf(nextMaybe) > log(rand())
            nowAt = nextMaybe
        end
    end
end

Results:

BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.142 μs … 415.492 μs  ┊ GC (min … max): 0.00% … 99.66%
 Time  (median):     1.179 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.244 μs ±   4.144 μs  ┊ GC (mean ± σ):  3.33% ± 1.00%

  ▆█▇▅▅▇▅▅▁▁▃▇▆▇▄▅▂▃▁▂                                        ▂
  ████████████████████▇█▇█▇▇▆█▅█▅█▅▇▅█▄▆▅▅▄▇▄▆▄▆▄▇▅▇▇█▇▇▅▆▅▅▅ █
  1.14 μs      Histogram: log(frequency) by time      1.51 μs <

 Memory estimate: 864 bytes, allocs estimate: 54.

So, there’s still a bit of overhead, but it’s manageable – around 2x.

For more complex models, Gen’s static modeling language automatically applies optimizations that should make it faster than naive hand-coded implementations. Check out, e.g., Table 2 in our paper: https://dl.acm.org/doi/pdf/10.1145/3314221.3314642

As for your question about how the overhead scales with the algorithm you’re implementing: I expect Gen’s dynamic DSL to have a constant-factor overhead compared to naive hand-coded implementations. So the high overhead you’re seeing will still be there, but the relative overhead should stay constant as the model grows, rather than blowing up. By “naive hand-coded implementation,” I mean, e.g., that at each iteration of MH you evaluate the entire logpdf function, with no optimizations to quickly evaluate the change in logpdf. (For Gen to automate those optimizations, you need to use the static DSL.)
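
To make that concrete, here is a toy sketch of the difference (my own illustration, not Gen’s internals; the function names are made up) for an N-dimensional standard-normal target, where updating one coordinate naively costs O(N) but incrementally costs O(1):

logp(x) = -0.5 * x^2  # per-coordinate standard-normal log-density, up to a constant

# Naive MH update of coordinate i: re-scores the entire joint logpdf, O(N) work.
function naive_step!(xs, i, d)
    x_new = xs[i] + (rand() - 0.5) * 2 * d
    xs_prop = copy(xs)
    xs_prop[i] = x_new
    if sum(logp, xs_prop) - sum(logp, xs) > log(rand())
        xs[i] = x_new
    end
end

# Incremental MH update: only the i-th term of the joint changes, O(1) work.
function incremental_step!(xs, i, d)
    x_new = xs[i] + (rand() - 0.5) * 2 * d
    if logp(x_new) - logp(xs[i]) > log(rand())
        xs[i] = x_new
    end
end

Both functions implement the same kernel; the incremental one just exploits the fact that the coordinates are independent under the target, which is the kind of dependency information the static DSL can extract from the model.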

Doesn’t logpdf(nextMaybe) - logpdf(nextMaybe) > log(rand()) always accept?

@Alex_Lew / @ptotolo It seems that, in both posts, the second nextMaybe should be nowAt.

Second, shouldn’t the hand-coded (no Gen) implementation include correction terms for the proposal? Otherwise, I don’t think the comparison is strictly valid – one has to churn through more instructions than the other.

Edit: the sampling seems fine; I still think the issue is with evaluating the accept/reject step.

Edit 2: Oh yes, the 1/(2d) proposal factor cancels in the kernel term on both sides…

Right, so maybe just change the second nextMaybe to nowAt to correctly compute the log-ratio of target densities.
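
For reference, the corrected accept/reject step in the hand-coded version would then read:

if logpdf(nextMaybe) - logpdf(nowAt) > log(rand())  # log-target at proposal minus log-target at current state
    nowAt = nextMaybe
end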