Understanding Gen.jl overhead

ptotolo · February 19, 2022, 3:22pm

I’ve been playing around with Gen.jl for a couple of weeks and one of the things I did was compare its performance to a hand-coded model in order to measure the overhead of the underlying trace data structure, the dynamic graph stuff and whatnot.

The example I chose to post here, for no particular reason other than simplicity, is a random-walk MH algorithm where the target is a standard normal distribution and the proposal is a uniform distribution centered on the current state: x_t ~ Uniform(x_{t-1} - d, x_{t-1} + d), with d = 0.25.

no Gen

using BenchmarkTools

logpdf = x -> -.5 * x^2

@benchmark begin
nowAt = 0.1
M = 10
trace = Array{Float64, 1}(undef, M)
trace[1] = nowAt
for i = 2:M
    nextMaybe = nowAt + (rand()-.5)/2 
    if logpdf(nextMaybe) - logpdf(nextMaybe) > log(rand())
        nowAt = nextMaybe
    end
    trace[i] = nowAt
end
end

Benchmarking:

BenchmarkTools.Trial: 10000 samples with 63 evaluations.
 Range (min … max):  908.111 ns … 86.429 μs  ┊ GC (min … max): 0.00% … 98.50%
 Time  (median):     977.913 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.156 μs ±  2.299 μs  ┊ GC (mean ± σ):  5.59% ±  2.79%

  ▅▇█▅▄▃▂▁    ▅▆▅▅▄▃▃▂▂▁ ▁               ▁                     ▂
  ██████████▇▇██████████████████▇▇▆▇▇▇▇▆███▇▇▆▅▇▆▆▅▆▆▄▆▇▇▆▄▄▄▃ █
  908 ns        Histogram: log(frequency) by time      2.01 μs <

 Memory estimate: 1.00 KiB, allocs estimate: 55.

yes Gen

using Gen

@gen function normalModel()
    x ~ normal(0,1)
end;

@gen function proposal(nowAt, d)
    x ~ uniform(nowAt[:x] - d, nowAt[:x] + d)
end;

initTrace, _ = generate(normalModel, ());
@benchmark let nowAt = initTrace
M = 10
trace = Array{Float64, 1}(undef, M)
trace[1] = nowAt[:x]
for i = 2:M
    nowAt, _ = mh(nowAt, proposal, (.25,))
    trace[i] = nowAt[:x]
end
end

Benchmarking:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  56.838 μs …   7.063 ms  ┊ GC (min … max): 0.00% … 98.10%
 Time  (median):     59.847 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   72.540 μs ± 186.232 μs  ┊ GC (mean ± σ):  7.61% ±  2.95%

  ▆█▆▅▄▃▃▃▂▂▂▂▄▃▃▂▂▁▁▁▁                                        ▂
  ████████████████████████▇██▇█▇████▇████▇▇▇▇█▇▇▆▇▅▆▆▅▅▅▄▅▅▄▅▅ █
  56.8 μs       Histogram: log(frequency) by time       132 μs <

 Memory estimate: 75.45 KiB, allocs estimate: 993.

I get that Gen wasn’t made to be competitive speed-wise, nor to be used for such a simple algorithm, but is this much of a slowdown to be expected? I tried profiling but can’t properly interpret the results.

How is this overhead related to the complexity of the inference algorithm? Does it increase when using block updates of various kinds, in transdimensional algorithms, etc.?

Alex_Lew · June 22, 2022, 6:11pm

Hi there! Thanks for trying out Gen.

Gen’s dynamic DSL is useful for prototyping, but if you need the best performance, you should use Gen’s static DSL (see the “Built-in Modeling Language” page of the Gen documentation):

@gen (static) function normalModel()
    x ~ normal(0, 1)
end;

@gen (static) function proposal(nowAt, d)
    current = nowAt[:x]
    x ~ uniform(current - d, current + d)
end;

Gen.@load_generated_functions

initTrace, _ = generate(normalModel, ());
@benchmark let nowAt = initTrace
M = 10
for i = 2:M
    nowAt, _ = mh(nowAt, proposal, (.25,))
end
end

Results:

BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.185 μs … 776.407 μs  ┊ GC (min … max):  0.00% … 99.41%
 Time  (median):     2.310 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.747 μs ±  17.114 μs  ┊ GC (mean ± σ):  13.91% ±  2.22%

      ▃▄█▃▂                                                    
  ▂▂▄▆█████▆▆▄▅▄▅▄▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂ ▃
  2.19 μs         Histogram: frequency by time        3.14 μs <

 Memory estimate: 4.78 KiB, allocs estimate: 126.

Machines vary, so for comparison, here’s what happens when I run the “no Gen” code:

mylogpdf = x -> -.5 * x^2

@benchmark begin
nowAt = 0.1
M = 10
for i = 2:M
    nextMaybe = nowAt + (rand()-.5)/2 
    if mylogpdf(nextMaybe) - mylogpdf(nextMaybe) > log(rand())
        nowAt = nextMaybe
    end
end
end

Results:

BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.142 μs … 415.492 μs  ┊ GC (min … max): 0.00% … 99.66%
 Time  (median):     1.179 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.244 μs ±   4.144 μs  ┊ GC (mean ± σ):  3.33% ±  1.00%

  ▆█▇▅▅▇▅▅▁▁▃▇▆▇▄▅▂▃▁▂                                        ▂
  ████████████████████▇█▇█▇▇▆█▅█▅█▅▇▅█▄▆▅▅▄▇▄▆▄▆▄▇▅▇▇█▇▇▅▆▅▅▅ █
  1.14 μs      Histogram: log(frequency) by time      1.51 μs <

 Memory estimate: 864 bytes, allocs estimate: 54.

So, there’s still a bit of overhead, but it’s manageable – around 2x.

For more complex models, Gen’s static modeling language automatically applies optimizations that should make it faster than naive hand-coded implementations. Check out, e.g., Table 2 in our paper: https://dl.acm.org/doi/pdf/10.1145/3314221.3314642

As for your question on how the overhead scales with the algorithm you’re implementing: I expect Gen’s dynamic DSL to have constant-factor overhead compared to naive handcoded implementations. So the high overhead you’re seeing will still be there, but the relative overhead should stay roughly constant as the model grows rather than blow up. By “naive handcoded implementation,” I mean, e.g., that at each iteration of MH you evaluate the entire logpdf function and don’t do any optimizations to quickly evaluate the change in logpdf. (For Gen to automate those optimizations, you need to use the static DSL.)
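To make the “entire logpdf” versus “change in logpdf” distinction concrete, here is a minimal plain-Julia sketch. It is not Gen code and is not meant to describe how Gen implements this internally; the target (N independent standard normals) and the names `full_logpdf`, `delta_logpdf`, `naive_step!`, and `incremental_step!` are all hypothetical, chosen only for illustration.

```julia
# Hypothetical target: N independent standard normals (unnormalized log-density).
full_logpdf(xs) = -0.5 * sum(abs2, xs)

# Change in log-density when one coordinate moves from `old` to `proposed`.
delta_logpdf(old, proposed) = -0.5 * (proposed^2 - old^2)

# Naive single-site MH step: re-evaluates the whole log-density, O(N) per step.
function naive_step!(xs, i, d)
    old = xs[i]
    before = full_logpdf(xs)
    xs[i] = old + d * (2rand() - 1)   # symmetric uniform proposal, half-width d
    after = full_logpdf(xs)
    if log(rand()) >= after - before  # rejected: restore the old value
        xs[i] = old
    end
    return xs
end

# Incremental step: only the term involving xs[i] changes, O(1) per step.
function incremental_step!(xs, i, d)
    old = xs[i]
    proposed = old + d * (2rand() - 1)
    if log(rand()) < delta_logpdf(old, proposed)
        xs[i] = proposed
    end
    return xs
end

# Example usage: update randomly chosen coordinates of a 1000-dimensional state.
xs = zeros(1_000)
for t in 1:10_000
    incremental_step!(xs, rand(1:length(xs)), 0.25)
end
```

Both functions target the same distribution; the second just avoids recomputing the terms that did not change.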

McCoy · July 10, 2022, 2:37pm

Doesn’t logpdf(nextMaybe) - logpdf(nextMaybe) > log(rand()) always accept?

@Alex_Lew / @ptotolo It seems that, in both posts, the second nextMaybe should be nowAt.

Second, shouldn’t the handcoded (no Gen) implementation include correction terms for the proposal? Otherwise, I don’t think the comparison is strictly valid – one has to churn through more instructions than the other.

Edit: the sampling seems fine, I still think the issue is with evaluating accept/reject.

Edit2: Oh yes, 1 / (2d) cancels in the kernel term for both sides…

Right, so maybe just change the second nextMaybe to nowAt to correctly compute the P / P term.
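For reference, a sketch of the handcoded sampler with that one change applied. The function name `rwmh` and its keyword argument `d` are hypothetical, introduced only to wrap the loop from the earlier posts; the accept test now compares the proposed state against the current one, and the uniform proposal density 1 / (2d) is omitted because it is identical in both directions.

```julia
# Unnormalized log-density of the standard normal target, as in the posts above.
mylogpdf(x) = -0.5 * x^2

# Random-walk MH with a symmetric Uniform(x - d, x + d) proposal.
# `rwmh` is a hypothetical helper name, not part of Gen.
function rwmh(x0, M; d = 0.25)
    chain = Vector{Float64}(undef, M)
    chain[1] = x0
    nowAt = x0
    for i in 2:M
        nextMaybe = nowAt + d * (2rand() - 1)
        # Compare the proposed state against the *current* state.
        # The proposal density 1/(2d) cancels, so no correction term is needed.
        if mylogpdf(nextMaybe) - mylogpdf(nowAt) > log(rand())
            nowAt = nextMaybe
        end
        chain[i] = nowAt
    end
    return chain
end

chain = rwmh(0.1, 10)
```

The timings quoted earlier in the thread were measured with the original comparison, so they would need to be re-run against this version before drawing conclusions about the overhead.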
