Metaprogramming to automatically test new versions of code

In writing my code, I often find that I wish to test whether a certain function, say f, can be made quicker. One way of doing this is to find a place within my code where the function gets used and run both f and some f_quick on the data at that place in the code, where f_quick is hopefully quicker. This effectively boils down to running RunCheck(N) in the MWE below.

However, this means I’ll need to manually track occurrences of f in my code and modify the code there to run the check. A trick I’m using now is the following, which I’ll write up for a function named g to keep the MWE sensible. I rename the original function g to g_slow and the potential new version to g_quick. I then write a function g that runs the checks automatically and either errors after the first occurrence (first version below) or continues if the outputs of g_slow and g_quick agree (second version below).

This works reasonably well for development, but I find myself in the situation where I want to use this workflow for various functions g, each of which may have different numbers of arguments. So, I was wondering if there is a meta-programming way that, as soon as it finds a function name that ends in _slow, e.g. g_slow, writes a function g that accepts the same inputs and runs the checks/benchmarks I wish.

I hope the below MWE indicates what I wish:

using BenchmarkTools, Random
Random.seed!(42)

# Suppose I wrote this simple function and use it throughout my code
f(x, y) = all(x .< y)

# I may think that the following is quicker
f_quick(x, y) = all(xi < yi for (xi, yi) in zip(x, y))

# Manual code for checking: 
function RunCheck(N)
    x = rand(N)
    y = rand(N)
    display(f(x, y) == f_quick(x, y)) # To be certain of this, you'd need to repeat the sampling, of course
    display(@benchmark f($x, $y))
    display(@benchmark f_quick($x, $y))
    return nothing
end

# Instead, I wish to check the performance of the two functions "in context" of my code 
# g(x,y) = all(x .< y) # This is the function I currently use. I rename it to g_slow and write a new function g_quick
g_slow(x, y) = all(x .< y)
g_quick(x, y) = all(xi < yi for (xi, yi) in zip(x, y))
# Now, we define a new g: 
function g(x,y)
    display(g_slow(x, y) == g_quick(x, y))
    display(@benchmark g_slow($x, $y))
    display(@benchmark g_quick($x, $y))
    error()
end
# or a version that can be tried at every occurrence
function g(x,y)
    x_slow = g_slow(x, y)
    x_quick = g_quick(x, y)
    display(x_slow == x_quick)
    display("Testing: ")
    display(@benchmark g_slow($x, $y))
    display(@benchmark g_quick($x, $y))
    if x_slow == x_quick
        return x_slow
    else
        display(x_slow)
        display(x_quick)
        error("Mismatch!")
    end
end

function CheckInCode(N)
    x = rand(N)
    y = rand(N)
    z = g(x, y)
    return z
end

RunCheck(1000)
CheckInCode(1000)

gives

true
BenchmarkTools.Trial: 10000 samples with 506 evaluations.
 Range (min … max):  219.121 ns …   5.485 ΞΌs  β”Š GC (min … max): 0.00% … 94.74%
 Time  (median):     226.862 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   242.263 ns Β± 132.021 ns  β”Š GC (mean Β± Οƒ):  4.77% Β±  7.98%

  β–…β–β–ˆβ–†β–„β–…β–ƒβ–‚β–‚β–β–β–β–     ▁                                           β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–†β–…β–…β–…β–…β–…β–„β–ƒβ–…β–…β–ƒβ–β–β–ƒβ–ƒβ–ƒβ–ƒβ–β–ƒβ–β–ƒβ–β–ƒβ–„β–β–β–β–β–β–ƒβ–ƒβ–β–β–ƒβ–β–ƒ β–ˆ
  219 ns        Histogram: log(frequency) by time        388 ns <

 Memory estimate: 224 bytes, allocs estimate: 3.
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.791 ns … 27.416 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.917 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.944 ns Β±  0.383 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–† β–‡ β–ˆ β–ˆ β–ƒ β–ƒ                                               β–‚
  β–ˆβ–ˆβ–β–ˆβ–β–ˆβ–β–ˆβ–β–ˆβ–β–ˆβ–β–‡β–β–ˆβ–β–†β–β–„β–β–…β–β–…β–β–ƒβ–β–…β–β–ƒβ–„β–β–ƒβ–β–„β–β–„β–β–ƒβ–β–ƒβ–β–„β–β–„β–β–β–β–ƒβ–β–…β–β–‡β–β–…β–β–†β–† β–ˆ
  2.79 ns      Histogram: log(frequency) by time     4.04 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
true
"Testing: "
BenchmarkTools.Trial: 10000 samples with 501 evaluations.
 Range (min … max):  221.058 ns …   8.514 ΞΌs  β”Š GC (min … max): 0.00% … 96.85%
 Time  (median):     228.958 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   241.355 ns Β± 150.085 ns  β”Š GC (mean Β± Οƒ):  3.70% Β±  6.04%

  β–ƒβ–…   β–†β–ˆβ–„β–β–β–‚β–„β–„β–‚β–β–β–                                             ▁
  β–ˆβ–ˆβ–ˆβ–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–‡β–‡β–‡β–‡β–‡β–…β–†β–†β–…β–†β–†β–†β–‡β–‡β–‡β–†β–†β–†β–†β–‡β–‡β–†β–…β–‡β–†β–†β–…β–…β–„β–…β–…β–„β–„β–…β–†β–ƒβ–„ β–ˆ
  221 ns        Histogram: log(frequency) by time        295 ns <

 Memory estimate: 224 bytes, allocs estimate: 3.
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.250 ns … 42.750 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.375 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.430 ns Β±  0.907 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  ▁ β–‡ β–‡ β–ˆ β–…  β–‚ β–‚ ▁                              β–‚      β–‚     β–‚
  β–ˆβ–β–ˆβ–β–ˆβ–β–ˆβ–β–ˆβ–β–β–ˆβ–β–ˆβ–β–ˆβ–β–†β–β–β–‡β–β–‡β–β–†β–β–‡β–β–†β–β–β–†β–β–†β–β–†β–β–…β–β–β–†β–β–…β–β–‡β–β–ˆβ–β–β–ˆβ–β–‡β–β–ˆβ–β–ˆβ–β–‡ β–ˆ
  2.25 ns      Histogram: log(frequency) by time     3.33 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
false

I’m not confident I understood exactly what you’re aiming for, but if you’re @benchmark-ing isolated function calls, you can feed the call inputs into a single higher-order function that runs the comparative benchmark, rather than metaprogramming a separate benchmarking function for each specially named pair of functions. What you’re describing would work, but it’s less ergonomic and makes no difference to the @benchmark itself. Bear in mind that @benchmark-ing a call at global scope is not the same as a call nested inside another function, where the compiler may apply further optimizations after inlining, but I’d expect an improvement to usually carry over.
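A minimal sketch of such a higher-order helper, assuming the f/f_quick pair from your MWE (compare is a hypothetical name, not an existing API):

using BenchmarkTools

# Hypothetical helper: pass both implementations plus the call-site inputs,
# benchmark them on exactly those inputs, and get back the checked result.
function compare(f_old, f_new, args...)
    r_old = f_old(args...)
    r_new = f_new(args...)
    display(@benchmark $f_old($args...))
    display(@benchmark $f_new($args...))
    r_old == r_new || error("Mismatch between $f_old and $f_new")
    return r_old
end

# At a call site inside your code, replace f(x, y) with:
# compare(f, f_quick, x, y)

The same helper works for any arity, so nothing needs to be regenerated per function pair.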


Thank you! What I’m thinking is this: I have a big function, let’s call it F. Within this function, I do a lot of different things and call a lot of different functions. Suppose one of these functions, f, seems to be a bit lacking in performance. So, I wish to program a new version f_quick, and I would like to check if it performs better than f. There are two straightforward approaches that came to my mind:

  • Separately from F, run the functions f and f_quick on data that is somewhat representative of the data they will get when they are used inside the function F. This can be a bit of a hassle, especially if the input data are complicated objects (for me, loads of regression coefficients and some other more complicated structs)
  • At each place in F where I call f, benchmark f and f_quick. The problem here is that this involves a lot of manual work: I need to spot places where this happens, insert the relevant code, etc. Now, if I instead want to study a different function g, I need to repeat the process all over again.

So I figured, I might as well leverage the dispatch system:
I rename f to f_slow and write a new function f that, when called, first benchmarks f_slow and f_quick and then returns their output if they agree (and raises an error otherwise). This approach works relatively nicely. However, it still forces me to write a function f that does the benchmarking and testing. If I now wish to test a new function g, I need to repeat the steps: rename g to g_slow, write a new function g that runs g_slow and g_quick, etc.

If I understand it correctly, the approach that you suggest would have the same problem: I need to make changes within F (and possibly all function calls therein that use f), which I’d need to repeat if I want to study a different function g.

My question is basically: is there a meta-programming way that, as soon as it sees f_slow as a function, writes a function f that benchmarks f_slow and f_quick, checks that they agree, and returns their result (if they do)? I’d then hopefully only have to rename f to f_slow and write f_quick, and let metaprogramming + dispatch do the rest of the work!

It’d be even nicer if instead of printing the benchmarking, it would save all the results to a file, to allow for a more thorough analysis!

It’s not metaprogramming, but you could name the functions old_f and new_f, and then in a testing program do

module old
include("main_file.jl")
const f = old_f
end

module new
include("main_file.jl")
const f = new_f
end

@benchmark old.F()
@benchmark new.F()

I see, I hadn’t thought about that yet. But that would then also require logging everything and comparing logs, to make sure they actually produce the same results, right?

Yeah! Sorry, I thought you were looking just for performance benchmarks, not correctness testing.

If you also want to compare output, then you would need to come up with some system for doing that comparison.
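For the module-based setup above, a minimal sketch of such a comparison (assuming F is deterministic given the RNG seed, as in the MWE; the names are just examples):

using Random

# Re-seed before each call so both variants see identical random data,
# then check agreement before trusting any timings.
Random.seed!(42); res_old = old.F()
Random.seed!(42); res_new = new.F()
isequal(res_old, res_new) || error("old.F and new.F disagree")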


That’s exactly what BenchmarkTools.save is for. You can also cache the returned @benchmark object in your own way instead of display-ing it.
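For example, a hedged sketch using the MWE’s g_slow/g_quick (the file name is arbitrary):

using BenchmarkTools

x, y = rand(1000), rand(1000)

# Keep the Trial objects instead of display-ing them ...
t_slow  = @benchmark g_slow($x, $y)
t_quick = @benchmark g_quick($x, $y)

# ... and persist them for later analysis.
BenchmarkTools.save("g_results.json", t_slow, t_quick)

# In a later session, load returns the saved objects in order:
t_slow2, t_quick2 = BenchmarkTools.load("g_results.json")
judge(minimum(t_quick2), minimum(t_slow2))  # reports improvement / regression / invariant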

I might be failing to grasp something, but I don’t see how this helps. It just seems to complicate the code and impose a particular pattern of naming and edits. The naming could even be misleading; there’s no guarantee that f_quick would actually be faster than f_slow. I’d prefer to benchmark prototypes of performance-critical code in a separate project and move a clear front-runner into my main code. If I absolutely had to switch between multiple implementations in the main code, I’d use higher-order functions.

That said, what you’re saying is possible: you could generate your benchmarking code with a macro call like @makebenchmark f args, provided you already have functions f_slow and f_quick. However, I’d prefer being able to pass any two function names I wish rather than deriving them from f, and that would only make the macro call a bit longer.
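For illustration, a rough sketch of what such a macro could look like, sticking with the _slow/_quick naming from the thread (all names here, including @makebenchmark and _compare, are hypothetical):

using BenchmarkTools

# Runtime helper that does the actual checking and benchmarking.
function _compare(slow, quick, args...)
    r_slow  = slow(args...)
    r_quick = quick(args...)
    display(@benchmark $slow($args...))
    display(@benchmark $quick($args...))
    r_slow == r_quick || error("Mismatch between $slow and $quick")
    return r_slow
end

# @makebenchmark g x y assumes g_slow and g_quick already exist and expands to
# a checked, benchmarked call on x and y at this call site.
macro makebenchmark(name, args...)
    slow  = esc(Symbol(name, :_slow))
    quick = esc(Symbol(name, :_quick))
    return :(_compare($slow, $quick, $(map(esc, args)...)))
end

# Usage with the g_slow/g_quick from the MWE:
# z = @makebenchmark g x y

If you prefer the rename-and-dispatch workflow from the original post, the macro could instead define a method once, e.g. g(args...) = _compare(g_slow, g_quick, args...), rather than expanding at each call site.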

I see, thank you. I agree that being able to provide any two function names does indeed sound better. I have no experience with Julia metaprogramming; how hard would it be to write such a macro?

The general advice is to avoid metaprogramming if it is avoidable.

The suggestions above, or similar hacks, do not necessarily involve any metaprogramming, as I see it. It is possible that by adding some macro you could save a few keystrokes, but is it worth it? I’d expect your overall time saving to be substantial – and negative.

On the other hand, macro programming in Julia is not easy, and exactly for that reason it can be fun and a nice way of procrastination :wink:


I would say the primary consideration in deciding whether to write a macro is whether you would actually be generating good Julia code. The rule of thumb is that you almost never need to write a new macro: if you can write better code (less redundant, fewer names added to global scope, etc.) to serve the same purpose without metaprogramming, then don’t do it.

As for its difficulty, it’s a bit harder than writing ordinary source code. At minimum, you need to understand the structure of Expr and how to manipulate it. You can build an Expr using the more readable syntax of source code in quoted expressions and insert values with $-interpolation, though you may still have to mutate the resulting Expr directly afterward.

Once you figure that out, you have the option of eval-ing expressions into the global scope, generating them in a macro, or generating them in generated functions. Because macros run early, right after parsing, and can transform code in local scopes, you also have to learn macro hygiene. That’s the short version of what metaprogramming entails.
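To make the last two paragraphs concrete, here is a tiny, self-contained illustration (the names are arbitrary and not tied to any real code):

# Build an expression by interpolating symbols into a quoted block.
name  = :g
slow  = Symbol(name, :_slow)   # :g_slow
quick = Symbol(name, :_quick)  # :g_quick
ex = quote
    $name(args...) = ($slow(args...), $quick(args...))
end

dump(ex)     # inspect the nested Expr structure
# eval(ex)   # would define a method g(args...) in the current global scope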
