Function calls in global scope, benchmarking, etc

I recently asked a Stack Overflow question about interpolating variables into benchmarking expressions. But I still have some lingering confusion about the performance impact of passing global variables as arguments to functions. The Performance Tips section of the Julia manual says:

A global variable might have its value, and therefore its type, change at any point. This makes it difficult for the compiler to optimize code using global variables. Variables should be local, or passed as arguments to functions, whenever possible.

So it seems like I should be fine performance-wise if I’m passing a non-constant global variable as an argument to a function. After all, how am I ever supposed to do anything if I can’t call functions at the top level? However, the answer posted to the Stack Overflow question linked above includes the following example:

julia> x = 0.5; # non-constant global

julia> @btime sin(x);
  20.106 ns (1 allocation: 16 bytes)

julia> @btime sin($x);
  5.413 ns (0 allocations: 0 bytes)

This seems to indicate that performance is impacted even when passing a global as an argument to a function! Or maybe it’s just an idiosyncrasy of the black-box benchmarking macro that requires the interpolation in order to properly set up the benchmark?

In other words, suppose I run the following in the REPL:

julia> x = 0.5;

julia> sin(x);

Will the run-time for the sin(x) call be 20 ns or 5 ns?

The goal of BenchmarkTools is to assess the runtime of a function as though it were written that way inside another function. The trouble is that when you write sin(x), it’s not clear whether x is a global (and slow; see the very first bullet of the performance tips) or a local variable. The special $ flag is how BenchmarkTools distinguishes these two cases. The same goes when you write sin(0.5): that 0.5 could be a hard-coded constant (and eligible for constant propagation) or an arbitrary value standing in for a variable.

As far as what the cost will be when running from the REPL: if you can see the difference between 20 ns and 5 ns, I’ve got a bridge to sell you. The true latency there is impacted more by the time it takes to hit enter, parse, eval, and print than by the actual computation.

Sure, 20 ns is nothing, but if I run

julia> x = rand(10^9);  # one billion Float64s

julia> sin.(x);

then I’ll easily be able to observe a 4x difference in performance. A semi-realistic workflow could look like the following:

julia> my_analysis(x) = 2 .* sin.(x)

julia> x = read_huge_data_vector("/path/to/data/file")  # x is a vector of numbers

julia> my_analysis(x)
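For concreteness, the same workflow can be benchmarked both ways (a sketch; rand(10^6) stands in for the hypothetical read_huge_data_vector):

```julia
using BenchmarkTools

my_analysis(x) = 2 .* sin.(x)

# rand(10^6) stands in for the hypothetical read_huge_data_vector
x = rand(10^6)

# Interpolating $x benchmarks the call as if x were a local variable:
@btime my_analysis($x);

# Without $, each benchmark sample also pays one dynamic-dispatch lookup,
# a constant cost that is invisible next to a million sin evaluations:
@btime my_analysis(x);
```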

So, it sounds like you’re saying that the @btime sin($x) benchmark shows that within my_analysis, each call to sin will take ~5 ns. And yet, I’m not sure how this translates to the total runtime of my_analysis, since my_analysis is passed a non-constant global variable. And since it’s a very large variable, I would certainly notice a 4x difference in performance.

Is the idea here that most of the computationally intensive number crunching is performed by a few kernel functions that are typically nested a few levels deep in the functions that you actually call at the top level?

My (possibly misguided) understanding was that in the following example,

x = 3
foo(x) = 2x

calling foo(x) would compile a specialized method of foo for the signature (::Int64,). Then, when you call foo on a global variable that happens to be an Int64, it should still be fast, because it will just call the already-compiled method for Int64. In other words, foo should run on an Int64 in the same amount of time whether it’s called at the top level or inside a function, because either way it’s using the same compiled code for the (::Int64,) method. Is there something wrong with my reasoning here?

Ah, I see where the disconnect is. The 4x difference you’re seeing in sin isn’t really a 4x multiple; it’s a roughly constant 15 ns. In that 15 ns, Julia looks up the method table for sin, figures out which method to dispatch to, and then calls the appropriate sin(::Float64). The remaining 5 ns is spent actually computing the sine. If you have a heavy-duty computation, calling it from global scope isn’t going to make it run 4x slower. It’ll make it run 15 ns slower. The time it takes to do dynamic dispatch doesn’t depend on the size of the arguments or the total computation time; it depends mainly on the number of methods of the function you’re calling, and it will be a very small constant overhead.
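A quick way to see that the overhead is additive rather than multiplicative (a sketch, using a made-up heavy function over 10^5 elements):

```julia
using BenchmarkTools

# Made-up "heavy" kernel: 10^5 sine evaluations per call
heavy(v) = sum(sin, v)

v = rand(10^5)

@btime heavy(v);   # non-constant global: pays one dispatch lookup per call
@btime heavy($v);  # interpolated: behaves like a local variable

# Both report essentially the same time: the ~15 ns dispatch cost is
# a constant added once per call, so it vanishes into the total.
```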


Great, thanks, that helps a lot! This is a question I’ve had in the back of my head for a long time. Finally I can rest easy. :)

A further complication is the use of a constant value as the benchmark input in @btime sin($x). The ‘pseudo-interpolation’ (which, by the way, is specific to the BenchmarkTools macros) makes x be treated as a compile-time constant, so constant propagation may occur, altering the result. You can see that, for example, with

julia> x = 2; @btime 2 * $x;
  0.033 ns (0 allocations: 0 bytes)

julia> x = Ref(2); @btime 2 * $x[];
  1.548 ns (0 allocations: 0 bytes)

The first time is less than a CPU clock cycle, so literally no work is being done, as the result 4 is just computed at compile time. This is also the case with sin. Compare:

julia> x = 0.5; @btime sin($x); #1
  3.933 ns (0 allocations: 0 bytes)

julia> @btime sin(0.5); #2
  1.502 ns (0 allocations: 0 bytes)

julia> @btime sin(3.5); #3
  0.033 ns (0 allocations: 0 bytes)

julia> x = Ref(3.5); @btime sin($x[]); #4
  5.752 ns (0 allocations: 0 bytes)

julia> @btime sin(x) setup = (x = rand()); #5
  3.839 ns (0 allocations: 0 bytes)

julia> @btime sin(rand()); # 6
  8.266 ns (0 allocations: 0 bytes)

julia> N = 100_000; x = (4 * pi) .* (rand(N) .- 0.5);

julia> b = @benchmark begin #7
           @inbounds for i in 1 : $N
               sin($x[i])
           end
       end;

julia> minimum(b).time / N # ns
13.5881

Seven dubious ways to benchmark sin.

I don’t fully understand the difference between 1 and 2. However:

  • I think the peculiar difference between 2 and 3 is probably because the particular path that’s being taken through the if statements in the implementation of sin for x = 3.5 is easier for the compiler to statically analyze than the one for x = 0.5, and so the compiler is able to constant-fold / constant-propagate the whole computation.
  • The Ref used in 4 prevents the compiler from seeing the value of x as a constant. However, modern CPUs are very good at branch prediction, so after a while the particular path through the if statements in sin will be predicted correctly every time, resulting in a possible speedup.
  • In 5, the setup argument is used to produce random numbers to pass into sin, but only every few evaluations, so branch prediction may still be an issue. In fact, since rand produces numbers between 0 and 1, it is likely that the same branch is being taken every time anyway (and probably the same one as for sin(0.5), hence the similar result).
  • In 6, a new random number is computed at every evaluation, but now you’re also measuring the time spent in rand.
  • In 7, we’re iterating over 100_000 precomputed inputs between -2 * pi and 2 * pi, so we’re hitting more of the branches in sin, without counting the time for random number generation. Increasing N above 100_000 doesn’t appreciably change the result, so we can be fairly certain that the branch predictor is defeated to a certain extent. However, the compiler could use more SIMD instructions due to the for loop (an @noinline function barrier could be used to shield against this). Also, x is still iterated over multiple times (BenchmarkTools runs multiple samples), and it fits into L3 cache (not L2 or L1). Decreasing N so that x fits into L2 or L1 cache can change the result significantly.
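The @noinline function barrier mentioned above could look like the following (a sketch of variant 7 with the barrier added; kernel is a hypothetical wrapper name):

```julia
using BenchmarkTools

# Hypothetical @noinline barrier: the compiler can no longer vectorize
# or constant-fold the sin calls across loop iterations.
@noinline kernel(x) = sin(x)

N = 100_000
x = (4 * pi) .* (rand(N) .- 0.5)

b = @benchmark begin
        @inbounds for i in 1:$N
            kernel($x[i])
        end
    end

minimum(b).time / N  # ns per sin call, shielded behind the barrier
```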

See also this very interesting topic about branch prediction: PSA: Microbenchmarks remember branch history.

So an important question is: which scenario is closest to your use case?


Yikes. I think the moral of the story for me is that I shouldn’t worry too much about small-ish differences in benchmark times; otherwise I’ll end up going down a very deep rabbit hole.
