The impact of different ways of passing parameters on the speed of function computation

I have a question, and I will use a simple polynomial function as an example to describe it.

If I define an ordinary polynomial function, direct calculation with literal values is very fast (0.8 ns):

f1(x,y)=x^3*y^3+2*x^2*y+3*x*y^2
@benchmark f1(1.0,2.0)

BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.791 ns … 12.333 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.875 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.874 ns ±  0.128 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                        █                     
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  0.791 ns       Histogram: frequency by time       0.917 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

However, when the values are passed through variables or extracted from an array, the calculation is much slower (17 ns and 51 ns):

a = 1.0
b = 2.0
@benchmark f1(a,b)

BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  16.366 ns … 36.281 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.577 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.706 ns ±  1.295 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▃▁▃▄▂▅██▅▃                                                ▂
  ▇████████████▇▇▅▅▅▆▆▅▆▄▄▄▄▄▆▅▆██▇▇▄▆▅▄▅▄▃▃▃▁▅▄▄▄▁▃▁▁▃▁▃▁▄▄▃ █
  16.4 ns      Histogram: log(frequency) by time      24.4 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

c = [1.0, 2.0]
@benchmark f1(c[1],c[2])

BenchmarkTools.Trial: 10000 samples with 990 evaluations.
 Range (min … max):  45.833 ns … 202.441 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.706 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.062 ns ±   5.727 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁█ ▃▂                                                
  ▄▂▁▁▁▁▁▂▃██▆██▄▃▃▂▃▃▃▃▂▂▂▁▁▁▁▂▂▂▃▂▂▁▁▁▁▁▁▂▂▂▁▁▁▁▁▁▁▁▁▁

So when I extract data from an array and pass it into this function to build a new function, the new function's computation is slow:

arr = [1.0, 1.0, 1.0, 1.0, 1.0]   # a (non-const) global array

function f2(y)
    res = 0.0
    for i = 1:5
        res += f1(arr[i], y)   # reads from the global array arr on each iteration
    end
    return res
end

@benchmark f2(2.0)

BenchmarkTools.Trial: 10000 samples with 362 evaluations.
 Range (min … max):  248.848 ns … 126.508 μs  ┊ GC (min … max): 0.00% … 99.75%
 Time  (median):     266.804 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   286.499 ns ±   1.264 μs  ┊ GC (mean ± σ):  4.88% ±  2.70%

       █                                                         
  ▃▄▂▂▂█▅▁▁▃▅█▇▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  249 ns           Histogram: frequency by time          341 ns <

 Memory estimate: 320 bytes, allocs estimate: 20.

You can see that f2, which loops 5 times, takes about 280 ns, roughly 5 times the 51 ns it takes to pass one pair of array elements to f1.

For the simple function above, the speed difference between the various ways of passing parameters may not matter much. However, for a very complex polynomial function, the difference in computational speed between direct calculation and passing parameters or extracting array elements can be extremely large (from about 1 ns to roughly 700 μs).

Therefore, when I loop over and extract array data with that complex function in order to define a new function, the speed is far from ideal.

My first question is about the main source of this speed difference. I suspect it is an optimization triggered when the function is passed constants directly, but I am not clear on the underlying mechanism or why the difference is so large.
My second question: since direct algebraic calculation is fast, is there an optimization that converts the array-element extraction in the loop defining f2 into a similarly direct algebraic calculation? Or is this fundamentally impossible to achieve?

I look forward to everyone’s answers and greatly appreciate it!

Partial answer:

When you see a non-trivial computation taking sub-nanosecond time, your first expectation should be that it has been pre-computed. Specifically, Julia uses constant propagation when it can do so:

julia> f1(x,y)=x^3*y^3+2*x^2*y+3*x*y^2
f1 (generic function with 1 method)

julia> f2() = f1(1.0, 2.0)
f2 (generic function with 1 method)

julia> @code_llvm f2()
;  @ REPL[10]:1 within `f2`
define double @julia_f2_191() #0 {
top:
  ret double 2.400000e+01
}

So all computation takes place during compilation and at runtime it just returns the answer.
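
For reference, the BenchmarkTools manual suggests interpolating the arguments through a Ref when you want to prevent this folding and time the actual arithmetic. A minimal sketch, assuming f1 as defined above:

using BenchmarkTools

# Wrapping the literals in Ref and interpolating hides their values from the
# compiler, so the polynomial is really evaluated at run time.
@btime f1($(Ref(1.0))[], $(Ref(2.0))[]);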

You need to interpolate global variables into the benchmarked expression, since accessing non-constant globals is otherwise slow:

julia> @btime f1(a, b); # non-interpolated globals
  15.406 ns (1 allocation: 16 bytes)

julia> @btime f1($1.0,$2.0);
  1.958 ns (0 allocations: 0 bytes)

julia> @btime f1($a, $b);
  1.959 ns (0 allocations: 0 bytes)

This is an artifact of benchmarking in global scope. When you actually use a function like f1(a,b) in practice, it will presumably be inside a function if you care about performance, in which case there is no issue with global variables.
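
For instance, a minimal sketch of my own (with a hypothetical name f2_local): pass the data in as arguments, so every access inside the loop is to a local, concretely typed variable rather than a global:

# arr and y are function arguments here, so the loop body works entirely with
# locals; compare with the original f2, which reads a non-const global array.
function f2_local(arr::Vector{Float64}, y::Float64)
    res = 0.0
    for i in eachindex(arr)
        res += f1(arr[i], y)
    end
    return res
end

@btime f2_local($arr, 2.0)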

With arguments taken from an array, you still need to interpolate for benchmarking if it is a global array:

julia> @btime f1(c[1],c[2]);   # non-interpolated global array = slow
  59.596 ns (3 allocations: 48 bytes)

julia> @btime f1($c[1],$c[2]);
  2.375 ns (0 allocations: 0 bytes)

But even when the array c is interpolated (removing the global-variable penalty), there is still a small amount of time required to fetch the elements from the array.
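
If you want to see that fetch cost by itself, one rough check (my own sketch, not from the original reply) is to time just the indexing next to the full call:

@btime getindex($c, 1);     # roughly the cost of fetching one element
@btime f1($c[1], $c[2]);    # two fetches plus the polynomial arithmetic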

As @GunnarFarneback pointed out above, f1(1.0,2.0) with constant arguments can be faster still because, when the arguments are constants, the compiler can sometimes inline the whole function and evaluate it (to 24.0 in this case) at compile time:

julia> @btime f1(1.0,2.0);
  1.125 ns (0 allocations: 0 bytes)

Thank you for your reply. I found what appear to be the relevant instructions in the documentation [Manual · BenchmarkTools.jl]:

Can I understand that the timings of f1(1.0,2.0) and f1($a,$b) are not the actual calculation speed of the function?

Hello! Thank you very much for your help! I understand that interpolating global variables saves the time spent reading the variables, but I want to know whether this result reflects the actual calculation time of the function:

julia> @btime f1($c[1],$c[2]);
  2.375 ns (0 allocations: 0 bytes)

Is this the actual computation speed of the function, or is it an artificial speed obtained during compilation, just like substituting constants?
Or, in other words, how can I determine the actual speed of an expression function when it is applied?

In this case the compiler won’t do constant propagation, but you are timing the cost of the function including the cost of fetching the elements from the array, which is different from the cost of the function if you have local variables.
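
One way to separate the two costs (a sketch of my own, using the setup keyword from the BenchmarkTools manual): do the array reads in setup, outside the timed expression, so the timing approximates the function called with local arguments:

# The setup block runs outside the timed region; x and y then behave like
# local variables inside the benchmarked call.
@btime f1(x, y) setup=(x = c[1]; y = c[2])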

In general, when you are talking about timing functions at the nanosecond level, the precise cost depends heavily on the context — the time for a sequence of calculations is not simply the sum of the times for them measured in isolation. Computers are complicated.

Thank you very much, I think I understand now. So for an expression function f(x):

@btime f(1)
# constant propagation;
# measures only the time taken to return the precomputed result (worthless)

@btime f($1) / @btime f($c[1])
# no constant propagation;
# measures the time to extract the elements and compute the function

If this understanding is correct, thank you all again for your help!

This depends on what you mean by “an expression function”. Constant propagation cannot happen for arbitrary functions, but is possible or likely (but not guaranteed) for some simple functions.

I mean ordinary mathematical polynomials without complex computations, just like the function I am using:

(long but simple)

Wow. I dunno, you’ll just have to try with @code_native f(1) or @code_llvm f(1)
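
One caveat worth adding (my note, not from the reply above): @code_llvm f(1) shows the code compiled for the argument type, so to check whether a call with a specific constant is folded it helps to wrap it in a zero-argument function, as in the earlier reply; g here is just a hypothetical wrapper name:

g() = f(1)       # hypothetical wrapper around the call with the constant argument
@code_llvm g()   # if the body just returns a literal, the call was folded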

I used @code_native in the way mentioned in the first post; does this mean that constant propagation happened?

Hm, not sure. Did you include the entire output? It looks like something was cut off at the end.

But it also looks like it is indeed getting constant propagated. Do the two numbers at the end,

double 7465...
double -60049...

agree with the numerical output from the function?

(BTW, it’s much more convenient if you copy and paste the text rather than posting a screenshot. See: Please read: make it easier to help you)

Yes, the two numbers agree with the numerical output:

f()
7.465609860530714e10 - 6.004930916410957e10im

Actually, the reason I initially suspected constant propagation is that there was a significant difference between the results of these two speed tests:

@btime pamp8(2.4,0.0,
    0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
    1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,
    0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0)
0.791 ns (0 allocations: 0 bytes)
7.465609860530714e10 - 6.004930916410957e10im
@btime pamp8($2.4,0.0,
    0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
    1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,
    0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0)
28.916 μs (0 allocations: 0 bytes)
7.465609860530714e10 - 6.004930916410957e10im

(Thanks for your advice! I tend to forget it sometimes…)