Speed up very elemental functions

Hey, I am still kind of new to Julia and I have a question about the performance and potential speed-up of functions that are very elementary and/or “atomic” in my code. They are used often, so I figured that speeding them up would lead to a nice global speedup.

One of these functions would look like this:

function energies_spin(
	kx		::Float64,
	ky		::Float64,
	t		::Float64,
	t2		::Float64,
	t3		::Float64,
	mu		::Float64,
	phase	::Float64,
	o 		::Int64,
	)		::Float64

	s=-2*o+3
	return -t* (2*(2*cos(ky*sqrt(3)/2)*cos(1.5*kx+phase*s) + cos(ky*sqrt(3)+phase*s))) -mu
end

A check with BenchmarkTools for some relevant but kind of random arguments leads to:

julia> @btime energies_spin(1.0,2.0,1.0,0.0,0.0,1.9,0.0,1)
26.432 ns (0 allocations: 0 bytes)
0.042315672676560556

This is of course not a terribly long runtime, but I am still a bit confused, since the right-hand side of the equation consists of very elementary and simple operations, and I would expect that this could be done somewhat faster. Are there any ways to improve this?

It seems it is already very fast.

  • What sort of speedup do you think is possible in this code?
  • How long does it take to compute a cosine on your machine?

In my case one cosine takes 7 ns, so the function call taking 17 ns already seems very fast.
Maybe you can use some trigonometric trickery to improve this, but it does not seem obvious to me.

julia> function energies_spin(
               kx                ::Float64,
               ky                ::Float64,
               t                ::Float64,
               t2                ::Float64,
               t3                ::Float64,
               mu                ::Float64,
               phase        ::Float64,
               o                 ::Int64,
               )                ::Float64

               s=-2*o+3
               return -t* (2*(2*cos(ky*sqrt(3)/2)*cos(1.5*kx+phase*s) + cos(ky*sqrt(3)+phase*s))) -mu
       end
energies_spin (generic function with 1 method)

julia> x = 23
23

julia> @benchmark cos($x)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  6.750 ns … 11.958 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.042 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.041 ns ±  0.174 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃  ▆  ▃  ▃ ▂  ▄  █  ▄ ▃  ▃  ▁  ▄ ▃  ▁  ▂  ▁              ▂
  ▅▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁▇▁▁█▁▆ █
  6.75 ns      Histogram: log(frequency) by time     7.62 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @btime energies_spin($1.0,$2.0,$1.0,$0.0,$0.0,$1.9,$0.0,$1)
  16.950 ns (0 allocations: 0 bytes)
-0.9061275231652673

The function itself looks pretty basic and hard to squeeze further. But much of the time can go into memory movement, and the memory movement here is outside the function, when collecting the parameters to call the function with.
If you give us more information about the structure of the inputs to this function, maybe a few memory accesses can be saved, or a cosine calculation or two (see the sketch after these questions).
For example, do kx, ky form regular grids, or have some other structure?
What is a trace of some exact parameter values used?
How important is the accuracy? (Is downgrading to Float32 reasonable?)
etc.
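
To illustrate the point about saving cosine calls, here is a purely hypothetical sketch: assuming kx and ky do come from a regular grid while t, mu, phase, o stay fixed, two of the three cosines depend only on ky and can be hoisted out of the inner loop (the name energies_grid! and its arguments are made up for the example):

function energies_grid!(out, kxs, kys, t, mu, phase, o)
    s  = -2*o + 3
    ps = phase*s
    for (j, ky) in enumerate(kys)
        cy1 = 2*cos(ky*sqrt(3)/2)      # depends only on ky
        cy2 = cos(ky*sqrt(3) + ps)     # depends only on ky
        for (i, kx) in enumerate(kxs)  # only one cos left per grid point
            out[i, j] = -t*(2*(cy1*cos(1.5*kx + ps) + cy2)) - mu
        end
    end
    return out
end

Here out would be a matrix of size length(kxs) × length(kys), e.g. Matrix{Float64}(undef, length(kxs), length(kys)).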

Note that all these type annotations do nothing for performance (use them for clarity if you want, but they make the code much more bloated).

But as mentioned above, the most important thing is the context in which these operations are being performed; that is where you can get the important speedups.


You can try some fma() (or muladd).
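
For example, here is a minimal sketch of the same expression written with muladd, which lets the compiler emit fused multiply-adds where the hardware supports them (the name energies_spin_fma is made up, and whether this actually helps depends on the CPU):

function energies_spin_fma(kx, ky, t, t2, t3, mu, phase, o)
    s  = -2*o + 3
    ps = phase*s
    a = cos(ky*sqrt(3)/2)
    b = cos(muladd(1.5, kx, ps))      # 1.5*kx + phase*s
    c = cos(muladd(ky, sqrt(3), ps))  # ky*sqrt(3) + phase*s
    return muladd(-2*t, muladd(2*a, b, c), -mu)  # -t*(2*(2*a*b + c)) - mu
end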

If you’re evaluating this function over many parameters on a grid (or anywhere without loop dependencies), you can try vmap from LoopVectorization.jl:

julia> @btime map(V) do v # built-in `map`
           energies_spin(v, 2.0,1.0,0.0,0.0,1.9,0.0,1)
       end setup=(V = rand(1000));
  19.800 μs (1 allocation: 7.94 KiB)

julia> @btime vmap(V) do v # LoopVectorization
           energies_spin(v, 2.0,1.0,0.0,0.0,1.9,0.0,1)
       end setup=(V = rand(1000))
  2.278 μs (1 allocation: 7.94 KiB)

julia> @btime vmapntt!(out, V) do v # multithreaded, preallocated output
           energies_spin(v, 2.0,1.0,0.0,0.0,1.9,0.0,1)
       end setup=(V = rand(1000); out = zeros(1000));
  576.923 ns (0 allocations: 0 bytes)

Note that you’ll need to remove the argument and return types for LoopVectorization to work, since it uses a special vectorization-compatible packed type for numbers.
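
For reference, the annotation-free version is simply the same body with the declarations dropped:

using LoopVectorization  # provides vmap, vmapntt!, @turbo, @tturbo

function energies_spin(kx, ky, t, t2, t3, mu, phase, o)
    s = -2*o + 3
    return -t*(2*(2*cos(ky*sqrt(3)/2)*cos(1.5*kx + phase*s) + cos(ky*sqrt(3) + phase*s))) - mu
end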


You can also try @tturbo or @turbo:

@tturbo out .= energies_spin.(...

It seems to be a bit slower than vmap/vmapntt! for some reason, but it is still a dramatic speedup (20-30x on my 6-core machine).
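
Spelled out for the same 1000-point setup as the vmap benchmarks above (a sketch, assuming the annotation-free energies_spin from the previous reply):

using LoopVectorization

V   = rand(1000)
out = similar(V)

# threaded, vectorized broadcast over V with the remaining arguments fixed
@tturbo out .= energies_spin.(V, 2.0, 1.0, 0.0, 0.0, 1.9, 0.0, 1)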