Speed up very elemental functions

Hey, I am still kind of new to Julia and I have a question about the performance and potential speed up of functions which are very elementar and/or “atomic” in my code, since they are used often and I figured that this would lead to a nice global speedup.

One of these functions would look like this:

function energies_spin(
	kx		::Float64,
	ky		::Float64,
	t		::Float64,
	t2		::Float64,
	t3		::Float64,
	mu		::Float64,
	phase	::Float64,
	o 		::Int64,
	)		::Float64

	return -t* (2*(2*cos(ky*sqrt(3)/2)*cos(1.5*kx+phase*s) + cos(ky*sqrt(3)+phase*s))) -mu

A check with BenchmarkTools for some relevant but kind of random arguments leads to:

julia> @btime energies_spin(1.0,2.0,1.0,0.0,0.0,1.9,0.0,1)
26.432 ns (0 allocations: 0 bytes)

This is of course not a terrible long runtime, but I am still a bit confused since the right hand side of the equations consist of very elemental and simple operations and I would expect that this can be done somewhat faster. Are there any ways to improve this?

It seems it is already very fast.

  • What sort of speedup do you think it is possible in this code?
  • How much does it take to compute a cosine in your machine?

In my case 1 cosine is 7 ns, so the function call taking 17 ns seems already very fast.
Maybe you can make some trigonometric trickery to improve this but it does not seem obvious to me.

julia> function energies_spin(
               kx                ::Float64,
               ky                ::Float64,
               t                ::Float64,
               t2                ::Float64,
               t3                ::Float64,
               mu                ::Float64,
               phase        ::Float64,
               o                 ::Int64,
               )                ::Float64

               return -t* (2*(2*cos(ky*sqrt(3)/2)*cos(1.5*kx+phase*s) + cos(ky*sqrt(3)+phase*s))) -mu
energies_spin (generic function with 1 method)

julia> x = 23

julia> @benchmark cos($x)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  6.750 ns … 11.958 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.042 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.041 ns ±  0.174 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃  ▆  ▃  ▃ ▂  ▄  █  ▄ ▃  ▃  ▁  ▄ ▃  ▁  ▂  ▁              ▂
  ▅▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁▇▁▁█▁▆ █
  6.75 ns      Histogram: log(frequency) by time     7.62 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @btime energies_spin($1.0,$2.0,$1.0,$0.0,$0.0,$1.9,$0.0,$1)
  16.950 ns (0 allocations: 0 bytes)

The function itself looks pretty basic and hard to squeeze further. But much of the time can go into memory movement. The memory movement here is outside the function, when collecting the parameters to run function with.
If you give us more information about the structure of inputs to this function, maybe a few memory accesses can be saved or a cos calculation or two.
For example, do the kx,ky form regular grids? and such structure.
What is a trace of some exact parameter values used?
How important is the accuracy? (downgrading to Float32 reasonable?)

Note that all these type annotations do nothing performance (use them if you want for clarity, but that makes the code much more bloated.

But as mentioned above, the most important is the context in which this operations are being performed, that is where you can get important speedups.


you can try some fma()

If you’re evaluating this function over many parameters on a grid (or anywhere without loop dependencies), you can try vmap from LoopVectorization.jl:

julia> @btime map(V) do v # built-in `map`
           energies_spin(v, 2.0,1.0,0.0,0.0,1.9,0.0,1)
       end setup=(V = rand(1000));
  19.800 μs (1 allocation: 7.94 KiB)

julia> @btime vmap(V) do v # LoopVectorization
           energies_spin(v, 2.0,1.0,0.0,0.0,1.9,0.0,1)
       end setup=(V = rand(1000))
  2.278 μs (1 allocation: 7.94 KiB)

julia> @btime vmapntt!(out, V) do v # multithreaded, preallocated output
           energies_spin(v, 2.0,1.0,0.0,0.0,1.9,0.0,1)
       end setup=(V = rand(1000); out = zeros(1000));
  576.923 ns (0 allocations: 0 bytes)

Note that you’ll need to remove the argument and return types for LoopVectorization to work, since it uses a special vectorization-compatible packed type for numbers.


You can also try @tturbo or @turbo:

@tturbo out .= energies_spin.(...

It seems to be a bit slower than vmap/vmapntt! for some reason. But still dramatic speedup (20-30x on my 6-core machine.)