Hey, I am still fairly new to Julia and I have a question about the performance of functions that are very elementary or "atomic" in my code. They are called often, so I figured that speeding them up would lead to a nice global speedup.

One of these functions would look like this:

function energies_spin(
    kx::Float64,
    ky::Float64,
    t::Float64,
    t2::Float64,
    t3::Float64,
    mu::Float64,
    phase::Float64,
    o::Int64,
)::Float64
    s = -2 * o + 3
    return -t * (2 * (2 * cos(ky * sqrt(3) / 2) * cos(1.5 * kx + phase * s) + cos(ky * sqrt(3) + phase * s))) - mu
end

A check with BenchmarkTools for some relevant but kind of random arguments leads to:

This is of course not a terribly long runtime, but I am still a bit confused: the right-hand side of the equation consists of very elementary and simple operations, and I would expect it to be somewhat faster. Are there any ways to improve this?

What sort of speedup do you think is possible in this code?

How long does it take to compute a cosine on your machine?

In my case one cosine takes 7 ns, so the function call taking 17 ns already seems very fast.
Maybe some trigonometric trickery could improve on this, but it does not seem obvious to me.

julia> function energies_spin(
           kx::Float64,
           ky::Float64,
           t::Float64,
           t2::Float64,
           t3::Float64,
           mu::Float64,
           phase::Float64,
           o::Int64,
       )::Float64
           s = -2 * o + 3
           return -t * (2 * (2 * cos(ky * sqrt(3) / 2) * cos(1.5 * kx + phase * s) + cos(ky * sqrt(3) + phase * s))) - mu
       end
energies_spin (generic function with 1 method)
julia> x = 23
23
julia> @benchmark cos($x)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  6.750 ns … 11.958 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.042 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.041 ns ±  0.174 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃ ▆ ▃ ▃ ▂ ▄ █ ▄ ▃ ▃ ▁ ▄ ▃ ▁ ▂ ▁                           ▂
  ▅▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁█▁▁█▁█▁▁█▁▁▇▁▁█▁▆ █
  6.75 ns      Histogram: log(frequency) by time      7.62 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @btime energies_spin($1.0,$2.0,$1.0,$0.0,$0.0,$1.9,$0.0,$1)
16.950 ns (0 allocations: 0 bytes)
-0.9061275231652673
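For what it's worth, here is a sketch of one such trigonometric rearrangement (my own, not measured to be a win). It assumes phase*s is fixed across calls so sincos(phase*s) can be hoisted out by the caller, and it drops the unused t2/t3 arguments:

```julia
# Sketch: reuse cos(ky*sqrt(3)/2) for cos(sqrt(3)*ky + phase*s) via the
# double-angle identity cos(2u + v) = (2cos(u)^2 - 1)cos(v) - 2sin(u)cos(u)sin(v),
# with u = ky*sqrt(3)/2 and v = phase*s. This trades three cos calls for
# two sincos calls, given sv, cv = sincos(phase * s) precomputed by the caller.
function energies_spin_trig(kx, ky, t, mu, sv, cv)
    su, cu = sincos(ky * sqrt(3) / 2)
    sx, cx = sincos(1.5 * kx)
    ckxv = cx * cv - sx * sv                       # cos(1.5*kx + phase*s)
    c2uv = (2 * cu^2 - 1) * cv - 2 * su * cu * sv  # cos(sqrt(3)*ky + phase*s)
    return -t * (2 * (2 * cu * ckxv + c2uv)) - mu
end
```

Whether two sincos calls beat three cos calls depends on the libm in use, so this would need benchmarking.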

The function itself looks pretty basic and hard to squeeze further. But much of the time can go into memory movement, and here that happens outside the function, when collecting the parameters to call it with.
If you give us more information about the structure of the inputs to this function, maybe a few memory accesses, or a cos calculation or two, can be saved.
For example, do kx and ky form regular grids, or have some other structure?
What does a trace of some exact parameter values look like?
How important is accuracy? (Would downgrading to Float32 be reasonable?)
etc.
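To illustrate why the grid question matters: each of the three cosines depends on only one of kx, ky, so on a regular grid they separate by axis and can be precomputed once per row/column. A sketch (the name energies_spin_grid is mine, and it drops the unused t2/t3 arguments):

```julia
# Sketch, assuming kx and ky lie on a regular grid: an N×M grid then needs
# O(N + M) cos calls instead of 3*N*M, since each cosine is a per-axis factor.
function energies_spin_grid(kxs, kys, t, mu, phase, o)
    s = -2 * o + 3
    ckx = [cos(1.5 * kx + phase * s) for kx in kxs]      # depends only on kx
    aky = [cos(ky * sqrt(3) / 2) for ky in kys]          # depends only on ky
    bky = [cos(ky * sqrt(3) + phase * s) for ky in kys]  # depends only on ky
    # -t*(2*(2*A*C + B)) - mu  ==  -t*(4*A*C + 2*B) - mu
    return [-t * (4 * aky[j] * ckx[i] + 2 * bky[j]) - mu
            for i in eachindex(kxs), j in eachindex(kys)]
end
```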

If you’re evaluating this function over many parameters on a grid (or anywhere without loop dependencies), you can try vmap from LoopVectorization.jl:

julia> @btime map(V) do v  # built-in `map`
           energies_spin(v, 2.0, 1.0, 0.0, 0.0, 1.9, 0.0, 1)
       end setup=(V = rand(1000));
  19.800 μs (1 allocation: 7.94 KiB)

julia> @btime vmap(V) do v  # LoopVectorization
           energies_spin(v, 2.0, 1.0, 0.0, 0.0, 1.9, 0.0, 1)
       end setup=(V = rand(1000));
  2.278 μs (1 allocation: 7.94 KiB)

julia> @btime vmapntt!(out, V) do v  # multithreaded, preallocated output
           energies_spin(v, 2.0, 1.0, 0.0, 0.0, 1.9, 0.0, 1)
       end setup=(V = rand(1000); out = zeros(1000));
  576.923 ns (0 allocations: 0 bytes)

Note that you’ll need to remove the argument and return type annotations for LoopVectorization to work, since it passes a special vectorization-compatible packed number type rather than Float64.
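For concreteness, the annotation-free variant could look like this (the name energies_spin_v is my own; the body is unchanged):

```julia
# Same body as energies_spin, but with the Float64/Int64 annotations removed
# so vmap/vmapntt! can substitute their packed SIMD number types.
function energies_spin_v(kx, ky, t, t2, t3, mu, phase, o)
    s = -2 * o + 3
    return -t * (2 * (2 * cos(ky * sqrt(3) / 2) * cos(1.5 * kx + phase * s) +
                      cos(ky * sqrt(3) + phase * s))) - mu
end
```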