# Something like C++ static declaration for functions in Julia?

Hi everyone!

I was trying to implement a few simple algorithms in Julia, and now have a question about declaring functions static. Minimal example:

function myfun(u::Vector{Float64}, param::NTuple{3,Float64})
A1=#do stuff with u, param
A2=#do more stuff with u, param
return [A1,A2]
end

u0 = [0.0 ,0.0 ,0.0] #initialize vector
param = (1.0, 1.0, 1.0)
h = 0.01
for i = 1:iter
myfun(u0,param)
u0=u0+h
end


Is there a way, similarly to how it can be done in C++ to declare myfun static, to let the compiler know that it does not change, and does not need to be parsed every time? I believe for complicated myfun this could save a lot of memory allocation, however, I am not sure how to implement this in Julia. Perhaps wrapping function in struct could do the trick?

Thank you!

myfun is already compiled exactly once, no matter how may iterations your loop takes. That means that, as far as I can tell, Julia is already doing exactly what you’re asking for and there’s no need for you to do anything special It’s also possible that I’m not understanding exactly what your question means, particularly since there are multiple (largely unrelated) ways to use static in C++. Can you give a concrete example of what you’re trying to accomplish. Even better, do you have a specific example of Julia code that you think should be faster than it is?

7 Likes

I am trying to use RK4 algorithm for solving a system of programmatically generated equations. My code seems to allocate a lot of memory and I cant pinpoint the issue. MWE:

using StaticArrays, DelimitedFiles
@inline function Nodes(u,p::SVector{3,Float64})
A1=((9*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im
A2=((4*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im
A3=((1*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im
A4=((-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im
A5=((1*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im
A6=((4*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im
A7=((9*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im

fun=@SVector [A1,	A2,	A3,	A4,	A5,	A6,	A7]
return fun
end


These are equations to be solved. Actual RK4 routine:

@inline function RK4(N::Int64,tspan::Tuple{Float64,Float64},dt::Float64,g::Float64,psi::Float64)

#N is a number of equations in the system.
#p is vector of parameters
k=sqrt(g*psi)
p=[psi,g,k]
p=SVector{3,Float64}(p)

#initial conditions
u1=[0.0+0.0*im for i in 1:N]
u1[Int64((N-1)/2)]=1.0+0.0*im
u1[Int64((N-1)/2)+2]=1.0+0.0*im
u0=SVector{N, Complex{Float64}}(u1)

k1=Vector{ComplexF64}(undef,N)
k2=Vector{ComplexF64}(undef,N)
k3=Vector{ComplexF64}(undef,N)
k4=Vector{ComplexF64}(undef,N)

#Number of iterations
iter=Int64(ceil((t-t0)/dt))

path=pwd()
path1="/my path"
rm(path1,force=true,recursive=true)
mkdir(path1)
cd(path1)
io = open(string(path1,"N=$(N).txt"), "w") for i=1:iter k1.=SVector(dt.*Nodes(u0,p)) k2.=SVector(dt.*Nodes(u0.+k1./2,p)) k3.=SVector(dt.*Nodes(u0.+k2./2,p)) k4.=SVector(dt.*Nodes(u0.+k3,p)) u0=SVector{N,Complex{Float64}}(u0 .+ k1 ./6 .+ 2 .* k2 ./6 .+ 2 .* k3 ./ 6 .+k4 ./ 6) writedlm(io, reshape(u0,1,N), '\t') end close(io) cd(path) return "success" end  When I benchmark this for tspan = (0.0, 100.0) I get 4.7 GB allocations! It looks like Nodes function is the main source of allocation, but I am not sure what can I do to improve this. Thank you! What values should we pass to reproduce your results? You are probably hitting a performance problem where too many arguments to + will cause code to start allocate. So instead of writing a+b+c+d+e+f+... you can write it as (a + b + c) + (d + e + f) + .... and it should be better. There might be a macro that does this… 3 Likes Thank you for your reply! N=7, tspan = (0.0, 100), dt = 0.001, g =16.0, psi = 1.0 A quickfix is to use a different operator than +, e.g. ⊕. It will be parsed as a binary operator, and the many arguments to + is avoided. I.e. const ⊕ = + function Nodes(...) A1 = <same as before, but use ⊕ instead of +> ... end  1 Like Could you explain that a little bit more, please? Why does it work? This works because + has nary parsing, ie a+b+c gets parsed as +(a,b,c), but almost everything else would get parsed to +(a,+(b,c). This is unlikely to help, and may even hurt performance or compilation time. Inlining such a long function (which is presumably only called once) is probably not helpful. Also, have you tried profiling your code? I really like using ProfileView.jl for this, although I think Juno has a built-in profiler too. My guess would be that the writedlm call is pretty expensive, but profiling your function will give concrete data. 1 Like So, the problem is that the function has expressions of the form a+b+c+d+e+f+g+h+.... Julia treats + specially, it’s parsed as an n-ary function +(a,b,c,d,e,f,g,h,...). This has some advantages, for instance for macro developers, but also some disadvantages. As Kristoffer points out, when there are too many arguments (I don’t remember the limit, 20?), things are allocated. This happens for every function with many arguments, not just +. The parsing can be seen by dumping such an expression: julia> dump( :(a + b + c + d) ) Expr head: Symbol call args: Array{Any}((5,)) 1: Symbol + 2: Symbol a 3: Symbol b 4: Symbol c 5: Symbol d  This happens before julia looks at what + actually means, it’s just an infix operator which is parsed to an n-ary function. This is a special treatment for +, not given to other operators (other than *). Unless there are too many arguments, the compiler will optimize away the function calls, and just emit a series of add instructions. If we use ⊕ it looks like: dump( :(a ⊕ b ⊕ c ⊕ d) ) Expr head: Symbol call args: Array{Any}((3,)) 1: Symbol ⊕ 2: Expr head: Symbol call args: Array{Any}((3,)) 1: Symbol ⊕ 2: Expr head: Symbol call args: Array{Any}((3,)) 1: Symbol ⊕ 2: Symbol a 3: Symbol b 3: Symbol c 3: Symbol d  I.e., a series of binary calls ⊕(⊕(⊕(a,b),c),d). This happens before the actual meaning of ⊕ is used by Julia, even if you have actually set ⊕ to mean the same thing as + with the assignment ⊕ = +. Each ⊕ is now a binary call which will easily be optimized away to yield add instructions. You can see the functions used by Julia by giving the commands: @edit 1 + 2 and @edit 1 + 2 + 3 + 4 + 5 and the actual instruction sequence it compiles to with: @code_native 1 + 2 + 3 + 4 + 5 5 Likes Example: using StaticArrays, BenchmarkTools function nodes(u,p::SVector{3,Float64}) A1=((9*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im A2=((4*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im A3=((1*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im A4=((-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im A5=((1*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im A6=((4*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im A7=((9*((p)^2/2)-p/2*abs2(p))*u-p/2*(p)^2*conj(u) - (p/2)*(2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u+2*p*u*conj(u)+conj(p)*u*u + u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u+u*conj(u)*u))/im @SVector [A1, A2, A3, A4, A5, A6, A7] end const ⊕ = Base.FastMath.add_fast # fast versions for fma const ⊖ = Base.FastMath.sub_fast const ⊗ = Base.FastMath.mul_fast const ⨸ = Base.FastMath.div_fast function nodes2(u,p::SVector{3,Float64}) @inbounds begin A1=((9⊗((p)^2⨸2)⊖p⨸2⊗abs2(p))⊗u⊖p⨸2⊗(p)^2⊗conj(u) ⊖ (p⨸2)⊗(2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u ⊕ u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u))⨸im A2=((4⊗((p)^2⨸2)⊖p⨸2⊗abs2(p))⊗u⊖p⨸2⊗(p)^2⊗conj(u) ⊖ (p⨸2)⊗(2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u ⊕ u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u))⨸im A3=((1⊗((p)^2⨸2)⊖p⨸2⊗abs2(p))⊗u⊖p⨸2⊗(p)^2⊗conj(u) ⊖ (p⨸2)⊗(2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u ⊕ u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u))⨸im A4=((-p⨸2⊗abs2(p))⊗u⊖p⨸2⊗(p)^2⊗conj(u) ⊖ (p⨸2)⊗(2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u ⊕ u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u))⨸im A5=((1⊗((p)^2⨸2)⊖p⨸2⊗abs2(p))⊗u⊖p⨸2⊗(p)^2⊗conj(u) ⊖ (p⨸2)⊗(2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u ⊕ u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u))⨸im A6=((4⊗((p)^2⨸2)⊖p⨸2⊗abs2(p))⊗u⊖p⨸2⊗(p)^2⊗conj(u) ⊖ (p⨸2)⊗(2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u ⊕ u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u))⨸im A7=((9⊗((p)^2⨸2)⊖p⨸2⊗abs2(p))⊗u⊖p⨸2⊗(p)^2⊗conj(u) ⊖ (p⨸2)⊗(2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u⊕2⊗p⊗u⊗conj(u)⊕conj(p)⊗u⊗u ⊕ u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u⊕u⊗conj(u)⊗u))⨸im end @SVector [A1, A2, A3, A4, A5, A6, A7] end u = randn(7) .+ im .* randn(7); p = @SVector randn(3); @btime nodes($u, $p) @btime nodes2($u, $p)  This yields: julia> @btime nodes($u, $p) 6.517 μs (312 allocations: 14.14 KiB) 7-element SArray{Tuple{7},Complex{Float64},1,7} with indices SOneTo(7): 0.35349123778295183 - 0.4448327960679225im 0.22931277803291694 - 0.17143777867971866im -0.07926781000844348 + 1.2377712997040806im 0.9817810924131379 - 1.1147273003176812im -1.0868977160196176 + 1.271962169411409im -0.0025775186771048505 + 1.0260457172755868im -0.8600179600224735 - 1.7758789440218923im julia> @btime nodes2($u, \$p)
475.255 ns (0 allocations: 0 bytes)
7-element SArray{Tuple{7},Complex{Float64},1,7} with indices SOneTo(7):
0.3534912377829519 - 0.4448327960679225im
0.22931277803291686 - 0.17143777867971874im
-0.07926781000844349 + 1.2377712997040804im
0.9817810924131378 - 1.1147273003176814im
-1.0868977160196178 + 1.2719621694114087im
-0.0025775186771047377 + 1.026045717275587im
-0.8600179600224735 - 1.7758789440218923im


About 14x faster (and 0 allocations, as folk promised).

The SLP vectorizer failed; this function doesn’t take advantage of SIMD at all.
Odds are you could rewrite it in a friendlier/more obvious manner to the compiler, but this code would have to be really important to you to be worth spending the time to untangle this mess.

3 Likes

Btw, the parsing of + as an n-ary operator is an open issue with the Julia developers, marked as a Julia 2.0 milestone.

https://github.com/JuliaLang/julia/issues/34999

2 Likes

Thanks for this Elrod!

May I ask you what do you mean by SIMD and SLP vectorizer? These equations are generated by the program so some optimizations can be done relatively easily, providing they are worth the effort.

Thank you!

Newer CPUs has what’s called simd-instructions (single instruction multiple data, or avx), they can do multiple arithmetic operations with a single instruction. Like performing a+b, c+d, e+f, and g+h with a single instruction, with the variables loaded into special vector registers. It’s not always obvious to the compiler that this can be done, it requires some analysis. Julia uses LLVM as an intermediate language (i.e. julia code is first translated to llvm, which then compiles to native instructions), and llvm has an optimizer SLP which can emit simd-instructions.

Julia has a macro @simd for doing this more explicitly with loops, and the package LoopVectorization has a macro @avx which does an even better job, but this one requires quite tidy loops.

Specifically, the SLP (super-word level parallelism) vectorizer does not require loops.
For example:

julia> foo((a,b,c,d)) = (a+1,b+2,c+3,d+4)
foo (generic function with 2 methods)

julia> @code_llvm debuginfo=:none foo((1.,2.,3.,4.))


yields

define void @julia_foo_1610([4 x double]* noalias nocapture sret, [4 x double]* nocapture nonnull readonly dereferenceable(32)) {
top:
%2 = getelementptr inbounds [4 x double], [4 x double]* %1, i64 0, i64 0
%3 = bitcast double* %2 to <4 x double>*
%4 = load <4 x double>, <4 x double>* %3, align 8
%5 = fadd <4 x double> %4, <double 1.000000e+00, double 2.000000e+00, double 3.000000e+00, double 4.000000e+00>
%.sroa.0.0..sroa_idx = getelementptr inbounds [4 x double], [4 x double]* %0, i64 0, i64 0
%6 = bitcast double* %.sroa.0.0..sroa_idx to <4 x double>*
store <4 x double> %5, <4 x double>* %6, align 8
ret void
}


Notice how it loads and then fadds <4 x double>s in one-go, rather than individually.
If you can arrange operations in a way so that it’s easy to see 2, 4, or 8x identical operations to apply at a time, that’ll help.
What I mean is that we can see this lined up vertically:

(p)^2*conj(u)
(p)^2*conj(u)
(p)^2*conj(u)
(p)^2*conj(u)
(p)^2*conj(u)
(p)^2*conj(u)
(p)^2*conj(u)


but each of these will actually be placed very far away from one another when your expression is lowered into simple code with one operation per line, meaning this wouldn’t be obvious to the compiler unless it spends an eternity searching through incredible distances for all possibilities.
If you can rearrange things to place like-sequences next to each other, that would probably help both with this, as well as things like out of order execution (a CPU core can work on multiple operations in parallel, but only if they don’t depend on each other; mixing calculations for each of the A* should help, since the separate A* don’t depend on each other, but the operations within for a given A* do).

Can your software also perform CSE (common sub-expression elimination)?
I’m guessing the compiler can do that despite the distance, but I don’t know. Placing like operations together (and turning off bounds checks) will help with that, too.

1 Like