The code below evaluates the standard complex spherical harmonics for a batch of inputs given in terms of cos, sin and the associated legendre functions (which are already precomputed - in the “real” scenario they would be computed first). I assumed that the inner-most loop over the batch of inputs could be made very fast, but not so:

(1) if I use `LoopVectorizations`

and annotate the `for i`

loops with `@avx`

I get

```
┌ Warning: #= /Users/ortner/gits/Polynomials4ML.jl/profile/temp.jl:37 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ Main ~/.julia/packages/LoopVectorization/k62bU/src/condense_loopset.jl:985
```

Why is that? Is that related to the complex product? It seems that if I rewrite it in terms of real operations and real arrays only then `@avx`

can indeed be used by improves the performance only marginally.

(2) Profiling the code suggests that abou 1/2 to 2/3 of the time is spent in `iterate`

, checking where the `i`

-loop has finished (see screenshot). What is the explanation and can it be prevented? It seems this should not be the bottleneck?

Thank you.

Code:

```
using BenchmarkTools
sizeY(maxL) = (maxL + 1) * (maxL + 1)
index_y(l, m) = m + l + (l*l) + 1
sizeP(maxL) = div((maxL + 1) * (maxL + 2), 2)
index_p(l, m) = m + div(l*(l+1), 2) + 1
function _cYlm!(Y, L, cosθ, sinθ, P, t)
nS = length(cosθ)
@assert size(P) == (nS, sizeP(L))
@assert size(Y) == (nS, sizeY(L))
@assert length(sinθ) == length(t) == nS
fill!(t, 1 / sqrt(2) + im * 0)
@inbounds begin
for l = 0:L
i_yl0 = index_y(l, 0)
i_pl0 = index_p(l, 0)
for i = 1:nS
Y[i, i_yl0] = P[i, i_pl0] * t[i]
end
end
sig = 1
for m in 1:L
sig *= -1
for i = 1:nS
t[i] *= cosθ[i] + im * sinθ[i]
end
for l in m:L
i_plm = index_p(l,m)
i_ylm⁺ = index_y(l, m)
i_ylm⁻ = index_y(l, -m)
for i = 1:nS
p = P[i, i_plm]
Y[i, i_ylm⁻] = (sig * p) * conj(t[i])
Y[i, i_ylm⁺] = t[i] * p
end
end
end
end
return Y
end
##
L = 8
nX = 1024
cosθ = [ cos(π * rand()) for _=1:nX ] # input
sinθ = [ sin(π * rand()) for _=1:nX ]
t = zeros(ComplexF64, nX) # temporary array
P = rand(nX, sizeP(L)) # in the real code, these are the ALPs
Y = zeros(ComplexF64, nX, sizeY(L)) # allocate output
##
@btime _cYlm!($Y, $L, $cosθ, $sinθ, $P, $t)
##
@profview let Y= Y, L=L, cosθ=cosθ, sinθ=sinθ, P=P, t =t
for n = 1:50_000
_cYlm!(Y, L, cosθ, sinθ, P, t)
end
end
```