The code below evaluates the standard complex spherical harmonics for a batch of inputs, given in terms of cos, sin, and the associated Legendre functions (here already precomputed; in the “real” scenario they would be computed first). I assumed that the inner-most loop over the batch of inputs could be made very fast, but this turns out not to be the case:
(1) If I use LoopVectorization.jl and annotate the for i loops with @avx, I get
┌ Warning: #= /Users/ortner/gits/Polynomials4ML.jl/profile/temp.jl:37 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ Main ~/.julia/packages/LoopVectorization/k62bU/src/condense_loopset.jl:985
Why is that? Is it related to the complex product? If I rewrite the kernel in terms of real operations and real arrays only, then @avx can indeed be used, but it improves the performance only marginally (a sketch of the real-split kernel I have in mind is at the end of the post).
(2) Profiling the code suggests that about 1/2 to 2/3 of the time is spent in iterate, i.e. in checking whether the i-loop has finished (see screenshot). What is the explanation, and can it be prevented? It seems this should not be the bottleneck? (A micro-benchmark that isolates just the inner loop is included after the main @btime call below.)
Thank you.
Code:
using BenchmarkTools   # @profview below requires ProfileView.jl or the Julia VS Code extension
sizeY(maxL) = (maxL + 1) * (maxL + 1)          # number of (l, m) pairs with 0 ≤ l ≤ maxL, |m| ≤ l
index_y(l, m) = m + l + (l*l) + 1              # column of Y holding Y_l^m
sizeP(maxL) = div((maxL + 1) * (maxL + 2), 2)  # number of (l, m) pairs with 0 ≤ m ≤ l ≤ maxL
index_p(l, m) = m + div(l*(l+1), 2) + 1        # column of P holding P_l^m
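# sanity check: the index helpers enumerate the columns 1:sizeP(maxL) and
# 1:sizeY(maxL) in the (l, m) storage order used by _cYlm! below
@assert [index_p(l, m) for l = 0:3 for m = 0:l]  == 1:sizeP(3)
@assert [index_y(l, m) for l = 0:3 for m = -l:l] == 1:sizeY(3)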
function _cYlm!(Y, L, cosθ, sinθ, P, t)
	nS = length(cosθ)
	@assert size(P) == (nS, sizeP(L))
	@assert size(Y) == (nS, sizeY(L))
	@assert length(sinθ) == length(t) == nS
	fill!(t, 1 / sqrt(2) + im * 0)
	@inbounds begin
		# m = 0 columns: Y[:, index_y(l, 0)] = P[:, index_p(l, 0)] / sqrt(2)
		for l = 0:L
			i_yl0 = index_y(l, 0)
			i_pl0 = index_p(l, 0)
			for i = 1:nS
				Y[i, i_yl0] = P[i, i_pl0] * t[i]
			end
		end
		sig = 1
		for m in 1:L
			sig *= -1
			# t[i] accumulates (cosθ[i] + im * sinθ[i])^m / sqrt(2)
			for i = 1:nS
				t[i] *= cosθ[i] + im * sinθ[i]
			end
			for l in m:L
				i_plm = index_p(l, m)
				i_ylm⁺ = index_y(l,  m)
				i_ylm⁻ = index_y(l, -m)
				# innermost loop over the batch (this is the loop in question)
				for i = 1:nS
					p = P[i, i_plm]
					Y[i, i_ylm⁻] = (sig * p) * conj(t[i])
					Y[i, i_ylm⁺] = t[i] * p
				end
			end
		end
	end
	return Y
end
##
L = 8
nX = 1024 
θ = [ π * rand() for _ = 1:nX ]   # input angles
cosθ = cos.(θ); sinθ = sin.(θ)    # cos/sin of the same angle, so cosθ[i] + im*sinθ[i] lies on the unit circle
t = zeros(ComplexF64, nX)   # temporary array 
P = rand(nX, sizeP(L)) # in the real code, these are the ALPs
Y = zeros(ComplexF64, nX, sizeY(L))   # allocate output
##
@btime _cYlm!($Y, $L, $cosθ, $sinθ, $P, $t)
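##
# added for illustration: a micro-benchmark that isolates the innermost i-loop,
# to see how fast that loop is on its own, without the surrounding (l, m)
# bookkeeping; the (l, m) = (2, 1) column indices below are arbitrary but valid
function _inner_kernel!(Y, P, t, i_plm, i_ylm⁺, i_ylm⁻, sig)
   @inbounds for i = 1:length(t)
      p = P[i, i_plm]
      Y[i, i_ylm⁻] = (sig * p) * conj(t[i])
      Y[i, i_ylm⁺] = t[i] * p
   end
   return Y
end
@btime _inner_kernel!($Y, $P, $t, $(index_p(2, 1)), $(index_y(2, 1)), $(index_y(2, -1)), -1)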
## 
@profview let Y = Y, L = L, cosθ = cosθ, sinθ = sinθ, P = P, t = t
   for n = 1:50_000 
      _cYlm!(Y, L, cosθ, sinθ, P, t)
   end
end
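For reference, here is roughly what I mean in (1) by rewriting the inner kernel in terms of real operations and real arrays. This is only a sketch of the hot loop for a single (l, m) block; the arrays Yre, Yim, tre, tim are hypothetical real/imaginary splits of Y and t and do not appear in the code above.
using LoopVectorization

# sketch only: real/imaginary split of the hot loop for one (l, m) block;
# Yre, Yim, tre, tim are hypothetical split versions of Y and t
function _kernel_real!(Yre, Yim, P, tre, tim, i_plm, i_ylm⁺, i_ylm⁻, sig)
   @turbo for i = 1:size(P, 1)
      p = P[i, i_plm]
      # Y[i, i_ylm⁺] = t[i] * p
      Yre[i, i_ylm⁺] = tre[i] * p
      Yim[i, i_ylm⁺] = tim[i] * p
      # Y[i, i_ylm⁻] = (sig * p) * conj(t[i])
      Yre[i, i_ylm⁻] =  sig * p * tre[i]
      Yim[i, i_ylm⁻] = -sig * p * tim[i]
   end
   return nothing
end

# timing it with split copies of the arrays from above:
Yre = zeros(nX, sizeY(L)); Yim = zeros(nX, sizeY(L))
tre = real.(t); tim = imag.(t)
@btime _kernel_real!($Yre, $Yim, $P, $tre, $tim, $(index_p(2, 1)), $(index_y(2, 1)), $(index_y(2, -1)), -1)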
            